SimbaBench: State-of-the-Art African Speech Models & LLM Benchmark
Voice of a Continent: SimbaBench is an open-source project by the UBC NLP group dedicated to advancing artificial intelligence for African languages. We provide a unified suite for benchmarking African speech technology and large language models (LLMs).
Open Source Speech Models and Datasets
- Simba-ASR: High-performance Automatic Speech Recognition models for diverse African accents and dialects.
- Simba-TTS: Natural-sounding Text-to-Speech models for low-resource languages.
- Simba-SLID: Spoken Language Identification tools capable of distinguishing between 61 languages in real time.
- The Dataset: A corpus of 8,605 hours of curated audio data, fully compatible with Hugging Face.
Supported Languages
Our benchmark covers 61 languages across major language families (Niger-Congo, Afro-Asiatic, Nilo-Saharan), including high-demand languages such as:
Swahili (Kiswahili), Yoruba, Amharic, Hausa, Igbo, Zulu (isiZulu), Oromo, Somali, Twi, Wolof, and Lingala.
Academic Resources
Published at EMNLP 2025. All resources, code, and model weights are freely available on Hugging Face and GitHub to support the global AI research community.