UBC: Deep Learning & Natural Language Processing

NLP for Africa

AfroLingu-MT: A New Benchmark for African Machine Translation

Part of the ACL 2024 Toucan Release: Empowering Low-Resource NLP for Africa

We're thrilled to announce the release of AfroLingu-MT - a powerful new benchmark for African machine translation, developed as part of the Toucan Project and presented at ACL 2024. This benchmark represents a bold step forward in building inclusive and equitable language technology for the African continent.

AfroLingu-MT brings together 43 diverse datasets covering 84 language pairs and 46 African languages across 29 countries. From Swahili and Yoruba to Hausa, Wolof, Amharic, and more, it captures the linguistic richness and cultural vibrancy of Africa. Widely spoken bridge languages like Arabic, English, and French are also included to enable broad multilingual research.

A standout feature of AfroLingu-MT is the inclusion of a manually translated test set focused on the government domain. This enables realistic and high-impact evaluations for systems designed to serve citizens, governments, and organizations operating in African languages.

AfroLingu-MT is the foundation for evaluating Toucan, our many-to-many machine translation model built on the Cheetah language models. Toucan supports 156 translation directions and significantly outperforms other models, including Meta's NLLB-200, by 6.96 points on the spBLEU1K metric.

Evaluation relies on multilingual metrics that are well suited to under-resourced and morphologically rich languages:

  • spBLEU1K - a SentencePiece BLEU score supporting over 1,000 languages, including 614 African languages
  • ChrF++ - a character n-gram F-score, extended with word n-grams, that is robust to inflectional and agglutinative variation
  • AfriCOMET - a semantic evaluation metric tailored for African language translation
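
To make the idea behind a character-based F-score concrete, here is a minimal, simplified sketch in pure Python. It is not the official ChrF++ implementation and not the evaluation code used for AfroLingu-MT: the function names (`char_ngrams`, `word_ngrams`, `simple_chrf`) are hypothetical, the scoring is sentence-level only, and details such as whitespace and punctuation handling are assumptions of this sketch. It averages n-gram precision and recall over character orders 1 to 6 and word orders 1 to 2, then combines them with a recall-weighted F-score (beta = 2). The Swahili strings are illustrative only.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    # Assumption of this sketch: spaces are dropped before counting character n-grams.
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    # Words are obtained by plain whitespace splitting (no punctuation handling).
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                char_order: int = 6, word_order: int = 2,
                beta: float = 2.0) -> float:
    """Simplified sentence-level character/word n-gram F-score on a 0-100 scale."""
    extractors = [(char_ngrams, n) for n in range(1, char_order + 1)]
    extractors += [(word_ngrams, n) for n in range(1, word_order + 1)]

    precisions, recalls = [], []
    for extract, n in extractors:
        hyp = extract(hypothesis, n)
        ref = extract(reference, n)
        if not hyp or not ref:
            continue  # skip orders that one side is too short to produce
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))

    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)


# A hypothesis that differs only in a noun-class prefix still shares most
# character n-grams with the reference, so it gets partial credit.
near = simple_chrf("ninasoma kitabu", "ninasoma vitabu")
far = simple_chrf("xyz", "ninasoma vitabu")
print(near > far)
```

Because matches are counted below the word level, a translation with the right stem but a different affix is not scored as a complete miss, which is the property that makes this family of metrics attractive for morphologically rich languages. For reported results, use an established implementation rather than this sketch.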

We believe AfroLingu-MT will serve as a catalyst for inclusive NLP innovation and empower researchers and developers to build real-world MT systems that reflect and support Africa's linguistic diversity.

Explore the dataset: https://huggingface.co/datasets/UBC-NLP/AfroLingu-MT

Citations

If you use AfroLingu-MT in your research, please cite:

@inproceedings{elmadany2024toucan,
  title={Toucan: Many-to-Many Translation for 150 African Language Pairs},
  author={Elmadany, Abdelrahim and Adebara, Ife and Abdul-Mageed, Muhammad},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  pages={13189--13206},
  year={2024}
}