Natural Language Generation for 517 African Languages
We're proud to introduce Cheetah, a massively multilingual NLG language model for African languages, supporting 517 languages and dialects across 50 countries. Developed as part of our ACL 2024 work, Cheetah is designed to promote linguistic diversity, support practical applications, and close the NLG gap for underrepresented African languages.
Cheetah is trained on a 42GB curated corpus from diverse domains — including news, health, government, religious, and social media texts. Languages are written in five scripts and span 14 language families. The model uses an MT5-style encoder-decoder with ~580M parameters, pretrained on TPUv3-128 with a batch size of 1024.
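Since Cheetah follows the mT5 recipe, its pretraining presumably uses a span-corruption objective: random spans of input tokens are replaced with sentinel tokens, and the decoder learns to reconstruct the hidden spans. The helper below is a simplified, hypothetical sketch of that masking scheme (fixed span length, one seed, not the actual training code):

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Replace random spans of tokens with sentinel markers (T5-style sketch).

    Returns (corrupted_input, target): the input keeps unmasked tokens plus
    sentinels; the target lists each sentinel followed by the tokens it hides.
    Simplified for illustration -- real T5 samples span lengths stochastically.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * noise_density))
    n_spans = max(1, n_mask // mean_span_len)
    starts = sorted(rng.sample(range(len(tokens) - mean_span_len), n_spans))
    corrupted, target = [], []
    i, sid = 0, 0
    for s in starts:
        if s < i:  # skip overlapping spans
            continue
        corrupted.extend(tokens[i:s])
        sentinel = f"<extra_id_{sid}>"
        corrupted.append(sentinel)          # sentinel stands in for the span
        target.append(sentinel)
        target.extend(tokens[s:s + mean_span_len])  # decoder must recover these
        i = s + mean_span_len
        sid += 1
    corrupted.extend(tokens[i:])
    return corrupted, target

# Example with a Yoruba-like token sequence
inp, tgt = span_corrupt("ìròyìn kan nípa owó ìjọba kan ní Nàìjíríà tuntun".split())
```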
Try it on Hugging Face: UBC-NLP/cheetah-base
We evaluate Cheetah on AfroNLG, a benchmark of 67 test sets spanning tasks such as machine translation, summarization, question answering, paraphrasing, and cloze completion. Cheetah outperforms comparable multilingual models on 5 of the 7 tasks, demonstrating strong generalization and fluency across African languages.
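Generation tasks like machine translation and summarization are commonly scored with n-gram overlap metrics such as BLEU. In practice one would use an established library (e.g. sacrebleu) rather than the toy scorer below, which is only a hypothetical sketch of the idea, not AfroNLG's actual evaluation code:

```python
import math
from collections import Counter

def toy_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Illustrative only, not the benchmark's scorer."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())   # clipped matches
        total = max(1, sum(c_ngrams.values()))
        precisions.append(max(overlap, 1e-9) / total)   # smooth zero counts
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0; shorter or partially overlapping candidates score lower.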
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load the Cheetah tokenizer and model from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-base")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-base")

# Yoruba prompt (roughly: "a news report about government money")
yor_prompt = "ìròyìn kan nípa owó ìjọba kan"
input_ids = tokenizer(yor_prompt, return_tensors="pt").input_ids

# Inspect the subword segmentation of the prompt
print("Tokenized input:", tokenizer.tokenize(yor_prompt))

# Generate a continuation and decode it back to text
outputs = model.generate(input_ids)
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
Cheetah aligns with Afrocentric NLP values, prioritizing inclusion and language preservation. It supports indigenous communication, helps address technological marginalization, and creates new opportunities for education, research, and language revitalization across Africa.
If you use Cheetah in your research, please cite:
@inproceedings{adebara-etal-2024-cheetah,
title = "Cheetah: Natural Language Generation for 517 African Languages",
author = "Adebara, Ife and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.691",
pages = "12798--12823"}
@inproceedings{elmadany-etal-2024-toucan,
title = "Toucan: Many-to-Many Translation for 150 African Language Pairs",
author = "Elmadany, AbdelRahim and Adebara, Ife and Abdul-Mageed, Muhammad",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
year = "2024",
publisher = "Association for Computational Linguistics",
pages = "13189--13206"}