UBC: Deep Learning & Natural Language Processing

NLP for Africa

Introducing Cheetah

Natural Language Generation for 517 African Languages

We're proud to introduce Cheetah, a massively multilingual natural language generation (NLG) model that supports 517 African languages and dialects across 50 countries. Developed as part of our ACL 2024 work, Cheetah is designed to promote linguistic diversity, support practical applications, and close the NLG gap for underrepresented African languages.

Training Data & Model

Cheetah is trained on a 42 GB curated corpus drawn from diverse domains, including news, health, government, religious, and social media texts. The languages are written in five scripts and span 14 language families. The model uses an mT5-style encoder-decoder architecture with ~580M parameters and was pretrained on a TPU v3-128 with a batch size of 1024.
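
As a rough illustration of what ~580M parameters means in practice, the sketch below estimates the memory needed just to hold the weights at a few common numeric precisions. The helper name `weight_memory_gib` and the precision table are assumptions made for illustration and are not part of the Cheetah release; the estimate ignores activations, optimizer state, and framework overhead.

```python
# Back-of-the-envelope estimate of weight memory for a ~580M-parameter model.
# The parameter count is approximate (taken from the model description);
# the helper and the precision table are illustrative assumptions.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float32") -> float:
    """Return the GiB needed to store `num_params` weights in `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    approx_params = 580_000_000  # "~580M" from the description above
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(approx_params, dtype):.2f} GiB")
```

Actual memory use during inference or fine-tuning will be higher than these figures, so treat them as a lower bound when choosing hardware.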

Try it on Hugging Face: UBC-NLP/cheetah-base

Evaluation: AfroNLG Benchmark

We evaluate Cheetah on AfroNLG, a benchmark of 67 test sets spanning tasks such as machine translation, summarization, question answering, paraphrasing, and cloze completion. Cheetah outperforms the other models we compare against in 5 of 7 tasks, demonstrating strong generalization and fluency across African languages.

Example Usage

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the pretrained encoder-decoder checkpoint from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-base")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-base")

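# Yoruba prompt; <extra_id_0> is the T5-style sentinel marking the span for the model to fill in
yor_prompt = "ìròyìn kan nípa owó ìjọba <extra_id_0> kan"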
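# Tokenize the prompt and generate; max_new_tokens caps the output length explicitly
input_ids = tokenizer(yor_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=20)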

print("Tokenized input:", tokenizer.tokenize(yor_prompt))
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
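
T5-style models are typically prompted for cloze completion with sentinel tokens such as `<extra_id_0>`. Assuming the Cheetah tokenizer follows that convention (it is described as mT5-style, but this is an assumption here), the sketch below shows one way to build such prompts from a plain sentence. `mask_words` is a hypothetical helper, not part of `transformers` or the Cheetah release, and it splits on whitespace only.

```python
# Sketch: build a T5-style cloze prompt by replacing chosen words with sentinels.
# Assumption: the checkpoint uses the T5/mT5 sentinel convention
# (<extra_id_0>, <extra_id_1>, ...). `mask_words` is a hypothetical helper.


def mask_words(sentence: str, positions: list[int]) -> str:
    """Replace the whitespace-separated words at `positions` with sentinels.

    Runs of adjacent masked words collapse into a single sentinel, and
    sentinels are numbered from left to right.
    """
    words = sentence.split()
    masked = set(positions)
    for pos in masked:
        if not 0 <= pos < len(words):
            raise IndexError(f"position {pos} is outside the sentence")

    out = []
    sentinel_id = 0
    prev_masked = False
    for i, word in enumerate(words):
        if i in masked:
            if not prev_masked:  # start of a new masked span
                out.append(f"<extra_id_{sentinel_id}>")
                sentinel_id += 1
            prev_masked = True
        else:
            out.append(word)
            prev_masked = False
    return " ".join(out)


if __name__ == "__main__":
    print(mask_words("one two three four five", [1, 3]))
```

The resulting string can be passed to the tokenizer in place of `yor_prompt` in the example above; whether a given prompt yields a sensible completion depends on the language and the checkpoint.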

Ethics & Broader Impact

Cheetah aligns with Afrocentric NLP values, prioritizing inclusion and language preservation. It supports indigenous communication, helps address technological marginalization, and creates new opportunities for education, research, and language revitalization across Africa.

Citation

If you use Cheetah in your research, please cite:

@inproceedings{adebara-etal-2024-cheetah,
  title = "Cheetah: Natural Language Generation for 517 African Languages",
  author = "Adebara, Ife and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2024",
  address = "Bangkok, Thailand and virtual meeting",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.acl-long.691",
  pages = "12798--12823"
}

@inproceedings{elmadany2024toucan,
  title={Toucan: Many-to-Many Translation for 150 African Language Pairs},
  author={Elmadany, Abdelrahim and Adebara, Ife and Abdul-Mageed, Muhammad},
  booktitle={Findings of the Association for Computational Linguistics ACL 2024},
  pages={13189--13206},
  year={2024}
}