Massively Multilingual Language Models for Africa
We are thrilled to introduce SERENGETI, a suite of massively multilingual pretrained language models (mPLMs) built to support African languages and advance NLP research on the continent. Covering 517 African languages and language varieties, SERENGETI delivers state-of-the-art performance on the majority of its evaluation datasets and opens the door to more inclusive and representative AI.
Training Data: 42GB of multi-domain, multi-script text drawn from religious, news, government, and health documents, as well as existing corpora. Scripts include Arabic, Coptic, Ethiopic, Latin, and Vai.
Architecture: SERENGETI comprises two ELECTRA-style models (SERENGETI-E110 and SERENGETI-E250) and a model based on the XLM-R base architecture. All models have 12 layers and 12 attention heads.
Model Links: the pretrained checkpoints are available through the official repo at https://github.com/UBC-NLP/serengeti and on the Hugging Face Hub (e.g., https://huggingface.co/UBC-NLP/serengeti).
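As a brief, hedged sketch, the checkpoints can be loaded with the transformers AutoModel API. Only the UBC-NLP/serengeti ID is confirmed by the usage example further down; the E110/E250 repo names below are assumptions inferred from the model names, so check the official repo for the exact IDs:

from transformers import AutoModel

# Main SERENGETI model (XLM-R base architecture); replace "XXX" with your access token.
serengeti = AutoModel.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")

# ELECTRA-style variants. NOTE: these Hub repo names are assumptions based on the
# model names above, not confirmed checkpoint IDs.
serengeti_e110 = AutoModel.from_pretrained("UBC-NLP/serengeti-E110", use_auth_token="XXX")
serengeti_e250 = AutoModel.from_pretrained("UBC-NLP/serengeti-E250", use_auth_token="XXX")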
SERENGETI was benchmarked on 8 tasks across 20 datasets, covering 32 African languages. It achieves state-of-the-art performance on 11 of the 20 datasets and demonstrates strong generalization in zero-shot settings (a minimal fine-tuning sketch follows the usage example below).
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and masked-LM model; replace "XXX" with your Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")

# Fill-mask pipeline; the input must contain the tokenizer's mask token.
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi")  # Yoruba
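The pipeline returns the top candidate fills for the masked position, each with a score, the predicted token, and the completed sequence. To apply SERENGETI to downstream tasks like the benchmarks above, the encoder can be fine-tuned with a classification head. The sketch below is a minimal, hedged illustration using the Hugging Face Trainer; the tiny in-memory dataset and label count are placeholders, not the paper's benchmark setup:

from datasets import Dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Placeholder two-example dataset; swap in a real labeled corpus.
data = Dataset.from_dict({"text": ["ẹ jọwọ", "o dabọ"], "label": [0, 1]})

# Reuses the tokenizer loaded above; fixed-length padding keeps batches uniform.
data = data.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=32),
    batched=True,
)

# Attach a 2-label classification head to the pretrained encoder.
clf = AutoModelForSequenceClassification.from_pretrained(
    "UBC-NLP/serengeti", num_labels=2, use_auth_token="XXX"
)

args = TrainingArguments(output_dir="serengeti-clf", num_train_epochs=3, per_device_train_batch_size=8)
Trainer(model=clf, args=args, train_dataset=data).train()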
SERENGETI supports 517 African languages and language varieties. See the full list in the official repo.
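Because the vocabulary was trained on multiple scripts, the same tokenizer handles, for example, Ethiopic- and Latin-script input. A small illustration, reusing the tokenizer loaded above (the Amharic word is just sample text):

# Multi-script tokenization: Amharic (Ethiopic script) and Yoruba (Latin script).
print(tokenizer.tokenize("ሰላም"))    # Amharic: "hello/peace"
print(tokenizer.tokenize("ẹ jọwọ"))  # Yoruba: "please"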
SERENGETI aligns with Afrocentric NLP by prioritizing the technological needs of African communities. It promotes language preservation and inclusive access to NLP tools. Native speakers contributed to dataset validation, ensuring linguistic quality and reducing bias.
If you use SERENGETI in your research, please cite:
@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife and
      Elmadany, AbdelRahim and
      Abdul-Mageed, Muhammad and
      Alcoba Inciarte, Alcides",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
}