
UBC: Deep Learning &
Natural Language Processing

NLP for Africa


Introducing SERENGETI

Massively Multilingual Language Models for Africa

We are thrilled to introduce SERENGETI, a groundbreaking suite of massively multilingual pretrained language models (mPLMs) created to empower African languages and revolutionize NLP research on the continent. With coverage of 517 African languages, SERENGETI delivers state-of-the-art performance and opens the door to more inclusive and representative AI.

Our Language Models

Training Data: 42GB of multi-domain, multi-script text drawn from religious, news, government, and health documents, as well as existing corpora. Scripts include Arabic, Coptic, Ethiopic, Latin, and Vai.

Architecture: SERENGETI includes Electra-style models (E110 and E250) and an XLM-R base model. All models have 12 layers and 12 attention heads.
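For reference, the topology above (12 layers, 12 attention heads) maps directly onto Hugging Face config objects. A minimal sketch, assuming the transformers library; all other hyperparameters are left at library defaults here and may differ from the released checkpoints:

```python
from transformers import ElectraConfig, XLMRobertaConfig

# Both model families share the reported topology: 12 layers, 12 heads.
electra_cfg = ElectraConfig(num_hidden_layers=12, num_attention_heads=12)
xlmr_cfg = XLMRobertaConfig(num_hidden_layers=12, num_attention_heads=12)
```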

AfroNLU Benchmark and Evaluation

SERENGETI was benchmarked on 8 tasks and 20 datasets, covering 32 African languages:

  • Named Entity Recognition
  • Phrase Chunking
  • Part of Speech Tagging
  • News Classification
  • Sentiment Analysis
  • Topic Classification
  • Question Answering
  • Language Identification

It achieves state-of-the-art performance on 11 datasets and demonstrates strong generalization in zero-shot settings.
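Results on classification-style tasks like these are commonly summarized with macro-averaged F1, which weights every class equally regardless of frequency. A minimal, dependency-free sketch of that metric (the function name and label format are illustrative assumptions, not part of the AfroNLU tooling):

```python
from collections import Counter

def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per class, then average with equal class weight."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p predicted but wrong
            fn[g] += 1          # g missed
    f1_scores = []
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1_scores) / len(f1_scores)
```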

How to Use SERENGETI

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Replace "XXX" with your Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi")  # Yoruba

Supported Languages

SERENGETI supports 517 African languages and language varieties. See the full list in the official repo.

Ethics

SERENGETI aligns with Afrocentric NLP by prioritizing the technological needs of African communities. It promotes language preservation and inclusive access to NLP tools. Native speakers contributed to dataset validation, ensuring linguistic quality and reducing bias.

Citation

If you use SERENGETI in your research, please cite:

@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad  and
      Alcoba Inciarte, Alcides",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
}