
UBC: Deep Learning &
Natural Language Processing

NLP for Africa


Introducing SERENGETI

Massively Multilingual Language Models for Africa

We are thrilled to introduce SERENGETI, a groundbreaking suite of massively multilingual pretrained language models (mPLMs) created to empower African languages and revolutionize NLP research on the continent. With coverage of 517 African languages, SERENGETI delivers state-of-the-art performance and opens the door to more inclusive and representative AI.

Our Language Models

Training Data: 42GB of multi-domain, multi-script text drawn from religious, news, government, and health documents, as well as existing corpora. Scripts include Arabic, Coptic, Ethiopic, Latin, and Vai.

Architecture: SERENGETI includes Electra-style models (E110 and E250) and an XLM-R base model. All models have 12 layers and 12 attention heads.
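For reference, the topology above (12 layers, 12 attention heads) maps directly onto Hugging Face config objects. A minimal sketch, assuming the transformers library; all other hyperparameters are left at library defaults here and may differ from the released checkpoints:

```python
from transformers import ElectraConfig, XLMRobertaConfig

# Both model families share the reported topology: 12 layers, 12 heads.
electra_cfg = ElectraConfig(num_hidden_layers=12, num_attention_heads=12)
xlmr_cfg = XLMRobertaConfig(num_hidden_layers=12, num_attention_heads=12)
```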

AfroNLU Benchmark and Evaluation

SERENGETI was benchmarked on 8 tasks and 20 datasets, covering 32 African languages:

  • Named Entity Recognition
  • Phrase Chunking
  • Part of Speech Tagging
  • News Classification
  • Sentiment Analysis
  • Topic Classification
  • Question Answering
  • Language Identification

It achieves state-of-the-art performance on 11 datasets and demonstrates strong generalization in zero-shot settings.
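Results on classification-style tasks like these are commonly summarized with macro-averaged F1, which weights every class equally regardless of frequency. A minimal, dependency-free sketch of that metric (the function name and label format are illustrative assumptions, not part of the AfroNLU tooling):

```python
from collections import Counter

def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per class, then average with equal class weight."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p predicted but wrong
            fn[g] += 1          # g missed
    f1_scores = []
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1_scores) / len(f1_scores)
```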

How to Use SERENGETI

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Replace "XXX" with your Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi")  # Yoruba

Supported Languages

SERENGETI supports 517 African languages and language varieties. See the full list in the official repo.

Ethics

SERENGETI aligns with Afrocentric NLP by prioritizing the technological needs of African communities. It promotes language preservation and inclusive access to NLP tools. Native speakers contributed to dataset validation, ensuring linguistic quality and reducing bias.

Citation

If you use SERENGETI in your research, please cite:

@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad  and
      Alcoba Inciarte, Alcides",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
}