A New Metric for Evaluating Translation Across 1,000+ Languages
We're excited to release spBLEU1K, our enhanced translation evaluation metric designed for 1,000+ languages. Built to address the limitations of standard BLEU and existing tokenization-based scoring, spBLEU1K is part of our ACL 2024 Toucan release.
While traditional BLEU scores are highly sensitive to tokenization, spBLEU1K builds on SentencePiece-based approaches and significantly expands language coverage - including 614 African and 53 Indigenous American languages.
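BLEU's tokenization sensitivity is easy to see with a toy example. The sketch below is our own illustration, not the spBLEU1K implementation: it computes only clipped bigram precision (no brevity penalty, no geometric mean over n), then scores the same hypothesis/reference pair under two different tokenizers and gets two different numbers.

```python
import re
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n=2):
    """Clipped n-gram precision: the core quantity BLEU averages over n."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return matched / total if total else 0.0

hyp = "The cat's on the mat."
ref = "The cat is on the mat."

# Tokenizer 1: plain whitespace split ("cat's" and "mat." stay fused).
p_ws = ngram_precision(hyp.split(), ref.split())

# Tokenizer 2: split off punctuation ("cat's" -> "cat", "'", "s").
tokenize = lambda s: re.findall(r"\w+|[^\w\s]", s)
p_punct = ngram_precision(tokenize(hyp), tokenize(ref))

print(f"whitespace: {p_ws:.3f}  punctuation-split: {p_punct:.3f}")
```

The score changes purely because of how the text was segmented, which is exactly the inconsistency that SentencePiece-based metrics avoid by fixing one learned subword tokenizer for all languages.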
We collected monolingual data from 1,003 languages spanning Wikipedia, Wikibooks, religious texts, newspapers, and the MADLAD-400 corpus. To balance under-resourced languages, we applied temperature upsampling during SentencePiece model training.
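As a rough sketch of what temperature upsampling does (the exact recipe and temperature used in our training setup may differ; this is the standard p_i ∝ q_i^(1/T) formulation, applied here to made-up per-language line counts):

```python
def temperature_probs(line_counts, T=3.0):
    """Flatten a sampling distribution: p_i proportional to q_i ** (1/T)."""
    total = sum(line_counts)
    q = [c / total for c in line_counts]   # raw data proportions
    w = [p ** (1.0 / T) for p in q]        # temperature-scaled weights
    Z = sum(w)
    return [x / Z for x in w]

# Hypothetical per-language line counts (illustrative only).
counts = {"eng": 1_000_000, "yor": 10_000, "grn": 1_000}
probs = temperature_probs(list(counts.values()), T=3.0)

for lang, p in zip(counts, probs):
    print(f"{lang}: {p:.3f}")
```

With T = 1, sampling follows the raw proportions; larger T pushes the distribution toward uniform, so under-resourced languages contribute far more evidence to the subword vocabulary than their raw share of the data would allow.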
Previous metrics like spBLEU covered only 23 of the 43 languages in AfroLingu-MT. spBLEU1K broadens this to more than 1,000 languages - offering more inclusive and accurate evaluation for machine translation, especially of low-resource languages.
Install the required tools:
git clone https://github.com/UBC-NLP/Toucan.git
cd Toucan/spBLEU-1K/sacrebleu
pip install -e .
pip install evaluate
Now you can compute scores using HuggingFace's evaluate:
import evaluate
metric = evaluate.load('sacrebleu')
predictions = ['...']
references = [['...']]
# Default sacreBLEU
print('sacreBLEU =', metric.compute(predictions=predictions, references=references)['score'])
# spBLEU tokenizer
print('spBLEU =', metric.compute(tokenize='spm', predictions=predictions, references=references)['score'])
# spBLEU1K tokenizer
print('spBLEU1K =', metric.compute(tokenize='spBLEU-1K', predictions=predictions, references=references)['score'])
If you use spBLEU1K in your research, please cite:
@inproceedings{elmadany2024toucan,
title={Toucan: Many-to-Many Translation for 150 African Language Pairs},
author={Elmadany, Abdelrahim and Adebara, Ife and Abdul-Mageed, Muhammad},
booktitle={Findings of the Association for Computational Linguistics ACL 2024},
pages={13189--13206},
year={2024}
}