Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

TL;DR
Goldfish introduces small monolingual language models for 350 languages, outperforming large multilingual models in perplexity and grammaticality, and provides the first public monolingual models for 215 languages to aid low-resource language research.
Contribution
The paper presents a suite of small monolingual models for 350 languages that outperform large multilingual models in perplexity and on a grammaticality benchmark, and it releases the first public monolingual models for 215 of those languages.
Findings
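Small monolingual models (125M parameters, trained on 1GB of data or less) outperform large multilingual models in perplexity.
The same small models also outperform large multilingual models on a massively multilingual grammaticality benchmark.
Goldfish provides the first public monolingual models for 215 languages.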
Abstract
For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350…
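The bigram comparison in the abstract can be illustrated with a minimal sketch of a bigram language model and its per-token perplexity. This is not the paper's evaluation code: the whitespace-style token lists, the add-one smoothing, and the function names `train_bigram` and `perplexity` are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams over tokenized sentences.

    Each sentence is a list of tokens; <s> and </s> mark its boundaries.
    """
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        seq = ["<s>"] + tokens + ["</s>"]
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)

def perplexity(sentences, contexts, bigrams, vocab_size):
    """Per-token perplexity under add-one smoothing (an assumed choice).

    Tokens never seen in training are not handled specially here, so this
    is only a rough illustration of the metric.
    """
    log_prob, n_tokens = 0.0, 0
    for tokens in sentences:
        seq = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(seq, seq[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Toy data: a familiar word order should score lower than a scrambled one.
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab_size = train_bigram(train)
seen = perplexity([["the", "cat", "sat"]], contexts, bigrams, vocab_size)
scrambled = perplexity([["sat", "cat", "the"]], contexts, bigrams, vocab_size)
```

Lower perplexity means the model assigns higher probability to the text. A large model scoring worse than a baseline of this kind on a language suggests it has learned very little about that language's local word order.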
Code & Models
- 🤗 goldfish-models/afr_latn_1000mb (model, 4 downloads)
- 🤗 goldfish-models/amh_ethi_1000mb (model, 19 downloads)
- 🤗 goldfish-models/arb_arab_1000mb (model, 251 downloads)
- 🤗 goldfish-models/aze_latn_1000mb (model, 4 downloads)
- 🤗 goldfish-models/bel_cyrl_1000mb (model, 17 downloads)
- 🤗 goldfish-models/ben_beng_1000mb (model, 16 downloads)
- 🤗 goldfish-models/bos_cyrl_1000mb (model, 1 download)
- 🤗 goldfish-models/bos_latn_1000mb (model, 4 downloads)
- 🤗 goldfish-models/bul_cyrl_1000mb (model, 16 downloads)
- 🤗 goldfish-models/cat_latn_1000mb (model, 171 downloads)
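The identifiers in this list, such as goldfish-models/afr_latn_1000mb, appear to follow a language_script_datasize pattern. The sketch below splits an identifier into those three parts. The pattern is inferred from the names listed here rather than stated on this page, and the `GoldfishName` type and `parse_model_id` function are hypothetical helpers.

```python
from typing import NamedTuple

class GoldfishName(NamedTuple):
    language: str   # e.g. "afr" (looks like a three-letter language code)
    script: str     # e.g. "latn" (looks like a four-letter script code)
    data_size: str  # e.g. "1000mb" (assumed to be the training data size)

def parse_model_id(model_id: str) -> GoldfishName:
    """Split an id such as 'goldfish-models/afr_latn_1000mb' into its parts.

    Assumes the part after the last '/' has exactly three '_'-separated
    fields; raises ValueError otherwise.
    """
    name = model_id.split("/")[-1]
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected model name: {name!r}")
    return GoldfishName(*parts)

parsed = parse_model_id("goldfish-models/bos_cyrl_1000mb")
```

Separating the script from the language code is useful for entries such as bos_cyrl and bos_latn, where the same language appears in two writing systems.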
Taxonomy
Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods · Language and cultural evolution
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings
