Tagengo: A Multilingual Chat Dataset
Peter Devine

TL;DR
This paper introduces Tagengo, a high-quality multilingual chat dataset with over 70,000 prompt-response pairs across 74 languages, used to train and evaluate a multilingual open source language model that outperforms previous models.
Contribution
The creation of a large, diverse multilingual dataset and the training of a state-of-the-art open source multilingual chat model demonstrating improved performance across languages.
Findings
Multilingual model outperforms previous open source models in 6 languages.
Training on diverse multilingual data benefits performance in individual languages.
High-quality multilingual data is essential for accessible language models.
Abstract
Open source large language models (LLMs) have shown great improvements in recent times. However, many of these models are focused solely on popular spoken languages. We present a high quality dataset of more than 70k prompt-response pairs in 74 languages which consist of human generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs across each language. We further find that training on more multilingual data is beneficial to the performance in a chosen target language (Japanese) compared to simply training on only data in that language. These results indicate the necessity of training on large amounts of high quality multilingual data to make a more…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗lightblue/suzume-llama-3-8B-japanese-ggufmodel· 171 dl· ♡ 12171 dl♡ 12
- 🤗lightblue/suzume-llama-3-8B-japanesemodel· 25 dl· ♡ 2425 dl♡ 24
- 🤗lightblue/suzume-llama-3-8B-multilingual-ggufmodel· 88 dl· ♡ 2788 dl♡ 27
- 🤗lightblue/suzume-llama-3-8B-multilingualmodel· 14k dl· ♡ 11414k dl♡ 114
- 🤗QuantFactory/suzume-llama-3-8B-japanese-GGUFmodel· 17 dl· ♡ 117 dl♡ 1
- 🤗RichardErkhov/lightblue_-_suzume-llama-3-8B-japanese-ggufmodel· 36 dl36 dl
- 🤗RichardErkhov/lightblue_-_suzume-llama-3-8B-multilingual-ggufmodel· 13 dl13 dl
- 🤗ptrdvn/suzume-llama-3-8B-multilingualmodel
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital Communication and Language
