Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla

TL;DR
This paper introduces two multilingual large language models, Teuken 7B-base and Teuken 7B-instruct, designed to support all 24 EU languages, demonstrating strong multilingual performance and addressing limitations of English-centric models.
Contribution
The paper presents new multilingual LLMs supporting all EU languages, with tailored data, tokenizer, and training methods, filling a gap in European language AI resources.
Findings
Strong performance on European multilingual benchmarks
Supports all 24 EU official languages
Addresses English-centric bias in existing LLMs
Abstract
We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, and TruthfulQA.
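The abstract names tokenizer optimization as one of the development principles but does not say here how tokenizers were compared. One widely used measure for multilingual tokenizers is fertility, the average number of tokens produced per word; lower values mean text in that language is encoded more compactly. The sketch below is only an illustration under that assumption: `fertility` and `char_chunk_tokenizer` are hypothetical helpers, and the chunking tokenizer is a toy stand-in, not the Teuken tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word over all texts."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def char_chunk_tokenizer(chunk_size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cuts each word into fixed-size character chunks.

    A real subword tokenizer would be plugged in here instead; this stand-in
    only exists so the example runs without downloading a model.
    """
    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(
                word[i:i + chunk_size] for i in range(0, len(word), chunk_size)
            )
        return tokens

    return tokenize


if __name__ == "__main__":
    # Illustrative sentences only; not drawn from the paper's data.
    samples = {
        "en": ["the cat sat on the mat"],
        "de": ["Donaudampfschifffahrt ist ein langes Wort"],
    }
    tok = char_chunk_tokenizer(4)
    for lang, texts in samples.items():
        print(lang, round(fertility(tok, texts), 2))
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, so the same function can compare a real tokenizer across languages by computing one fertility value per language on comparable text.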
Code & Models
- 🤗 openGPT-X/Teuken-7B-instruct-research-v0.4 · model · 1.9k dl · ♡ 89
- 🤗 openGPT-X/Teuken-7B-instruct-commercial-v0.4 · model · 1.5k dl · ♡ 74
- 🤗 KnutJaegersberg/Teuken-7B-instruct-commercial-v0.4-8.0bpw-exl2 · model
- 🤗 KnutJaegersberg/Teuken-7B-instruct-research-v0.4-8.0bpw-exl2 · model · 1 dl
- 🤗 QuantFactory/Teuken-7B-instruct-research-v0.4-GGUF · model · 281 dl · ♡ 2
- 🤗 QuantFactory/Teuken-7B-instruct-commercial-v0.4-GGUF · model · 293 dl · ♡ 2
- 🤗 stelterlab/Teuken-7B-instruct-commercial-v0.4-AWQ · model
- 🤗 openGPT-X/Teuken-7B-base-v0.6 · model · 401 dl · ♡ 9
- 🤗 rhcl/Teuken-fientuun · model · 1 dl
- 🤗 RichardErkhov/openGPT-X_-_Teuken-7B-instruct-commercial-v0.4-awq · model · 1 dl
Taxonomy
Topics: Library Science and Information Systems
Methods: Focus
