BgGPT 1.0: Extending English-centric LLMs to other languages
Anton Alexandrov, Veselin Raychev, Dimitar I. Dimitrov, Ce Zhang, Martin Vechev, Kristina Toutanova

TL;DR
This paper introduces BgGPT models optimized for Bulgarian, demonstrating strong language-specific performance while preserving English capabilities through continual learning and comprehensive benchmarking.
Contribution
The paper presents Bulgarian-optimized versions of the Gemma-2 9B and 27B models, built with continual pretraining, fine-tuning, and continual learning strategies, and reports detailed benchmarks for them.
Findings
The models perform strongly on Bulgarian language understanding and generation tasks.
English performance of the original Gemma-2 models is preserved after continual pretraining and fine-tuning.
Benchmark results support the effectiveness of the approach.
Abstract
We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google's Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2's multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that the English language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology, including the release of model weights with a…
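The abstract mentions continual learning based on Branch-and-Merge. As a rough illustration of the general idea only (train several branches from the same starting point on different data slices, merge them, then repeat), here is a toy sketch. It is not the authors' implementation: the uniform-averaging merge rule, the `toy_train` function, the dictionary-of-floats "model", and all names are assumptions made for this example.

```python
from typing import Callable, Dict, Sequence

# Toy stand-in for a model: parameter name -> scalar value (assumption).
Weights = Dict[str, float]


def merge_linear(models: Sequence[Weights]) -> Weights:
    """Uniformly average same-named parameters (an assumed merge rule)."""
    n = len(models)
    return {name: sum(m[name] for m in models) / n for name in models[0]}


def branch_and_merge(
    base: Weights,
    data_slices: Sequence[Sequence[float]],
    train: Callable[[Weights, Sequence[float]], Weights],
    branches_per_round: int = 2,
) -> Weights:
    """Repeatedly branch from the current model, train each branch on its
    own data slice, and merge the branches into the next starting point."""
    current = base
    for start in range(0, len(data_slices), branches_per_round):
        group = data_slices[start:start + branches_per_round]
        branches = [train(current, data_slice) for data_slice in group]
        current = merge_linear(branches)
    return current


def toy_train(weights: Weights, data_slice: Sequence[float]) -> Weights:
    """Hypothetical trainer: moves each parameter halfway to the slice mean."""
    target = sum(data_slice) / len(data_slice)
    return {k: v + 0.5 * (target - v) for k, v in weights.items()}


base = {"w": 0.0}
slices = [[1.0, 1.0], [3.0, 3.0], [2.0], [2.0]]
merged = branch_and_merge(base, slices, toy_train)
print(merged)
```

In this toy setup each merged model stays between its branches, which is the intuition behind using merging to limit drift from the base model; real merging of LLM checkpoints operates on full parameter tensors and may use other merge rules.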
Code & Models
- 🤗 llm-bg/Tucan-2.6B-v1.0 (model) · 13 dl · ♡ 4
- 🤗 llm-bg/Tucan-2.6B-v1.0-LoRA (model)
- 🤗 llm-bg/Tucan-2.6B-v1.0-GGUF (model) · 148 dl · ♡ 1
- 🤗 llm-bg/Tucan-9B-v1.0-LoRA (model)
- 🤗 llm-bg/Tucan-9B-v1.0 (model) · 7 dl · ♡ 2
- 🤗 llm-bg/Tucan-9B-v1.0-GGUF (model) · 133 dl · ♡ 1
- 🤗 llm-bg/Tucan-27B-v1.0 (model) · 12 dl · ♡ 2
- 🤗 llm-bg/Tucan-27B-v1.0-LoRA (model)
- 🤗 llm-bg/Tucan-27B-v1.0-GGUF (model) · 112 dl · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Library Science and Information Systems
Methods: Balanced Selection · Sparse Evolutionary Training
