Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure
Zsolt Csibi (2), Bence György Gortka (1), Natabara Gyöngyössy (2), Kornél Nagy (1), Dávid Márk Nemeskey (1), Martin Sallai (1), András Simonyi (2), András Márk Szekeres (1), Gábor Palkó (1) ((1) Department of Digital Humanities

TL;DR
Racka is a lightweight Hungarian language model that uses parameter-efficient continual pretraining to improve Hungarian capabilities while maintaining performance in English and German. The training recipe is suited to resource-constrained HPC environments.
Contribution
The paper introduces Racka, a Hungarian LLM built on a Qwen-3 4B backbone that combines LoRA-based continual pretraining with tokenizer adaptation for resource-efficient language model development.
Findings
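Language adaptation yields modest but stable results.
Tokenizer adaptation substantially improves tokenization fertility for Hungarian while keeping English and German performance competitive.
The model is trained on 160B subword tokens: 44% Hungarian, 24% English, 21% German, and 11% code.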
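Tokenization fertility is commonly measured as the average number of subword tokens produced per word, with lower values meaning a word is split into fewer pieces. The sketch below illustrates that calculation under stated assumptions: `fertility` and `chunk_tokenizer` are hypothetical helpers written for this example, the toy tokenizers simply cut words into fixed-size chunks, and none of this is the paper's actual tokenizer or evaluation code.

```python
from typing import Callable, List


def fertility(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cuts every word into chunks of `size` characters."""
    def tokenize(text: str) -> List[str]:
        return [
            word[i:i + size]
            for word in text.split()
            for i in range(0, len(word), size)
        ]
    return tokenize


# Illustrative sample only; a real comparison would use a held-out Hungarian corpus.
sample = "magyar nyelv tokenizer teszt"

# A tokenizer with longer pieces stands in for a vocabulary adapted to the language.
coarse = fertility(chunk_tokenizer(6), sample)
# A tokenizer with short pieces stands in for a vocabulary that fragments words.
fine = fertility(chunk_tokenizer(2), sample)

print(f"coarse pieces: {coarse:.2f} tokens/word")
print(f"fine pieces:   {fine:.2f} tokens/word")
```

With a real tokenizer, the same function could be given its encode method wrapped to return a list of tokens; the comparison of interest is the fertility of the original vocabulary against the adapted one on the same text.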
Abstract
We present Racka, a lightweight, continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages such as English and German. Racka employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen-3 4B backbone, making the recipe practical on A100 (40GB)-based HPC clusters with low inter-node bandwidth. To better match the training distribution, we replace and adapt the tokenizer, achieving substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German. The model is trained on 160B subword tokens drawn from a mixture of internet and high-quality curated sources, with a composition of 44% Hungarian, 24% English, 21% German, and 11% code. This data mix is chosen to mitigate catastrophic forgetting and preserve high-resource language…
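The stated data mix translates directly into per-source token budgets. The following is a minimal arithmetic sketch using only the figures quoted in the abstract (160B tokens; 44% Hungarian, 24% English, 21% German, 11% code); the function name `token_budget` is an assumption made for illustration and is not taken from the paper.

```python
from typing import Dict

TOTAL_TOKENS = 160_000_000_000  # 160B subword tokens, as stated in the abstract

# Mixture percentages as stated in the abstract.
MIX_PERCENT: Dict[str, int] = {
    "Hungarian": 44,
    "English": 24,
    "German": 21,
    "code": 11,
}


def token_budget(total: int, mix_percent: Dict[str, int]) -> Dict[str, int]:
    """Split a total token count according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name:>9}: {tokens / 1e9:.1f}B tokens")
```

Under these figures, Hungarian accounts for roughly 70.4B tokens, with the remaining 89.6B spread across English, German, and code.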
Taxonomy
Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
