Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi
Shiza Fatimah, Aniket Sen, Sophia Falk, Florian Mai, Lucie Flek, Nicholas Kluge Corrêa

TL;DR
This paper presents LilMoo, a 0.6-billion-parameter Hindi language model trained from scratch with a transparent, reproducible pipeline. It outperforms comparably sized multilingual models and targets the linguistic inequalities that low-resource languages face in NLP.
Contribution
Introduces LilMoo, a Hindi language model trained entirely from scratch on a high-quality dataset with optimized training recipes, challenging the reliance on multilingual foundation models.
Findings
LilMoo outperforms comparably sized multilingual models such as Qwen2.5 and Qwen3.
A high-quality Hindi corpus (GigaLekh) was constructed using heuristic and LLM-based filtering.
Carefully designed language-specific pretraining can match larger multilingual models at small scales.
Abstract
The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized…
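The heuristic side of the corpus filtering described in the abstract can be pictured with a minimal sketch. It is not the authors' GigaLekh pipeline: the function names `devanagari_ratio` and `keep_document`, and the thresholds `min_chars` and `min_ratio`, are hypothetical choices made only for illustration. It assumes that a simple script-ratio check is a plausible first-pass filter for Hindi text.

```python
# Illustrative sketch only: NOT the GigaLekh filtering pipeline.
# All names and thresholds below are hypothetical.

def devanagari_ratio(text: str) -> float:
    """Fraction of letter characters that lie in the Devanagari Unicode block."""
    # str.isalpha() keeps letters (consonants, independent vowels) and skips
    # digits, punctuation, whitespace, and combining vowel signs.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters)


def keep_document(text: str, min_chars: int = 200, min_ratio: float = 0.8) -> bool:
    """Keep a document if it is long enough and mostly written in Devanagari."""
    long_enough = len(text.strip()) >= min_chars
    mostly_hindi_script = devanagari_ratio(text) >= min_ratio
    return long_enough and mostly_hindi_script


if __name__ == "__main__":
    samples = [
        "यह हिंदी है " * 20,        # long, Devanagari-only text
        "This is English text. " * 20,  # long, Latin-only text
        "यह",                       # too short to keep
    ]
    for doc in samples:
        print(keep_document(doc), repr(doc[:20]))
```

A learned (LLM-as-a-judge) filter would sit after a cheap check like this one, so the more expensive model only scores documents that already pass the script and length tests.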
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
