Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
Patrick Haller, Jonas Golde, Alan Akbik

TL;DR
This paper introduces BLaLM, a sample-efficient language model using linear attention and lightweight techniques, demonstrating improved zero-shot performance and training stability in low-resource settings.
Contribution
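The paper presents BLaLM, an architecture that replaces self-attention with a linear-time mLSTM token mixer and adds lightweight enhancements (short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps), trained on a curated corpus that emphasizes readability and pedagogical structure.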
Findings
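Linear attention combined with sliding window attention consistently improves zero-shot performance on both the STRICT and STRICT-SMALL tracks
The Muon optimizer stabilizes convergence and reduces perplexity relative to AdamW
Together, these techniques support efficient language modeling in low-resource settings without relying on scale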
Abstract
We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
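The two attention styles named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' mLSTM mixer: it omits gating, short convolutions, and dynamic modulation, and it assumes an `elu(x) + 1` feature map in place of the Hedgehog feature maps the paper explores. All function names are hypothetical. The sketch shows why linear attention runs in linear time (a running state replaces the full attention matrix) and how a sliding window restricts softmax attention to recent tokens.

```python
import math


def feature_map(x):
    # Assumption: elu(x) + 1, a common positive feature map for linear
    # attention. The paper explores Hedgehog feature maps instead.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention computed with a running state, O(T) in length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(queries, keys, values):
    """Same result as above, computed the O(T^2) way for comparison."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs


def sliding_window_attention(queries, keys, values, window):
    """Causal softmax attention restricted to the last `window` positions."""
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for t, q in enumerate(queries):
        start = max(0, t - window + 1)
        scores = [
            scale * sum(a * b for a, b in zip(q, keys[s]))
            for s in range(start, t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        z = sum(exps)
        outputs.append(
            [sum(e * values[start + i][j] for i, e in enumerate(exps)) / z
             for j in range(d_v)]
        )
    return outputs
```

Calling `linear_attention_recurrent` and `linear_attention_quadratic` on the same inputs should give matching outputs up to floating-point error, which is the identity that lets a linear-time mixer avoid materializing the attention matrix. With `window=1`, `sliding_window_attention` attends only to the current token and returns the values unchanged.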
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
