Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements

Patrick Haller; Jonas Golde; Alan Akbik

arXiv:2511.05560 · cs.CL · November 11, 2025

TL;DR

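This paper introduces BLaLM, a sample-efficient language model that replaces self-attention with a linear-time mLSTM token mixer. Under the BabyLM 2025 data constraints, combining linear attention with sliding window attention improves zero-shot performance, and the Muon optimizer stabilizes convergence and lowers perplexity relative to AdamW.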

Contribution

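It presents an architecture that combines a linear-time mLSTM token mixer with lightweight enhancements (short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps), trained on a curated corpus that emphasizes readability and pedagogical structure.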

Findings

01

Linear attention combined with sliding window attention consistently improves zero-shot performance

02

The Muon optimizer stabilizes convergence and reduces perplexity compared with AdamW

03

The results point to effective strategies for efficient language modeling without relying on scale

Abstract

We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
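To make the two attention ideas named in the abstract concrete, here is a minimal sketch in plain Python. It is not BLaLM's implementation: the abstract does not give the mLSTM or Hedgehog equations, so the sketch swaps in a generic kernelized linear attention with an `elu(x) + 1` feature map, and all function names are hypothetical. It shows why the cost is linear in sequence length (a fixed-size running state replaces the full attention matrix) and what a causal sliding window mask looks like.

```python
import math


def feature_map(x):
    # elu(x) + 1, a simple strictly positive feature map.
    # Assumption: a stand-in for illustration, not Hedgehog and not mLSTM gating.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Causal linear attention computed with a running state, O(n) in length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(queries, keys, values):
    """Same quantity computed the O(n^2) way, used only to check the recurrence."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs


def sliding_window_mask(seq_len, window):
    """mask[t][s] is True when position t may attend to position s."""
    return [[0 <= t - s < window for s in range(seq_len)]
            for t in range(seq_len)]
```

The recurrent and quadratic forms produce the same outputs; the recurrent form simply never materializes the full attention matrix. In a hybrid design, a mask like the one from `sliding_window_mask` would restrict an ordinary softmax attention to the most recent `window` tokens, while the linear branch carries longer-range context. How BLaLM combines and modulates the two is not specified in this abstract.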

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques