LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth

TL;DR
LIME introduces linguistic metadata embeddings to make language model pre-training more efficient and effective, achieving faster adaptation, improved tokenization, and better reasoning and arithmetic accuracy across model scales (500M to 2B).
Contribution
LIME is the first method to incorporate linguistic metadata directly into token embeddings, significantly boosting pre-training efficiency and language modeling capabilities.
Findings
Up to 56% faster adaptation to training data.
Only 0.01% additional parameters needed.
Improves reasoning accuracy by up to 38%.
Abstract
Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In…
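To make the idea of enriching token embeddings with metadata concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the metadata is combined with the token embedding, so the element-wise sum, the one-tag-per-token layout, the toy table sizes, and all names (`make_table`, `enrich`, `extra_param_fraction`) are assumptions made for illustration only.

```python
import random


def make_table(rows, dim, seed):
    """Build a toy embedding table (rows x dim) with small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def enrich(token_ids, tag_ids, token_table, tag_table):
    """Add a metadata embedding to each token embedding, position by position.

    Assumption: one metadata tag per token, combined by element-wise sum.
    """
    if len(token_ids) != len(tag_ids):
        raise ValueError("need exactly one metadata tag per token")
    return [
        [t + m for t, m in zip(token_table[tok], tag_table[tag])]
        for tok, tag in zip(token_ids, tag_ids)
    ]


def extra_param_fraction(num_tags, dim, base_params):
    """Fraction of extra parameters added by a (num_tags x dim) metadata table."""
    return num_tags * dim / base_params


# Hypothetical toy sizes, chosen only so the example runs quickly.
VOCAB, NUM_TAGS, DIM = 100, 5, 8
token_table = make_table(VOCAB, DIM, seed=0)
tag_table = make_table(NUM_TAGS, DIM, seed=1)

# Three tokens, each paired with a hypothetical metadata tag id.
enriched = enrich([3, 17, 42], [0, 2, 1], token_table, tag_table)

# Illustrative overhead: 50 tags of width 1024 on a 500M-parameter model.
overhead = extra_param_fraction(50, 1024, 500_000_000)
print(f"{overhead:.4%} additional parameters")
```

With these made-up sizes the metadata table adds roughly one hundredth of a percent of parameters, the same order of magnitude as the 0.01% figure quoted in the abstract. The real tag inventory and embedding width are not given in this summary.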
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
