LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

Sebastian Sztwiertnia; Felix Friedrich; Kristian Kersting; Patrick Schramowski; Björn Deiseroth

arXiv:2512.07522·cs.CL·December 9, 2025

Open Access

TL;DR

LIME enriches token embeddings with linguistic metadata to make language model pre-training more efficient, yielding faster adaptation to the training data, improved tokenization, and better reasoning and arithmetic accuracy across model scales from 500M to 2B.

Contribution

LIME incorporates linguistic metadata (syntax, semantics, and contextual properties) directly into token embeddings, using it as a training signal rather than only for dataset curation, which improves pre-training efficiency and language modeling capabilities.

Findings

01 Up to 56% faster adaptation to training data.

02 Only 0.01% additional parameters needed.

03 Improves reasoning accuracy by up to 38%.

Abstract

Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In…
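The idea of enriching token embeddings with metadata can be illustrated with a minimal sketch. The abstract does not say how the metadata embedding is combined with the token embedding, so the element-wise sum used here is an assumption, not the authors' stated method. The table sizes, the tag inventory, and the names `make_table`, `embed_with_metadata`, and `extra_parameters` are likewise hypothetical and chosen only for illustration.

```python
import random


def make_table(num_rows, dim, seed):
    """Build a small random embedding table as a list of row vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def embed_with_metadata(token_ids, tag_ids, token_table, tag_table):
    """Return one vector per position: token embedding + metadata embedding.

    ASSUMPTION: the two embeddings are combined by element-wise addition.
    The source abstract does not specify the combination rule.
    """
    if len(token_ids) != len(tag_ids):
        raise ValueError("need exactly one metadata tag per token")
    return [
        [t + m for t, m in zip(token_table[tok], tag_table[tag])]
        for tok, tag in zip(token_ids, tag_ids)
    ]


def extra_parameters(base_parameters, hundredths_of_percent=1):
    """Parameters added by an overhead given in units of 0.01%."""
    return base_parameters * hundredths_of_percent // 10_000


# Hypothetical toy sizes, far smaller than any real model.
VOCAB_SIZE = 1000
NUM_TAGS = 17  # hypothetical size of a linguistic tag inventory
DIM = 8

token_table = make_table(VOCAB_SIZE, DIM, seed=0)
tag_table = make_table(NUM_TAGS, DIM, seed=1)

token_ids = [12, 407, 3]  # made-up token ids
tag_ids = [5, 11, 0]      # made-up metadata tag ids, one per token

vectors = embed_with_metadata(token_ids, tag_ids, token_table, tag_table)

# The metadata table is the only new set of weights in this sketch.
added_weights = NUM_TAGS * DIM
```

In this sketch the only new weights are the rows of the metadata table, which is why such a scheme can stay small relative to the model. As plain arithmetic on the figures quoted in the abstract, `extra_parameters(500_000_000)` gives 50,000 and `extra_parameters(2_000_000_000)` gives 200,000, which is what a 0.01% overhead would amount to at the two ends of the reported scale range.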

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification