Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

H S V N S Kowndinya Renduchintala; Sumit Bhatia

arXiv:2604.17930·cs.CL·April 21, 2026

TL;DR

This study shows that targeted data augmentation can substantially reduce the linguistic shortcomings of small language models, which points to data composition, rather than architecture, as the main bottleneck.

Contribution

The paper demonstrates that injecting a minimal amount (1%) of synthetic data can substantially improve the performance of small language models on difficult linguistic phenomena.
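The injection step can be sketched as a simple corpus-mixing routine. This is a minimal illustration, not the authors' pipeline: the function names `count_tokens` and `inject_synthetic` are hypothetical, whitespace splitting stands in for a real tokenizer, and the 1% figure from the abstract is assumed here to mean the synthetic share of the final mixed corpus.

```python
import random


def count_tokens(doc):
    # Assumption: whitespace tokens stand in for a real subword tokenizer.
    return len(doc.split())


def inject_synthetic(base_docs, synthetic_docs, fraction=0.01, seed=0):
    """Mix synthetic documents into a base corpus.

    Returns the shuffled mixed corpus and the achieved synthetic token share.
    Assumption: `fraction` is the synthetic share of the final token count.
    """
    rng = random.Random(seed)
    base_tokens = sum(count_tokens(d) for d in base_docs)

    # Solve s / (base_tokens + s) = fraction for the synthetic budget s.
    budget = round(fraction * base_tokens / (1.0 - fraction))

    pool = list(synthetic_docs)
    rng.shuffle(pool)

    chosen, used = [], 0
    for doc in pool:
        n = count_tokens(doc)
        if used + n > budget:
            continue  # skip documents that would exceed the budget
        chosen.append(doc)
        used += n

    mixed = list(base_docs) + chosen
    rng.shuffle(mixed)

    total = base_tokens + used
    share = used / total if total else 0.0
    return mixed, share
```

Under these assumptions, a 100M-token base corpus would receive roughly 1M synthetic tokens. If the 1% were instead measured against the base corpus alone, the budget line would simplify to `round(fraction * base_tokens)`.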

Findings

01

Targeted data injection improved 8 of the 9 worst-performing BLiMP paradigms.

02

Accuracy on only_npi_scope increased from 20.9% to 69.4%.

03

Most phenomena were unaffected or improved slightly after the intervention.

Abstract

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply from the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms; notably, accuracy on one paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that…
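For context on the accuracy figures: BLiMP paradigms consist of minimal pairs, and a model is counted correct on a pair when it assigns a higher probability to the acceptable sentence than to the unacceptable one, so chance is 50%. The sketch below illustrates that metric only. The function name `minimal_pair_accuracy`, the example sentences, and the toy log-probabilities are all made up for illustration; in practice `score` would be the summed token log-probability from the language model.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the acceptable sentence scores higher.

    pairs: iterable of (acceptable, unacceptable) sentence tuples.
    score: callable mapping a sentence to a log-probability.
    Ties are counted as incorrect (a design choice in this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Illustrative pairs and invented log-probabilities, not real model outputs.
toy_pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("Only Bill would ever complain.", "Even Bill would ever complain."),
]
toy_logprobs = {
    "The cats sleep.": -12.0,
    "The cats sleeps.": -15.5,
    "Only Bill would ever complain.": -30.2,
    "Even Bill would ever complain.": -29.8,
}

accuracy = minimal_pair_accuracy(toy_pairs, toy_logprobs.__getitem__)
```

Here the toy scorer prefers the acceptable sentence in the first pair but not in the second, so `accuracy` is 0.5. A paradigm score well below 0.5, such as the 20.9% reported for only_npi_scope, means the model systematically prefers the unacceptable sentence.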

Code & Models

Repositories

kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence (GitHub)
