Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

Dikshant Kukreja (1), Kshitij Sah (1), Gautam Gupta (1), Avinash Anand (4), Rajiv Ratn Shah (1), Zhengkui Wang (4), Aik Beng Ng (3), Erik Cambria (2) ((1) IIIT Delhi, India, (2) Nanyang Technological University, (3) NVIDIA, (4) Singapore Institute of Technology)

arXiv:2604.13275·cs.CL·April 16, 2026

TL;DR

Larger language models diverge in how they handle context: they become better at ignoring false claims but worse at ignoring irrelevant tokens, and the scaling laws show opposing trends for semantic and non-semantic contexts.

Contribution

This paper formalizes the first scaling laws for contextual entrainment and analyzes how model size affects different types of context handling in language models.

Findings

01 Larger models are more resistant to counterfactual misinformation.

02 Larger models are more prone to copying arbitrary tokens.

03 Entrainment follows predictable power-law scaling with opposite trends for semantic and non-semantic contexts.

Abstract

Larger language models become simultaneously better and worse at handling contextual information -- better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families,…
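To make the "opposite power-law trends" claim concrete, here is a minimal sketch of how a power law of the form E(N) ≈ a · N^b could be fitted to entrainment scores measured at several model sizes. It assumes a plain least-squares fit in log-log space, which may differ from the fitting procedure the authors used. The function name `fit_power_law`, the parameter counts, and the entrainment values are all hypothetical and synthetic; none of the numbers come from the paper.

```python
import math


def fit_power_law(sizes, values):
    """Fit values ≈ a * sizes**b by ordinary least squares in log-log space.

    Returns (a, b). A negative b means the quantity shrinks with scale,
    a positive b means it grows with scale.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                         # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept in log-log space = log(a)
    return a, b


# Hypothetical parameter counts (illustrative only, not the paper's models).
sizes = [1e8, 1e9, 1e10]

# Synthetic entrainment scores generated from known power laws
# (NOT measurements from the paper): one decreasing, one increasing.
semantic = [2.0 * n ** -0.25 for n in sizes]
non_semantic = [0.01 * n ** 0.2 for n in sizes]

a_sem, b_sem = fit_power_law(sizes, semantic)
a_non, b_non = fit_power_law(sizes, non_semantic)

print(f"semantic exponent:     {b_sem:+.3f}")
print(f"non-semantic exponent: {b_non:+.3f}")
```

In this sketch, the sign of the fitted exponent is what separates the two regimes: a negative exponent corresponds to entrainment decreasing with scale (the semantic case in the abstract), and a positive exponent to entrainment increasing with scale (the non-semantic case).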
