Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
Jinchang Zhu, Jindong Li, Chengyu Zou, Rong Fu, Chao Wang, Haowei He, and Menglin Yang

TL;DR
This paper introduces EXACT, a supervision-allocation method that improves long-context adaptation in language models by upweighting target tokens with long effective contexts. It improves all 28 trained/extrapolated NoLiMa and RULER comparisons across seven Qwen/LLaMA configurations.
Contribution
The paper proposes EXACT, a supervision-allocation objective for packed training with document masking. It improves long-context adaptation by assigning extra weight, by inverse frequency within the long tail, to target tokens whose effective context is long.
Findings
EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons across seven Qwen/LLaMA CPT configurations.
Cases where the evidence is thousands of tokens away benefit most, while short-distance cases remain unchanged.
Standard QA and reasoning performance is preserved (+0.24 macro change across six benchmarks).
Abstract
Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis:…
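The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `tail_threshold` and `bin_width` parameters, the binning of effective-context lengths, and the normalization of the extra weight are all assumptions made for illustration. The sketch only shows how, under document masking, each target's effective context is bounded by its position inside its own document, and how long-context targets could be upweighted by inverse frequency within the long tail.

```python
from collections import Counter


def effective_context_lengths(doc_lengths):
    """Effective context per target token in one packed sequence.

    With document masking, a token can only attend to earlier tokens of
    its own document, so its effective context is its offset in that
    document, regardless of how long the packed sequence is.
    """
    lengths = []
    for n in doc_lengths:
        lengths.extend(range(n))
    return lengths


def tail_inverse_frequency_weights(eff_ctx, tail_threshold, bin_width, extra=1.0):
    """Hypothetical weighting: 1.0 for short-context targets, and
    1.0 + extra * (normalized inverse bin frequency) for tail targets.

    The threshold, the fixed-width bins, and the normalization (mean
    extra weight over tail tokens equals `extra`) are assumptions.
    """
    tail_bins = [c // bin_width for c in eff_ctx if c >= tail_threshold]
    if not tail_bins:
        return [1.0] * len(eff_ctx)
    counts = Counter(tail_bins)
    # Mean of 1/count over tail tokens is n_bins / n_tail; rescale it to 1.
    scale = len(tail_bins) / len(counts)
    weights = []
    for c in eff_ctx:
        if c < tail_threshold:
            weights.append(1.0)
        else:
            weights.append(1.0 + extra * scale / counts[c // bin_width])
    return weights


def weighted_loss(token_losses, weights):
    """Weight-normalized average of per-token losses."""
    total = sum(w * l for w, l in zip(weights, token_losses))
    return total / sum(weights)


# Toy packed sequence: two documents of 3 and 5 tokens.
eff = effective_context_lengths([3, 5])
w = tail_inverse_frequency_weights(eff, tail_threshold=2, bin_width=2)
```

In this toy example, the packed sequence holds eight tokens but no target sees more than four preceding tokens, and the single token in the rarest tail bin receives the largest weight. In practice, the per-token losses passed to `weighted_loss` would come from the model's cross-entropy on each target.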
