Training-Inference Consistent Segmented Execution for Long-Context LLMs

Xianpeng Shang; Jiang Li; Zehua Duo; Qianyi Cai; Xiangdong Su

arXiv:2605.11744 · cs.CL · May 13, 2026

TL;DR

This paper introduces a segment-level generation framework for long-context LLMs in which training and inference follow the same forward execution semantics. It cuts peak prefill memory by approximately 6x at 128K context length while keeping performance comparable to full-context attention.

Contribution

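It proposes a segment-level generation method that aligns training with inference: during training, gradient propagation is restricted to KV states carried over from the immediately preceding segment, so the model is trained under the same bounded, segment-by-segment execution it uses at inference time.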

Findings

01. Achieves performance comparable to full-context attention on benchmarks.

02. Reduces peak prefill memory by approximately 6x at 128K context length.

03. Offers competitive latency-memory trade-offs against inference-efficient baselines.

Abstract

Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting…
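The segment-level execution described in the abstract can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the function names, the fixed segment length, and the `carry_len` cap on carried-over KV entries are all assumptions made for illustration. The only rule taken from the abstract is that gradients reach the KV state of the immediately preceding segment. The count of attention-score entries is a rough proxy for per-step cost, not the paper's memory measurement.

```python
from typing import Dict, List, Tuple


def segment_spans(seq_len: int, segment_len: int) -> List[Tuple[int, int]]:
    """Split [0, seq_len) into consecutive half-open spans of at most segment_len tokens."""
    if seq_len < 0 or segment_len <= 0:
        raise ValueError("seq_len must be >= 0 and segment_len must be > 0")
    return [(s, min(s + segment_len, seq_len)) for s in range(0, seq_len, segment_len)]


def gradient_sources(num_segments: int) -> Dict[int, List[int]]:
    """Map each segment to the earlier segments whose carried KV state receives gradient.

    Rule taken from the abstract: only the immediately preceding segment.
    """
    return {i: ([i - 1] if i > 0 else []) for i in range(num_segments)}


def peak_score_entries(seq_len: int, segment_len: int, carry_len: int) -> int:
    """Largest per-head score matrix (queries x keys) over all segment steps.

    carry_len is a hypothetical cap on how many KV entries are carried forward.
    """
    peak = 0
    carried = 0  # KV entries carried in from earlier execution
    for start, end in segment_spans(seq_len, segment_len):
        queries = end - start
        keys = carried + queries  # carried-over KV plus this segment's own KV
        peak = max(peak, queries * keys)
        carried = min(carry_len, keys)  # bounded state handed to the next segment
    return peak


if __name__ == "__main__":
    print(segment_spans(10, 4))
    print(gradient_sources(3))
    print(peak_score_entries(seq_len=8, segment_len=4, carry_len=4), "vs full", 8 * 8)
```

Under these assumptions the per-step score matrix is bounded by the segment and carry lengths and does not grow with the full sequence, which is the intuition behind bounded-context execution. The sketch does not attempt to reproduce the reported 6x figure.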
