Training-Inference Consistent Segmented Execution for Long-Context LLMs
Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai, Xiangdong Su

TL;DR
This paper introduces a training-inference consistent segmentation framework for long-context LLMs, improving scalability and efficiency while maintaining performance comparable to full-context attention.
Contribution
The paper proposes a segment-level generation method in which training and inference follow the same forward execution semantics, reducing memory usage and improving scalability for long-context models.
Findings
Achieves performance comparable to full-context attention on benchmarks.
Reduces peak prefill memory by approximately 6x at 128K context length.
Offers competitive latency-memory trade-offs against inference-efficient baselines.
Abstract
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention. The result is a mismatch between training and inference in both execution and state-transition semantics. Motivated by this observation, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting…
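The gradient restriction described in the abstract can be pictured with a small bookkeeping sketch. This is not the authors' implementation: it only plans how a sequence would be split into segments and records, for each segment, the single preceding segment whose carried-over KV states would receive gradients. The names `SegmentStep`, `plan_segments`, and `max_grad_span` are hypothetical, as is the fixed segment length. How older KV states are handled is not visible in the truncated abstract, so the sketch leaves that out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SegmentStep:
    """One forward step over a contiguous segment (hypothetical structure)."""
    index: int                    # position of this segment in the sequence
    start: int                    # first token position, inclusive
    end: int                      # last token position, exclusive
    grad_kv_from: Optional[int]   # segment whose carried KV receives gradients


def plan_segments(num_tokens: int, segment_len: int) -> List[SegmentStep]:
    """Split a sequence into fixed-length segments (assumed, not from the paper).

    Each step lets gradients reach only the KV states carried over from the
    immediately preceding segment; the first segment has no predecessor.
    """
    if num_tokens < 0 or segment_len <= 0:
        raise ValueError("num_tokens must be >= 0 and segment_len must be > 0")
    steps: List[SegmentStep] = []
    for index, start in enumerate(range(0, num_tokens, segment_len)):
        end = min(start + segment_len, num_tokens)
        prev = index - 1 if index > 0 else None
        steps.append(SegmentStep(index, start, end, prev))
    return steps


def max_grad_span(steps: List[SegmentStep]) -> int:
    """Largest number of token positions gradients can reach in any one step."""
    longest = 0
    for step in steps:
        span = step.end - step.start
        if step.grad_kv_from is not None:
            prev = steps[step.grad_kv_from]
            span += prev.end - prev.start
        longest = max(longest, span)
    return longest
```

For example, `plan_segments(10, 4)` yields three steps covering positions 0-4, 4-8, and 8-10, and `max_grad_span` on that plan returns 8. Under these assumptions the gradient-carrying window is bounded by two segment lengths regardless of total sequence length.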
