SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs

Jiaming Xu; Jiayi Pan; Hanzhen Wang; Yongkang Zhou; Jiancai Ye; Yu Wang; Guohao Dai

arXiv:2512.00722·cs.AI·December 2, 2025

TL;DR

SpeContext is an algorithm and system co-design for long-context reasoning in LLMs. It uses a distilled language model as the retrieval algorithm and pairs it with an asynchronous dataflow and adaptive memory management, improving throughput by up to 24.89x with negligible accuracy loss.

Contribution

The paper presents SpeContext, a new algorithm and system that enhances long-context reasoning in LLMs through lightweight retrieval, asynchronous dataflow, and adaptive memory management.
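The asynchronous dataflow mentioned above can be pictured as a simple double-buffered loop: while step i is being computed, the data for step i+1 is already being fetched in the background. The sketch below is only an illustration of that overlap pattern, not the paper's implementation. The names `load_kv`, `compute`, and `run_pipeline` are hypothetical, and the `time.sleep` calls stand in for a slow KV cache transfer and for model computation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_kv(step):
    """Hypothetical loader: pretend to fetch the KV entries needed at `step`."""
    time.sleep(0.01)  # stand-in for a slow memory transfer
    return f"kv[{step}]"


def compute(step, kv):
    """Hypothetical compute stage: pretend to run the model on the fetched KV."""
    time.sleep(0.01)  # stand-in for model computation
    return f"out[{step}] using {kv}"


def run_pipeline(n_steps):
    """Overlap the fetch for step i+1 with the computation of step i."""
    outputs = []
    if n_steps <= 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_kv, 0)  # prefetch for the first step
        for step in range(n_steps):
            kv = pending.result()  # wait only if the prefetch is not done yet
            if step + 1 < n_steps:
                # Start the next fetch before computing the current step.
                pending = pool.submit(load_kv, step + 1)
            outputs.append(compute(step, kv))
    return outputs


if __name__ == "__main__":
    for line in run_pipeline(3):
        print(line)
```

With a single background worker, each fetch runs concurrently with the preceding computation, so the loop pays roughly the larger of the two costs per step instead of their sum.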

Findings

1. Achieves up to 24.89x throughput improvement in cloud environments.

2. Realizes a 10.06x speedup on edge devices.

3. Incurs negligible accuracy loss while significantly increasing efficiency.

Abstract

In this paper, we point out that the objective of retrieval algorithms is to align with the LLM, which is similar to the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between the distilled language model (DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on this insight, we present SpeContext, an algorithm and system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes a lightweight retrieval head based on the head-level attention weights of the DLM, achieving a parameter reduction of more than 90% by pruning the redundancy. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via the elastic loading strategy, effectively overlapping KV cache retrieval with the LLM computation. (3) At the…
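To make item (1) concrete, the following is a minimal sketch of how head-level attention weights from a small model could be used to pick which context tokens to keep. It is not the authors' retrieval head. The function name `select_tokens`, the max-over-heads aggregation, and the fixed token `budget` are all assumptions made for illustration, and the toy vectors stand in for the query and key projections of a distilled model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query_heads, key_heads, budget):
    """Return the sorted indices of the `budget` most-attended context tokens.

    query_heads: one query vector per attention head.
    key_heads:   per head, one key vector per context token.
    Aggregating with a max over heads is an assumption of this sketch.
    """
    n_tokens = len(key_heads[0])
    importance = [0.0] * n_tokens
    for query, keys in zip(query_heads, key_heads):
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)  # attention weights of this head
        for i, w in enumerate(weights):
            importance[i] = max(importance[i], w)
    ranked = sorted(range(n_tokens), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Two heads, four context tokens, head dimension 2 (toy values).
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [
        [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # head 0 favors token 0
        [[0.0, 0.0], [0.0, 0.0], [0.0, 5.0], [0.0, 0.0]],  # head 1 favors token 2
    ]
    print(select_tokens(queries, keys, budget=2))
```

In this toy input each head attends strongly to a different token, so a budget of 2 keeps tokens 0 and 2. Only the KV cache entries for the selected indices would then need to be loaded for the full model.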

Taxonomy

Topics: Information Retrieval and Search Behavior · Natural Language Processing Techniques · Big Data and Digital Economy