Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
Mo Li, L.H. Xu, Qitai Tan, Long Ma, Ting Cao, Yunxin Liu

TL;DR
Sculptor introduces a framework that enables LLMs to actively manage their internal context using tools, significantly improving reasoning and performance on long-context tasks without additional training.
Contribution
The paper presents Sculptor, a novel approach empowering LLMs with active context management tools and a dynamic RL training method to enhance long-context reasoning capabilities.
Findings
Significant performance improvements on long-context benchmarks.
Effective mitigation of proactive interference in LLMs.
Demonstrated benefits without additional training.
Abstract
Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) precise search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on diverse long-context benchmarks demonstrates that…
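To make the three tool categories in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the class `ToyContext`, its method names, the fixed-size character chunking, and the exact-substring search are all assumptions chosen for illustration, and the summary text is supplied by the caller rather than generated by a model.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One slice of the context. All field names are hypothetical."""
    frag_id: int
    text: str
    summary: str = ""
    hidden: bool = False


@dataclass
class ToyContext:
    """Toy stand-in for a managed context; not the Sculptor API."""
    fragments: list = field(default_factory=list)

    def fragment(self, text: str, size: int) -> None:
        """Context fragmentation: split text into fixed-size character chunks."""
        for start in range(0, len(text), size):
            chunk = text[start:start + size]
            self.fragments.append(Fragment(len(self.fragments), chunk))

    def summarize_and_hide(self, frag_id: int, summary: str) -> None:
        """Summary + hide: show a short summary in place of the full text."""
        frag = self.fragments[frag_id]
        frag.summary = summary
        frag.hidden = True

    def restore(self, frag_id: int) -> None:
        """Restore: make the full text of a hidden fragment visible again."""
        self.fragments[frag_id].hidden = False

    def search(self, query: str) -> list:
        """Precise search: exact substring match over all fragments, hidden or not."""
        return [f.frag_id for f in self.fragments if query in f.text]

    def render(self) -> str:
        """Build the text a model would see: summaries for hidden fragments."""
        parts = []
        for f in self.fragments:
            parts.append(f"[{f.frag_id}: {f.summary}]" if f.hidden else f.text)
        return "\n".join(parts)


if __name__ == "__main__":
    ctx = ToyContext()
    ctx.fragment("alpha beta gamma delta", size=11)
    ctx.summarize_and_hide(0, "greek letters a-b")
    print(ctx.render())
    print(ctx.search("alpha"))
```

In this sketch hiding is reversible because the original text is kept alongside the summary, and search still reaches hidden fragments so that a fragment can be located and restored later. Whether the actual framework behaves this way is not stated in the text above.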
Peer Reviews
Decision: ICLR 2026 Poster
The paper considers the setting of actively managing working memory through agent tool calls, as opposed to adding everything into context (e.g., RAG). The proposed RL technique improves the model's usage of the memory-sculpting tools.
The primary weakness of this paper is the severe lack of contextualization within the broader field of agent memory and context management. The authors' attempt to frame their contribution as "internal working memory" as distinct from "external memory systems" feels artificial and allows them to avoid engaging with a vast and highly relevant body of work. Moving the important related works into the Appendix also seems like a malicious attempt to evade the comparison with the many existing memory
- The paper articulates a clear and compelling motivation, identifying proactive interference as an underexplored yet fundamental bottleneck in long-context LLMs, and proposes active context management as a complementary direction to architectural scaling or external-memory augmentation.
- The Sculptor framework is carefully designed and well-documented, with six context management tools whose functionalities are grounded in cognitive principles (e.g., selective attention, reversible memory) and
- The reinforcement learning formulation remains under-analyzed: outcome-based rewards for long-horizon, agentic tasks such as active context management are known to suffer from credit assignment and instability issues, yet the paper provides no empirical or diagnostic analysis of training stability.
- Comparisons with well-known related work, such as MemGPT and MemoryLLM, appear to be missing.
- Despite advocating for autonomous context management, the framework still relies heav
1. The paper provides another direction for addressing the long-context performance drop in LLMs, without external memory augmentation.
2. The proposed tools are effective with simple prompt engineering, and their efficacy is maximized by the RL-based training process with a novel incremental loss assignment mechanism.
3. The context length can be dramatically shortened, so inference efficiency can be greatly improved while achieving even better performance.
4. The impact on
1. The biggest concern is whether the proposed methodology is generally effective for other open-source LLMs, such as the Qwen and Llama series. Table 3 in the appendix shows the base model comparison, which contains two versions of Qwen3 models. Why can't the authors test their method on these two models? I notice that M3 is especially good on tool-use tasks, while not as strong on other tasks compared with Qwen3. Does that mean the proposed method can only be applied by LLMs specialized in t
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis
