Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture

Jiarun Liu; Shiyue Xu; Yang Li; Shangkun Liu; Yongli Yu; Peng Cao

arXiv:2512.11303·cs.CL·December 15, 2025

Open Access · 3 Reviews

TL;DR

This paper presents SMITH, a cognitive architecture that unifies dynamic tool creation and cross-task experience sharing using hierarchical memory, significantly improving adaptive capabilities of language model agents.

Contribution

Introduces SMITH, a novel framework integrating tool creation and experience sharing via hierarchical memory, formalized as iterative code generation and episodic retrieval, with demonstrated superior performance.

Findings

01

Achieves 81.8% Pass@1 accuracy on GAIA benchmark.

02

Outperforms state-of-the-art methods Alita and Memento.

03

Demonstrates effective capability expansion and experience reuse.

Abstract

Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with…
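The three-part memory organization and the episodic retrieval step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `Episode` and `AgentMemory`, the `retrieve` method, and the word-overlap (Jaccard) similarity are all assumptions chosen for illustration, since the abstract does not specify how retrieval is scored.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One past task execution (hypothetical structure)."""
    task: str          # natural-language task description
    trajectory: list   # names of the steps or tools that were used
    success: bool      # whether the run solved the task


@dataclass
class AgentMemory:
    """Hypothetical container for the three memory components."""
    procedural: dict = field(default_factory=dict)  # tool name -> callable
    semantic: dict = field(default_factory=dict)    # key -> general knowledge
    episodic: list = field(default_factory=list)    # list of Episode records

    def retrieve(self, query: str, k: int = 1) -> list:
        """Return up to k successful episodes ranked by word overlap.

        Jaccard similarity over lowercase tokens is an assumed stand-in
        for whatever similarity measure the paper actually uses.
        """
        q = set(query.lower().split())

        def jaccard(ep: Episode) -> float:
            t = set(ep.task.lower().split())
            union = q | t
            return len(q & t) / len(union) if union else 0.0

        successes = [ep for ep in self.episodic if ep.success]
        return sorted(successes, key=jaccard, reverse=True)[:k]


memory = AgentMemory()
memory.episodic.append(
    Episode("convert a pdf table to csv", ["parse_pdf", "write_csv"], True))
memory.episodic.append(
    Episode("find the population of paris", ["web_search"], True))
memory.episodic.append(
    Episode("convert a pdf table to json", ["parse_pdf"], False))

best = memory.retrieve("extract a table from a pdf to csv")
print(best[0].task)
```

Only successful episodes are candidates here, which mirrors the abstract's emphasis on preserving successful execution patterns; a real system would more likely rank episodes by embedding similarity than by token overlap.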

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

SMITH achieves state-of-the-art performance on GAIA with substantial improvements: +6.6% over Alita (best tool creation approach) and +10.9% over Memento. The nested loop architecture (inner developer-tester loop + outer planner loop) with sandbox execution and multi-path sampling is well-engineered.
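The nested loop the reviewer mentions can be sketched roughly as follows. This is an illustrative reading only, not the paper's code: `develop_tool`, `solve`, and the scripted `propose` and `check` callables are hypothetical names, and plain `exec` into a fresh dictionary merely stands in for a sandbox (it provides no isolation). Multi-path sampling is omitted.

```python
def develop_tool(propose, check, max_rounds=3):
    """Inner developer-tester loop (hypothetical sketch).

    propose(feedback) returns Python source; check(namespace) raises on
    failure. Returns (namespace, rounds_used), or (None, max_rounds) if
    no candidate passed.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        source = propose(feedback)
        namespace = {}
        try:
            # Stand-in for sandboxed execution; NOT actually isolated.
            exec(source, namespace)
            check(namespace)
            return namespace, round_no
        except Exception as err:
            # The tester's error message becomes the developer's feedback.
            feedback = f"{type(err).__name__}: {err}"
    return None, max_rounds


def solve(plan, propose_for, check_for):
    """Outer planner loop: build one tool per planned step."""
    rounds_used = {}
    for step in plan:
        tool, rounds = develop_tool(
            lambda fb: propose_for(step, fb),
            lambda ns: check_for(step, ns),
        )
        rounds_used[step] = rounds if tool is not None else None
    return rounds_used


# Scripted "developer": the first attempt is buggy, the second is fixed.
attempts = ["def add(a, b): return a - b", "def add(a, b): return a + b"]


def propose(step, feedback):
    return attempts[0] if feedback is None else attempts[1]


def check(step, namespace):
    assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"


result = solve(["sum two numbers"], propose, check)
print(result)
```

In this toy run the tester rejects the first candidate and the second one passes, so the step records two rounds. A real developer would be a language model call conditioned on the feedback string rather than a scripted list.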

Weaknesses

The paper evaluates only on GAIA (165 validation tasks, 300 test tasks), which is insufficient to demonstrate generalization. I suggest evaluating on at least 2-3 additional, diverse agentic benchmarks (e.g., WebArena, SWE-bench, HotPotQA, InterCode); the authors should also test cross-domain transfer more rigorously. Several related works are also missing, e.g., Agent Workflow Memory and ToolGen.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper introduces a novel architecture or framework that is formally defined and empirically implemented to achieve SOTA on an established agentic benchmark
- The components in the framework are themselves intuitive and their combination is reasonable
- The analysis is comprehensive and informative

Weaknesses

- As the paper proposes a rather complex architecture with many components, its readability would greatly improve with a more focused presentation. To name a few issues:
  - Exactly what are the key contributions *in contrast to previous work*? As I understand, it is a 3+1 novelty in the pipeline (dynamic tool creations, experience sharing, and hierarchical memory + curriculum learning) but this is not sufficiently spelled out.
  - I find Figure 1 uninformative as it doesn't pertain to the abo

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper proposes an agent paradigm based on episodic memory updating and semantic retrieval. The method treats the process of tool creation as part of episodic memory, thereby modeling and handling both tool creation and task trajectories within a unified framework.
2. The proposed approach achieves strong performance on GAIA and demonstrates the necessity of each component through extensive ablation studies.

Weaknesses

1. Writing quality needs improvement.
   - The paper employs a large number of formalized notations, but many symbols are reused with inconsistent meanings, which can easily cause confusion.
   - For example, in lines 146–147, $\mathcal{T}$ denotes the set of tools, but in lines 152–153 it denotes the set of tasks.
   - Similarly, $a$ represents the agent in lines 129–130, yet later in lines 173–174 and beyond it is reused to represent an action.
   - The paper frequently introduces symbols without immediat

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices