Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

TL;DR
This paper introduces AGENT KB, a universal memory system that enables cross-domain experience sharing among heterogeneous AI agents, significantly improving their problem-solving capabilities across various benchmarks.
Contribution
It presents AGENT KB, a novel shared memory infrastructure that facilitates cross-architecture knowledge transfer without retraining, enhancing agent performance across multiple frameworks.
Findings
Up to 18.7 percentage point improvement in pass@3 for smolagents.
4.0 percentage point improvement on SWE-bench pass@1 for OpenHands.
Hybrid retrieval and feedback are crucial for performance gains.
Abstract
AI agent frameworks operate in isolation, forcing agents to rediscover solutions and repeat mistakes across different systems. Despite valuable problem-solving experiences accumulated by frameworks like smolagents, OpenHands, and OWL, this knowledge remains trapped within individual systems, preventing the emergence of collective intelligence. Current memory systems focus on individual agents or framework-specific demonstrations, failing to enable cross-architecture knowledge transfer. We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining. AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs. At inference time, hybrid retrieval operates through two stages: planning seeds agents with cross-domain workflows, while feedback applies targeted diagnostic…
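The abstract names hybrid retrieval without detail, and one of the reviews on this page describes it as BM25 plus semantic similarity. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the mixing weight `alpha`, the max-normalization step, and the bag-of-words stand-in for a learned embedding model are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (lexical signal)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_scores(query, docs):
    # Assumption: word-count vectors stand in for a real embedding model.
    q = Counter(tokenize(query))
    return [cosine(q, Counter(tokenize(d))) for d in docs]


def normalize(xs):
    hi = max(xs)
    return [x / hi if hi > 0 else 0.0 for x in xs]


def hybrid_retrieve(query, docs, alpha=0.5, top_k=2):
    """Rank docs by a weighted sum of normalized lexical and semantic scores."""
    lex = normalize(bm25_scores(query, docs))
    sem = normalize(semantic_scores(query, docs))
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    ranked = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [(docs[i], combined[i]) for i in ranked[:top_k]]


# Hypothetical stored experiences, invented for this example.
experiences = [
    "search the web then verify the source before answering",
    "reproduce the failing test before editing the code",
    "parse the spreadsheet and sum the requested column",
]
results = hybrid_retrieve("fix a failing test in the code", experiences)
```

Here `results[0]` is the stored experience about reproducing the failing test, since it shares the most query terms under both signals. In a real system the semantic side would come from an embedding model, and `alpha` would need tuning.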
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The authors built a memory system that demonstrates performance improvements on evals across different model families, frameworks, and eval types. Using a cross-framework memory store is new and builds on prior memory stores that focus more on a specific framework. By collecting trajectories across multiple frameworks, the overall system can benefit from the diversity of those frameworks. Separating planning from feedback is also relatively new. Evaluation across a number of domains i
This paper could be substantially stronger if the improvements from Agent KB-style memory were compared to gains from other memory systems (e.g., classic RAG systems or embedding databases, scratchpads, or other API-based memory stores). This would help establish novelty relative to similar systems and make clearer why this approach is superior and worth pursuing further. The authors evaluate on SWE-Bench from 2023, which is known to have ~50% broken problems and is not the communi
1. The motivation—enabling collective intelligence across different agent frameworks—is timely and well-justified. 2. The paper shows consistent and substantial gains across diverse benchmarks, agent types, and model backbones, with convincing ablation studies to support claims.
1. The framework-agnostic experience representation appears to be the key innovation, yet its implementation details and technical challenges are not clearly explained (after reading the appendices). 2. The disagreement gate for refinement rejection, another key contribution, seems heuristic-driven; it may wrongly reject beneficial refinements (that might make the embedding similarity low) if the initial plan is flawed. 3. While the experiments are extensive, the paper’s readability and structur
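The reviewer's concern about the disagreement gate can be made concrete with a small sketch. It assumes, based only on the reviewers' descriptions on this page, that the gate rejects a refinement when its embedding similarity to the initial plan falls below a threshold. The function names, the threshold of 0.5, and the word-count stand-in for an embedding model are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    # Assumption: word counts stand in for a real embedding model.
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def disagreement_gate(initial_plan, refined_plan, threshold=0.5):
    """Keep the refinement only if it stays close to the initial plan.

    Returns (chosen_plan, similarity, accepted).
    """
    sim = cosine_similarity(embed(initial_plan), embed(refined_plan))
    accepted = sim >= threshold
    return (refined_plan if accepted else initial_plan), sim, accepted


initial = "open the file and run the tests"
minor_edit = "open the file and run the unit tests"
big_change = "search the web for documentation"

plan_a, sim_a, ok_a = disagreement_gate(initial, minor_edit)
plan_b, sim_b, ok_b = disagreement_gate(initial, big_change)
```

With these inputs the small edit passes the gate and the large change is rejected, so `plan_b` falls back to the initial plan. This is the failure mode the reviewer raises: if the initial plan is flawed, a correct but very different refinement would be discarded in the same way.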
The research question is well motivated, targeting three challenges: representation heterogeneity, context mismatch, and knowledge interference. This is an impactful problem. Strong quantitative empirical results, with consistent improvements over prior memory-based systems such as A-MEM.
1. The ability to abstract and distill the "heterogeneous agent trajectories into structured experience units" is a key concept that the pipeline relies on, and it is implemented by "few-shot prompting (10-15 human-curated exemplars per domain)" and "standardized action vocabularies". This seems somewhat brittle, and the claim that the method is seamless is left unjustified. The reliance on "standardized action vocabularies" seems to hide a great deal of complexity. Does this mean the system can only integrat
1. This paper addresses an important and timely research problem: how to manage agent memory effectively for complicated agentic tasks. 2. The performance reported on GAIA and SWE-bench is promising and encouraging. 3. The system seems plug-and-play: it can be used by different scaffolds and shows promising results.
1. The novelty is limited and very engineering-heavy: - Reason-Retrieve-Refine is largely borrowed from the case-based reasoning literature - hybrid retrieval (BM25 + semantic) is standard practice, and few-shot experience generation uses straightforward prompting - The disagreement threshold is based on embeddings, which is very common in RAG work, and it is not clear how this threshold is set; it may have been tuned on the target data. 2. The experiments are weak: - There are no compar
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · Balanced Selection · GPT-4
