A Benchmark for Procedural Memory Retrieval in Language Agents
Ishant Kohar, Aswanth Krishnan

TL;DR
This paper introduces a benchmark to evaluate procedural memory retrieval in language agents, highlighting the limitations of current embedding methods and demonstrating the effectiveness of LLM-generated abstractions for cross-context transfer.
Contribution
The paper provides the first diagnostic framework for distinguishing procedural understanding from surface memorization, and uses it to evaluate retrieval methods on a new benchmark built on ALFWorld.
Findings
Embedding methods perform well on familiar contexts but poorly on novel ones.
LLM-generated procedural abstractions transfer reliably across different contexts.
Scaling the corpus improves retrieval performance more than enriching the representations does.
Abstract
Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies -- a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM-generated procedural abstractions demonstrate reliable cross-context transfer. Controlled ablations show that although embeddings capture some lexical-level abstraction, they fundamentally treat…
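The stratified evaluation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy corpus, the queries, the procedure labels, and the bag-of-words cosine scorer are all hypothetical stand-ins (the paper evaluates six retrieval methods on ALFWorld trajectories). It only shows the shape of the measurement: retrieve the closest stored procedure for each query, then report recall@1 separately for queries that reuse corpus vocabulary ("seen") and queries that name unfamiliar objects ("unseen").

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector; a hypothetical stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy corpus: (procedure label, trajectory text).
CORPUS = [
    ("heat_then_place", "take mug heat mug in microwave put mug on table"),
    ("clean_then_place", "take plate clean plate in sink put plate on shelf"),
    ("cool_then_place", "take apple cool apple in fridge put apple on counter"),
]

# Hypothetical queries: (stratum, expected procedure label, query text).
QUERIES = [
    ("seen", "heat_then_place", "heat mug and put mug on table"),
    ("seen", "clean_then_place", "clean plate and put plate on shelf"),
    ("unseen", "heat_then_place", "warm potato and set potato on desk"),
    ("unseen", "clean_then_place", "rinse fork and set fork on drawer"),
]

def retrieve(query, corpus):
    """Return the label of the corpus entry most similar to the query."""
    q = bow(query)
    return max(corpus, key=lambda item: cosine(q, bow(item[1])))[0]

def recall_at_1_by_stratum(queries, corpus):
    """Fraction of queries whose top-1 result has the expected label, per stratum."""
    hits, totals = Counter(), Counter()
    for stratum, label, text in queries:
        totals[stratum] += 1
        hits[stratum] += int(retrieve(text, corpus) == label)
    return {s: hits[s] / totals[s] for s in totals}

if __name__ == "__main__":
    print(recall_at_1_by_stratum(QUERIES, CORPUS))
```

Reporting the two strata separately, rather than one pooled score, is what makes a gap between familiar and novel contexts visible. In this toy setup the "unseen" queries share almost no tokens with the corpus, so a purely lexical scorer has little to rank on; swapping `bow` for a different representation leaves the evaluation loop unchanged.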
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Neurobiology of Language and Bilingualism
