A Benchmark for Procedural Memory Retrieval in Language Agents

Ishant Kohar; Aswanth Krishnan

arXiv:2511.21730 · cs.CL · December 1, 2025

Open Access

TL;DR

This paper introduces a benchmark to evaluate procedural memory retrieval in language agents, highlighting the limitations of current embedding methods and demonstrating the effectiveness of LLM-generated abstractions for cross-context transfer.

Contribution

The paper provides the first diagnostic framework for distinguishing procedural understanding from surface memorization, and evaluates six retrieval methods on a new benchmark built on ALFWorld.

Findings

01. Embedding-based methods perform strongly on familiar contexts but degrade considerably on novel ones.

02. LLM-generated procedural abstractions transfer reliably across contexts.

03. Scaling the corpus yields larger retrieval gains than representation enrichment.

Abstract

Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies -- a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM-generated procedural abstractions demonstrate reliable cross-context transfer. Controlled ablations show that although embeddings capture some lexical-level abstraction, they fundamentally treat…
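The stratified evaluation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-trajectory corpus, the procedure labels, the query strata, the bag-of-words cosine retriever (a stand-in for a surface-matching method, not one of the paper's six methods), and the recall@k metric are all hypothetical and are not taken from the paper. The toy numbers say nothing about the paper's actual results; they only show how a retriever that keys on object names can be scored separately on familiar and novel object instantiations.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (procedure label, trajectory text).
CORPUS = [
    ("heat_then_place", "take mug heat mug in microwave put mug on table"),
    ("clean_then_place", "take plate clean plate in sink put plate on shelf"),
    ("cool_then_place", "take apple cool apple in fridge put apple on counter"),
]

# Hypothetical stratified queries: (stratum, gold procedure, query text).
# "familiar" reuses the object seen with that procedure in the corpus;
# "novel" pairs the procedure with an object the corpus shows elsewhere.
QUERIES = [
    ("familiar", "heat_then_place", "heat mug and put it on table"),
    ("familiar", "clean_then_place", "clean plate and put it on shelf"),
    ("novel", "heat_then_place", "heat plate and put it on shelf"),
    ("novel", "clean_then_place", "clean apple and put it on counter"),
]


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=1):
    """Return the labels of the k trajectories most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda item: cosine(q, bow(item[1])), reverse=True)
    return [label for label, _ in ranked[:k]]


def recall_at_k_by_stratum(queries, corpus, k=1):
    """Fraction of queries per stratum whose gold procedure is in the top k."""
    hits, totals = Counter(), Counter()
    for stratum, gold, text in queries:
        totals[stratum] += 1
        if gold in retrieve(text, corpus, k):
            hits[stratum] += 1
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


if __name__ == "__main__":
    print(recall_at_k_by_stratum(QUERIES, CORPUS, k=1))
```

On this toy data the lexical retriever finds the right procedure for both familiar queries and is pulled toward the object name on both novel ones. Reporting the metric per stratum, rather than as one pooled number, is what makes such a gap visible.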

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Neurobiology of Language and Bilingualism