VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Gabriel Sarch; Lawrence Jang; Michael J. Tarr; William W. Cohen; Kenneth Marino; Katerina Fragkiadaki

arXiv:2406.14596 · cs.CV · September 19, 2025

Open Access

TL;DR

This paper introduces ICAL, a method enabling vision-language models to self-reflect and refine their experiences into high-quality, generalized strategies, significantly enhancing task performance with less human feedback.

Contribution

ICAL is a novel framework that allows VLM agents to abstract and improve their trajectories through self-reflection and iterative human feedback, leading to better decision-making and reduced manual effort.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Improves task success rates significantly over raw demonstrations.

03. Scales more efficiently, requiring less human feedback.

Abstract

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot learning but require high-quality demonstrations. We propose In-Context Abstraction Learning (ICAL), enabling VLM agents to transform suboptimal trajectories into high-quality training data through self-reflection and human feedback. Given imperfect task demonstrations, a VLM abstracts trajectories into generalized strategies and action annotations by correcting inefficiencies and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These annotations are iteratively refined through human feedback during execution in similar environments. The resulting examples significantly improve decision-making when used for retrieval-augmented generation or fine-tuning. As the agent's example library grows, it becomes more…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation

MethodsLib