ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
Yanbin Hu, Jin Cui, Jiayi Lu, Ruixuan Yang, Jun Ye, Boran Zhao, Xingyu Chen, Xuguang Lan, and Pengju Ren

TL;DR
ECHO introduces a hierarchical memory framework using hyperbolic autoencoders to improve experience retrieval and generalization in vision-language-action models for manipulation tasks.
Contribution
The paper presents a hierarchical memory system that organizes experience in a continuous hierarchical space, improving long-horizon task performance and generalization in VLA models.
Findings
Achieved a 12.8% improvement in success rate on LIBERO-Long tasks.
Enhanced compositional generalization on unseen long-horizon tasks.
Demonstrated effectiveness in real-world experiments.
Abstract
Memory capacity is a critical factor determining the performance of Vision-Language-Action (VLA) models in long-horizon manipulation tasks. Existing memory-augmented architectures primarily rely on linear or flat storage, lacking structural priors for manipulation categories and hierarchical organization. This deficiency hinders efficient experience retrieval and limits generalization to unseen long-horizon task compositions. Inspired by the hierarchical organization of human experience, we propose ECHO (Experience Consolidation and Hierarchical Organization), a novel memory framework operating within a Continuous Hierarchical Space. By employing a hyperbolic autoencoder, ECHO maps VLA hidden states into this space. Leveraging hyperbolic metrics and entailment constraint mechanisms, experience vectors are organized into a semantic memory tree that supports efficient top-down retrieval.…
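The abstract's idea of organizing experience vectors with a hyperbolic metric and retrieving them top-down can be illustrated with a minimal sketch. It assumes the Poincaré ball model with curvature -1, which the abstract does not specify; the function names, the toy tree, and the greedy nearest-child descent are hypothetical illustrations rather than the authors' implementation, and the learned autoencoder and entailment constraints are omitted.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def exp_map_origin(x):
    """Map a Euclidean vector into the ball via the exponential map at the origin."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0.0:
        return list(x)
    scale = math.tanh(norm) / norm
    return [scale * a for a in x]

def retrieve_top_down(tree, root, query):
    """Greedy descent: at each node, follow the child closest to the query."""
    node = root
    path = [node]
    while tree[node]["children"]:
        node = min(
            tree[node]["children"],
            key=lambda child: poincare_distance(tree[child]["point"], query),
        )
        path.append(node)
    return path

# Hypothetical toy memory tree: general nodes near the origin, specific ones near the boundary.
tree = {
    "root": {"point": [0.0, 0.0], "children": ["pick", "place"]},
    "pick": {"point": [0.5, 0.0], "children": ["pick_mug", "pick_bowl"]},
    "place": {"point": [-0.5, 0.0], "children": []},
    "pick_mug": {"point": [0.8, 0.1], "children": []},
    "pick_bowl": {"point": [0.8, -0.1], "children": []},
}

# Stand-in for an encoded hidden state; a real system would use a learned encoder.
query = exp_map_origin([1.2, 0.3])
path = retrieve_top_down(tree, "root", query)
print(path)
```

In this toy layout, distance from the origin acts as a rough proxy for specificity, so the descent visits one branch per level instead of comparing the query against every stored vector.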
