Key-Gram: Extensible World Knowledge for Embodied Manipulation
Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng

TL;DR
Key-Gram introduces an external memory module that separates language knowledge from visual reasoning, enhancing embodied manipulation tasks by improving compositional grounding and transfer capabilities.
Contribution
Key-Gram presents an external memory framework that decouples language-derived knowledge from visual reasoning, enabling extensible and efficient instruction understanding in embodied control.
Findings
Consistently improves performance across multiple manipulation benchmarks.
Achieves significant relative gains in success rates without fine-tuning.
Demonstrates effective transfer and real-world manipulation improvements.
Abstract
Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone…
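The memory path described in the abstract (decompose the instruction, look up static entries by deterministic hash, gate them into a hidden state) can be illustrated with a minimal Python sketch. This is not the authors' implementation: word n-grams stand in for the task-specific key-grams, the table is random rather than learned, the gate is a single scalar sigmoid, and the convolutional fusion step is omitted. All names and sizes (`TABLE_SIZE`, `DIM`, `key_grams`, `slot_index`, `retrieve`, `gated_inject`) are hypothetical.

```python
import hashlib
import math
import random

TABLE_SIZE = 1024  # hypothetical number of memory slots
DIM = 8            # hypothetical embedding width


def key_grams(instruction, max_n=2):
    """Split an instruction into word n-grams (a stand-in for key-grams)."""
    tokens = instruction.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def slot_index(gram, table_size=TABLE_SIZE):
    """Deterministic hash: the same gram maps to the same slot on every run."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def make_table(seed=0):
    """Random table used here only as a placeholder for stored priors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TABLE_SIZE)]


def retrieve(instruction, table):
    """Look up one row per gram and mean-pool the rows into one vector."""
    rows = [table[slot_index(g, len(table))] for g in key_grams(instruction)]
    if not rows:
        return [0.0] * DIM
    return [sum(col) / len(rows) for col in zip(*rows)]


def gated_inject(hidden, memory):
    """Add the memory vector to a hidden state, scaled by a context-dependent gate."""
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the hidden/memory similarity
    return [h + gate * m for h, m in zip(hidden, memory)], gate


table = make_table()
memory = retrieve("pick up the red cup", table)
fused, gate = gated_inject([0.0] * DIM, memory)
```

Because the lookup depends only on the hash of each gram, entries can be added or replaced in the table without touching the model that consumes `fused`, which is the kind of extensibility the abstract attributes to separating the memory from the backbone.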
