Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
Sangoh Lee, Sangwoo Mo, Wook-Shin Han

TL;DR
This paper introduces Visual Attentive Prompting (VAP), a training-free method that lets frozen vision-language-action (VLA) models manipulate a specific personal object identified from a few reference images, improving success on real-world tasks.
Contribution
VAP provides a novel, training-free visual prompting approach that enables frozen VLAs to perform personalized object manipulation using visual memory and grounding techniques.
Findings
VAP outperforms generic policies in success rate.
VAP improves correct-object manipulation accuracy.
VAP bridges semantic understanding and instance-level control.
Abstract
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two…
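The grounding-and-prompting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the detector, the embedding model, how scores from several reference images are combined, or how the object is highlighted. The sketch assumes the candidate boxes and their embeddings have already been produced by some open-vocabulary detector and image encoder, uses cosine similarity with a max over references, and highlights with a translucent color overlay. All function names (`match_personal_object`, `highlight_box`, `rewrite_instruction`) and the toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def match_personal_object(ref_embs, cand_embs):
    """Pick the detected candidate that best matches the reference memory.

    ref_embs:  (num_refs, dim) embeddings of the user's reference images.
    cand_embs: (num_candidates, dim) embeddings of detected object crops.
    Assumption: a candidate's score is its best cosine similarity to any reference.
    """
    refs = l2_normalize(np.asarray(ref_embs, dtype=float))
    cands = l2_normalize(np.asarray(cand_embs, dtype=float))
    sims = cands @ refs.T              # (num_candidates, num_refs)
    scores = sims.max(axis=1)          # best-matching reference per candidate
    best = int(scores.argmax())
    return best, float(scores[best])

def highlight_box(image, box, color=(255, 0, 0), alpha=0.5):
    """Blend a translucent color over box = (x0, y0, x1, y1) in an HxWx3 uint8 image."""
    x0, y0, x1, y1 = box
    out = image.astype(float).copy()
    tint = np.array(color, dtype=float)
    out[y0:y1, x0:x1] = (1 - alpha) * out[y0:y1, x0:x1] + alpha * tint
    return out.round().astype(np.uint8)

def rewrite_instruction(instruction, personal_phrase, prompt_phrase):
    """Replace the personal reference with a phrase that points at the highlight."""
    return instruction.replace(personal_phrase, prompt_phrase)

# Toy data standing in for real detector and encoder outputs (hypothetical values).
refs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
cands = np.array([[0.0, 1.0, 0.0], [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]])
boxes = [(0, 0, 2, 2), (2, 2, 6, 6), (6, 6, 8, 8)]

idx, score = match_personal_object(refs, cands)
scene = np.zeros((8, 8, 3), dtype=np.uint8)
prompted = highlight_box(scene, boxes[idx])
instruction = rewrite_instruction("bring my cup", "my cup", "the red-highlighted cup")
```

The highlighted image and the rewritten instruction would then be passed to the frozen VLA in place of the original observation and command; no model weights are updated, which is what makes the approach training-free.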
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
