Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Sangoh Lee; Sangwoo Mo; Wook-Shin Han

arXiv:2512.20014 · cs.RO · January 30, 2026

TL;DR

This paper introduces Visual Attentive Prompting (VAP), a training-free method that lets frozen vision-language-action (VLA) models manipulate a user's personal object, specified by only a few reference images, improving success on real-world tasks.

Contribution

VAP is a training-free visual prompting approach that enables frozen VLAs to manipulate personal objects: it treats the reference images as a non-parametric visual memory, grounds the object in the scene, and injects that grounding as a visual prompt.

Findings

01 VAP outperforms generic policies in success rate.

02 VAP improves correct-object manipulation accuracy.

03 VAP bridges semantic understanding and instance-level control.

Abstract

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two…
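The three steps the abstract names (reference images as a visual memory, embedding-based matching against detected candidates, and prompt injection by highlighting plus instruction rewriting) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the open-vocabulary detector and the image encoder have already run, so candidate boxes and embeddings arrive as plain arrays. The function names, the cosine-similarity matching rule, the `threshold` value, and the red-tint highlight are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def ground_personal_object(memory_embs, candidate_embs, candidate_boxes, threshold=0.5):
    """Pick the detected candidate that best matches the reference memory.

    memory_embs:     (num_refs, dim) embeddings of the reference images
    candidate_embs:  (num_candidates, dim) embeddings of detected crops
    candidate_boxes: list of (x0, y0, x1, y1) boxes, one per candidate
    Returns (box, score), or None if no candidate reaches the threshold.
    """
    sims = cosine_similarity(candidate_embs, memory_embs)
    scores = sims.max(axis=1)  # best-matching reference per candidate
    best = int(scores.argmax())
    if scores[best] < threshold:
        return None
    return candidate_boxes[best], float(scores[best])


def highlight(image, box, color=(255, 0, 0), alpha=0.5):
    """Blend a colour tint over the box region (assumed highlight style)."""
    x0, y0, x1, y1 = box
    out = image.astype(np.float32)  # astype returns a copy
    tint = np.array(color, dtype=np.float32)
    out[y0:y1, x0:x1] = (1 - alpha) * out[y0:y1, x0:x1] + alpha * tint
    return out.astype(np.uint8)


def rewrite_instruction(instruction, personal_phrase, grounded_phrase):
    """Replace the personal reference with one that points at the highlight."""
    return instruction.replace(personal_phrase, grounded_phrase)


# Toy usage with hand-made embeddings standing in for a real encoder.
memory = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
candidates = np.array([[0.0, 1.0, 0.0], [1.0, 0.05, 0.0]])
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]

match = ground_personal_object(memory, candidates, boxes)
if match is not None:
    box, score = match
    prompted_image = highlight(np.zeros((4, 4, 3), dtype=np.uint8), box)
    prompted_text = rewrite_instruction(
        "bring my cup", "my cup", "the red-highlighted cup"
    )
```

Because the VLA stays frozen in this setting, everything above operates only on the policy's inputs: the image and the instruction are modified, and no weights are touched.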

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications