Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models
Boshi An, Chenyu Yang, Robert Katzschmann

TL;DR
This paper adapts a pre-trained vision-language-action model for dexterous human-robot collaboration with minimal language prompting, adding task-aware perception and compact action prediction, and runs in real time (~0.3 s latency on one RTX 4090).
Contribution
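It adds FiLM conditioning to the visual backbones, an auxiliary intent head that predicts collaborator hand pose and target cues, and action-space post-processing (compact position/rotation deltas and PCA-reduced finger joints) to improve dexterous robot collaboration with minimal prompts.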
Findings
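Delta actions are well-behaved, and four principal components explain ~96% of hand-joint variance.
Ablations identify action post-processing as the primary performance driver; the auxiliary intent head helps, FiLM gives mixed results, and a directional motion loss is detrimental.
The real-time system achieves ~0.3 s latency on one RTX 4090 for complex behaviors.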
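To make the hand-joint variance finding concrete, the following is a minimal sketch of PCA-based compression of finger-joint vectors, not the authors' implementation. The 16-joint hand, the synthetic data, and the helper names fit_pca, encode, and decode are assumptions chosen for illustration; only the choice of four components comes from the abstract.

```python
import numpy as np

def fit_pca(joints, k):
    """Fit PCA via SVD on joint vectors of shape (n_samples, n_joints)."""
    mean = joints.mean(axis=0)
    centered = joints - mean
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:k], float(explained[:k].sum())

def encode(joints, mean, components):
    """Project full joint vectors onto the k principal components."""
    return (joints - mean) @ components.T

def decode(codes, mean, components):
    """Map compact codes back to full joint commands."""
    return codes @ components + mean

# Synthetic stand-in data (assumption, not the paper's dataset):
# 16 joints driven by 4 latent synergies plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 16))
joints = latent @ mixing + 0.05 * rng.normal(size=(500, 16))

mean, components, ratio = fit_pca(joints, k=4)
codes = encode(joints, mean, components)    # what a policy would predict
recon = decode(codes, mean, components)     # full joint command
print(f"explained variance with 4 components: {ratio:.3f}")
```

In a setup like this, a policy would predict the 4-dimensional code and the decode step would expand it to the full joint command, which is one plausible reading of "mapping to full commands".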
Abstract
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090)…
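As an illustration of item (i), here is a minimal NumPy sketch of feature-wise linear modulation (FiLM): a conditioning vector is mapped to a per-channel scale and shift that are applied to a visual feature map. The tensor shapes, the linear projections, and the function name film are assumptions for illustration; the abstract does not specify how the conditioning is wired into the backbones.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM to a feature map.

    features: (batch, channels, height, width) visual features
    cond:     (batch, d) conditioning vector, e.g. a task embedding
    """
    gamma = cond @ w_gamma + b_gamma   # (batch, channels) per-channel scale
    beta = cond @ w_beta + b_beta      # (batch, channels) per-channel shift
    # Broadcast scale and shift over the spatial dimensions.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

rng = np.random.default_rng(0)
batch, channels, height, width, d = 2, 8, 4, 4, 16  # hypothetical sizes
features = rng.normal(size=(batch, channels, height, width))
cond = rng.normal(size=(batch, d))

# Randomly initialised projections stand in for learned weights.
w_gamma = 0.1 * rng.normal(size=(d, channels))
w_beta = 0.1 * rng.normal(size=(d, channels))
out = film(features, cond, w_gamma, np.ones(channels), w_beta, np.zeros(channels))

# With zero projections, unit scale bias, and zero shift bias,
# FiLM reduces to the identity.
zeros = np.zeros((d, channels))
identity_out = film(features, cond, zeros, np.ones(channels), zeros, np.zeros(channels))
```

Initialising the scale bias to one and the shift bias to zero, as in the identity check, is a common way to keep a pre-trained backbone's behaviour unchanged at the start of fine-tuning; whether the paper does this is not stated.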
