Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models

Boshi An; Chenyu Yang; Robert Katzschmann

arXiv:2510.25713 · cs.RO · October 30, 2025

TL;DR

This paper adapts a pre-trained vision-language-action model for dexterous human-robot collaboration with minimal language prompting, adding task-aware perception and compact action post-processing, and runs it in a real-time stack at ~0.3 s latency.

Contribution

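It extends a pre-trained VLA model with FiLM conditioning on the visual backbones, an auxiliary intent head that predicts collaborator hand pose and target cues, and action-space post-processing that predicts compact deltas and PCA-reduced finger joints.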
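To make the FiLM conditioning step concrete, here is a minimal NumPy sketch of feature-wise linear modulation in its generic form: a conditioning vector is mapped to a per-channel scale and shift that are applied to a visual feature map. The function name `film`, the tensor shapes, and the linear projections are illustrative assumptions; the page does not say where in the backbone the paper inserts FiLM or what produces the conditioning vector.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation.

    features: (C, H, W) visual feature map (hypothetical shape)
    cond:     (D,) conditioning vector, e.g. a task embedding (assumption)
    Returns gamma * features + beta, with one gamma and beta per channel.
    """
    gamma = w_gamma @ cond + b_gamma  # (C,) per-channel scale
    beta = w_beta @ cond + b_beta     # (C,) per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
features = rng.normal(size=(C, H, W))
cond = rng.normal(size=D)

# Identity initialisation (gamma = 1, beta = 0): the modulated features
# equal the input, so a pre-trained backbone starts out unchanged.
w_gamma, b_gamma = np.zeros((C, D)), np.ones(C)
w_beta, b_beta = np.zeros((C, D)), np.zeros(C)
out = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
```

Starting from the identity is one common design choice when adding modulation layers to a pre-trained network; whether the paper does this is not stated here.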

Findings

01

Delta actions are well-behaved, and four principal components explain ~96% of hand-joint variance.

02

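Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental.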

03

The real-time stack runs at ~0.3 s latency on one RTX 4090.

Abstract

We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090)…
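The action-space post-processing described in the abstract can be sketched as follows: fit PCA on recorded finger-joint vectors, let the model predict a few PCA coefficients plus a position delta, then map both back to a full command. This is a sketch under stated assumptions, not the authors' implementation. The 16-joint hand, the synthetic data, and the helper names `fit_pca`, `encode`, `decode`, and `apply_delta` are all hypothetical, and rotation deltas are omitted for brevity.

```python
import numpy as np

def fit_pca(joints, k):
    """Fit PCA on (N, J) joint vectors; return mean, top-k components,
    and the fraction of variance those k components explain."""
    mean = joints.mean(axis=0)
    _, s, vt = np.linalg.svd(joints - mean, full_matrices=False)
    var = s ** 2
    return mean, vt[:k], var[:k].sum() / var.sum()

def encode(joints, mean, components):
    """Project full joint vectors onto the k principal components."""
    return (joints - mean) @ components.T

def decode(codes, mean, components):
    """Map k PCA coefficients back to full joint commands."""
    return codes @ components + mean

def apply_delta(position, delta_position):
    """Integrate a predicted position delta into an absolute target."""
    return position + delta_position

# Synthetic stand-in for teleoperated hand data: 16 joints (assumption)
# driven by 4 latent synergies plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 16))
joints = latent @ mixing + 0.01 * rng.normal(size=(500, 16))

mean, components, explained = fit_pca(joints, k=4)
codes = encode(joints, mean, components)   # what the model would predict
recon = decode(codes, mean, components)    # full joint commands

new_pos = apply_delta(np.array([0.5, 0.0, 0.3]), np.array([0.01, 0.0, -0.02]))
```

On this synthetic data four components capture nearly all the variance by construction; the ~96% figure in the abstract refers to the authors' real dataset, which this example does not reproduce.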
