ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
Tuan Van Vo, Tan Q. Nguyen, Khang Nguyen, Nhat Xuan Tran, Duy H. M. Nguyen, An T. Le, Ngo Anh Vien, Minh Nhat Vu

TL;DR
ReFineVLA introduces a teacher-guided fine-tuning approach to enhance reasoning and interpretability in vision-language-action (VLA) models for robotic manipulation, outperforming existing methods on the WidowX and Google Robot benchmarks.
Contribution
The paper proposes a framework that incorporates teacher-generated reasoning rationales into the fine-tuning of VLAs, improving their reasoning, interpretability, and generalization in complex manipulation tasks.
Findings
ReFineVLA outperforms existing methods on WidowX and Google Robot benchmarks.
Attention maps show improved focus on relevant visual and linguistic cues.
Reasoning-enriched training enhances multimodal understanding and task success rates.
Abstract
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations and linguistic instructions into desired robotic actions. Despite these advances, VLAs often overlook explicit reasoning and learn only functional input-to-action mappings, omitting crucial logical steps. This shortcoming is especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we use ReFineVLA to fine-tune pre-trained VLAs on the reasoning-enriched datasets, while maintaining the underlying generalization…
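The dataset-augmentation step described in the abstract can be pictured roughly as follows. This is only a minimal sketch, not the authors' pipeline: the `Step` record, its field names, `augment_with_rationales`, and `stub_teacher` are all hypothetical names introduced for illustration, and the stub stands in for whatever expert teacher model the paper actually uses.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One (observation, instruction, action) sample; field names are hypothetical."""
    image_id: str
    instruction: str
    action: Sequence[float]
    rationale: Optional[str] = None  # filled in by the teacher


# Assumption: a teacher is any callable that maps a step to a rationale string.
Teacher = Callable[[Step], str]


def augment_with_rationales(steps: List[Step], teacher: Teacher) -> List[Step]:
    """Return copies of the steps with a teacher-generated rationale attached."""
    return [replace(step, rationale=teacher(step)) for step in steps]


def stub_teacher(step: Step) -> str:
    """Placeholder standing in for an expert teacher model (not a real API)."""
    return f"To '{step.instruction}', move the gripper toward the target object."


dataset = [Step("frame_000", "pick up the spoon", (0.1, 0.0, -0.2))]
enriched = augment_with_rationales(dataset, stub_teacher)
print(enriched[0].rationale)
```

The original samples are left untouched and enriched copies are returned, so the same base dataset could be reused with a different teacher. How the rationales enter the fine-tuning objective is not specified in the truncated abstract and is not modeled here.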
