ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning
Tuan Van Vo, Tan Quang Nguyen, Khang Minh Nguyen, Duy Ho Minh Nguyen, Minh Nhat Vu

TL;DR
ReFineVLA introduces a reasoning-aware fine-tuning framework for vision-language-action models, enhancing their interpretability and performance in robotic manipulation tasks by incorporating expert-generated rationales.
Contribution
It proposes a method that augments robotic datasets with reasoning rationales generated by a teacher model and fine-tunes pre-trained VLA models on them, improving reasoning capabilities and task success rates.
Findings
Achieves a 5.0% higher success rate on manipulation tasks.
Enhances attention focus on relevant objects and actions.
Outperforms state-of-the-art baselines in various settings.
Abstract
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations and linguistic instructions into robotic actions. Despite recent advances, VLAs often overlook explicit reasoning and learn only functional input-action mappings, omitting the logical steps that are crucial for interpretability and generalization in complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we use ReFineVLA to fine-tune pre-trained VLAs with the reasoning-enriched datasets, while maintaining their inherent generalization…
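The dataset-augmentation step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Sample` record, the `augment_with_rationales` and `build_target` helpers, the `toy_teacher` stand-in for the expert teacher model, and the "REASONING / ACTION" target layout. The paper's actual teacher, prompt format, and training objective are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One robot demonstration step (hypothetical schema)."""
    instruction: str
    observation: str          # placeholder for an image or image path
    action: List[float]
    rationale: str = ""       # filled in by the teacher


# A teacher maps (instruction, observation, action) to a text rationale.
Teacher = Callable[[str, str, List[float]], str]


def augment_with_rationales(samples: List[Sample], teacher: Teacher) -> List[Sample]:
    """Return new samples, each carrying a teacher-generated rationale."""
    return [
        Sample(s.instruction, s.observation, s.action,
               teacher(s.instruction, s.observation, s.action))
        for s in samples
    ]


def build_target(sample: Sample) -> str:
    """Serialize rationale then action as one training target (assumed layout)."""
    action_text = " ".join(f"{a:.2f}" for a in sample.action)
    return f"REASONING: {sample.rationale}\nACTION: {action_text}"


def toy_teacher(instruction: str, observation: str, action: List[float]) -> str:
    """Stand-in for an expert teacher model; a real one would be a large VLM."""
    return f"To '{instruction}', move toward the target seen in {observation}."
```

In this sketch, a fine-tuning loop would consume `build_target(sample)` as the supervised output so that the model is trained to produce a rationale alongside the action, rather than the action alone.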
Taxonomy
Topics: Natural Language Processing Techniques · Model-Driven Software Engineering Techniques
