Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim; Chelsea Finn; Percy Liang

arXiv:2502.19645·cs.RO·April 29, 2025

TL;DR

This paper presents an Optimized Fine-Tuning (OFT) recipe for vision-language-action models that improves their inference efficiency, task success rates, and flexibility in robotic tasks, with state-of-the-art results in simulation and real-world experiments.

Contribution

The paper introduces a fine-tuning recipe for VLAs that combines parallel decoding, action chunking, a continuous action representation, and an L1 regression objective. Applying this recipe to OpenVLA yields OpenVLA-OFT.
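To make the learning objective concrete, here is a minimal sketch of an L1 regression loss computed over a chunk of continuous actions. It is an illustration under assumptions, not the code from the authors' repository: the function name `l1_chunk_loss`, the nested-list layout (K timesteps by D action dimensions), and the toy values are all hypothetical.

```python
from typing import List

# A chunk holds K future timesteps, each with D continuous action values.
Chunk = List[List[float]]


def l1_chunk_loss(predicted: Chunk, target: Chunk) -> float:
    """Mean absolute error over every value in a K x D action chunk.

    Hypothetical helper for illustration; a real training loop would
    compute this on batched tensors.
    """
    if len(predicted) != len(target):
        raise ValueError("chunks must cover the same number of timesteps")
    total = 0.0
    count = 0
    for pred_step, target_step in zip(predicted, target):
        if len(pred_step) != len(target_step):
            raise ValueError("each timestep must have the same action dimension")
        for p, t in zip(pred_step, target_step):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("chunks must not be empty")
    return total / count


# Toy chunk: K = 2 timesteps, D = 2 action dimensions (made-up values).
predicted = [[0.5, -0.25], [0.0, 1.0]]
target = [[0.25, -0.25], [0.5, 0.5]]
loss = l1_chunk_loss(predicted, target)
print(loss)  # 0.3125
```

Because the actions are continuous, the loss compares real-valued predictions directly with the demonstrated actions instead of scoring discrete token bins.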

Findings

01. Achieved 97.1% success rate on LIBERO benchmark, up from 76.5%.

02. Increased action generation throughput by 26 times.

03. Enabled high-frequency dexterous control on a bimanual robot.
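One way to see where a throughput gain of this kind can come from is to count decoder forward passes. The sketch below compares token-by-token decoding (one pass per action dimension) with parallel decoding of a whole action chunk in a single pass. The action dimension, chunk size, and number of actions are assumed values chosen for illustration, and the function names are hypothetical. The pass count is only a rough proxy: it does not reproduce the measured speedup, which also depends on the cost of each pass.

```python
def autoregressive_passes(num_actions: int, action_dim: int) -> int:
    """Token-by-token decoding: one decoder pass per action dimension."""
    return num_actions * action_dim


def parallel_chunk_passes(num_actions: int, chunk_size: int) -> int:
    """Parallel decoding with chunking: one pass emits a full chunk."""
    # Ceiling division, so a partial final chunk still costs one pass.
    return -(-num_actions // chunk_size)


# Assumed values for illustration only.
ACTION_DIM = 7    # e.g. a 7-DoF end-effector action
CHUNK_SIZE = 8    # actions predicted per forward pass
NUM_ACTIONS = 64  # actions needed over some control window

ar_passes = autoregressive_passes(NUM_ACTIONS, ACTION_DIM)
pc_passes = parallel_chunk_passes(NUM_ACTIONS, CHUNK_SIZE)
print(ar_passes, pc_passes)  # 448 8
```

Under these assumptions the chunked, parallel scheme needs far fewer sequential passes for the same number of actions, which is the intuition behind pairing parallel decoding with action chunking.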

Abstract

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy…

Code & Models

Repositories

moojink/openvla-oft
pytorch

Taxonomy

Methods: Balanced Selection