iFlyBot-VLA Technical Report
Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia Wu, Jia Pan

TL;DR
iFlyBot-VLA introduces a large-scale vision-language-action model trained with a novel dual-level action framework, combining latent and structured actions, to improve robotic manipulation understanding and performance.
Contribution
The paper presents a new VLA model with a dual-level action representation and a mixed training strategy, enhancing perception and reasoning in robotic manipulation tasks.
Findings
Outperforms existing methods on the LIBERO Franka benchmark
Achieves high success rates in real-world manipulation tasks
Effectively integrates latent and structured actions for better control
Abstract
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain…
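The mixed training strategy in contribution (3) can be pictured as drawing each training example from one of three pools: robot trajectories, general QA, and spatial QA. The snippet below is a minimal sketch of that idea only. The pool names, the mixture weights, and the `sample_batch_sources` helper are illustrative assumptions; the abstract does not state the actual ratios or the sampling procedure.

```python
import random
from collections import Counter

# Hypothetical mixture weights: the source does not give the real ratios.
MIXTURE = {
    "robot_trajectory": 0.6,
    "general_qa": 0.2,
    "spatial_qa": 0.2,
}


def sample_batch_sources(batch_size, mixture=MIXTURE, seed=0):
    """Choose a data pool for each example in a batch.

    Each example is drawn independently, with probability
    proportional to its pool's mixture weight.
    """
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


# Draw the pool assignment for one batch and tally it.
sources = sample_batch_sources(1000)
counts = Counter(sources)
```

Sampling per example rather than per batch keeps every gradient step exposed to both action data and QA data, which is one plausible way to realize such a mixture; the authors may well schedule it differently.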
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
