iFlyBot-VLA Technical Report

Yuan Zhang; Chenyu Xue; Wenjie Xu; Chao Ji; Jiajia Wu; Jia Pan

arXiv:2511.01914 · cs.CV · November 5, 2025

TL;DR

iFlyBot-VLA introduces a large-scale vision-language-action model trained with a novel dual-level action framework, combining latent and structured actions, to improve robotic manipulation understanding and performance.

Contribution

The paper presents a new VLA model with a dual-level action representation and a mixed training strategy, enhancing perception and reasoning in robotic manipulation tasks.

Findings

01 Outperforms existing methods on LIBERO Franka benchmark

02 Achieves high success rates in real-world manipulation tasks

03 Effectively integrates latent and structured actions for better control

Abstract

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain…
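Contribution (3) above, the mixed training strategy, can be illustrated with a small sampler that decides which data source each training example in a batch is drawn from. This is only a sketch under stated assumptions: the abstract names the three sources (robot trajectory data, general QA, spatial QA) but gives no mixing ratios, so the weights below, along with the names `MIX_WEIGHTS`, `sample_batch_sources`, and `count_sources`, are hypothetical and not taken from the paper.

```python
import random

# Hypothetical mixing weights: the abstract does not state the actual ratios.
MIX_WEIGHTS = {
    "robot_trajectory": 0.6,
    "general_qa": 0.2,
    "spatial_qa": 0.2,
}


def sample_batch_sources(batch_size, weights=MIX_WEIGHTS, seed=None):
    """Pick a data source for each example in a batch, in proportion to `weights`."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=batch_size)


def count_sources(sources):
    """Count how many examples were drawn from each source."""
    counts = {}
    for source in sources:
        counts[source] = counts.get(source, 0) + 1
    return counts


if __name__ == "__main__":
    batch = sample_batch_sources(16, seed=0)
    print(count_sources(batch))
```

Sampling per example rather than per batch means every batch can mix action supervision with QA supervision, which is one plausible way to keep the VLM backbone's general and spatial reasoning in play during robot-data training. Whether the authors mix at the example, batch, or stage level is not stated in the abstract.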

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition