iFlyBot-VLM Technical Report
Xin Nie, Zhiyuan Cheng, Yuan Zhang, Chao Ji, Jiajia Wu, Yuhan Zhang, Jia Pan

TL;DR
iFlyBot-VLM is a versatile vision-language model designed to enhance embodied intelligence in robots by bridging perception and action through a transferable operational language, enabling diverse robotic tasks and coordination.
Contribution
The paper introduces iFlyBot-VLM, a general-purpose VLM that abstracts complex visual and spatial information into a body-agnostic, transferable Operational Language for embodied AI.
Findings
Achieved optimal performance on multiple embodied intelligence benchmarks.
Demonstrated scalable and generalizable capabilities across robotic platforms.
Enabled seamless perception-action coordination in diverse tasks.
Abstract
We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) designed to advance the domain of Embodied Intelligence. The central objective of iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robotic motion control. To this end, the model abstracts complex visual and spatial information into a body-agnostic and transferable Operational Language, thereby enabling seamless closed-loop perception-action coordination across diverse robotic platforms. The architecture of iFlyBot-VLM is systematically designed to realize four key functional capabilities essential for embodied intelligence: 1) Spatial Understanding and Metric Reasoning; 2) Interactive Target Grounding; 3) Action Abstraction and Control Parameter Generation; 4) Task Planning and Skill Sequencing. We envision iFlyBot-VLM as a scalable and generalizable…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
