EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, Chenyan Xiong

TL;DR
EmbodiedMidtrain enhances vision-language models for embodied tasks by selectively mid-training on VLA-aligned data, improving downstream robot manipulation performance.
Contribution
This work introduces a data-driven mid-training approach that bridges the gap between VLMs and VLAs, making VLMs a better starting point for embodied applications.
Findings
Mid-training improves VLA performance across three robot manipulation benchmarks.
The data engine effectively identifies VLA-aligned data, enhancing model initialization.
Mid-training benefits are evident from early training stages.
Abstract
Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks…
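The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the proximity estimator, so everything here is an assumption made for illustration: the estimator is modeled as a logistic-regression classifier that separates VLA features from VLM features, the features are synthetic Gaussian vectors standing in for real embeddings, and the function names (`fit_proximity_estimator`, `proximity_scores`, `select_top_k`) are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def fit_proximity_estimator(vla_feats, vlm_feats, lr=0.1, steps=500):
    """Fit a logistic-regression scorer (assumed form): label 1 = VLA, 0 = VLM."""
    X = np.vstack([vla_feats, vlm_feats])
    y = np.concatenate([np.ones(len(vla_feats)), np.zeros(len(vlm_feats))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(sample is VLA-like)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def proximity_scores(feats, w, b):
    """Score each sample by how VLA-like the estimator judges it to be."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))

def select_top_k(pool_feats, w, b, k):
    """Return indices of the k highest-scoring candidates in the VLM pool."""
    scores = proximity_scores(pool_feats, w, b)
    return np.argsort(-scores)[:k]

# Synthetic stand-ins for embeddings (hypothetical data, not from the paper):
# VLA samples form a compact cluster, the VLM pool is broad.
rng = np.random.default_rng(0)
vla = rng.normal(loc=3.0, scale=0.3, size=(200, 8))
vlm = rng.normal(loc=0.0, scale=2.0, size=(2000, 8))

w, b = fit_proximity_estimator(vla, vlm)
selected = select_top_k(vlm, w, b, k=100)  # candidates kept for mid-training
```

In this toy setup the selected VLM samples lie closer to the VLA cluster than the pool does on average, which is the property a mid-training mixture would rely on. A real pipeline would replace the synthetic vectors with features extracted from actual VLM and VLA data.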
