BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

Yucheng Hu; Jianke Zhang; Yuanfei Luo; Yanjiang Guo; Xiaoyu Chen; Xinshu Sun; Kun Feng; Qingzhou Lu; Sheng Chen; Yangang Zhang; Wei Li; Jianyu Chen

arXiv:2602.09849·cs.RO·February 12, 2026

TL;DR

BagelVLA is a unified model that combines linguistic planning, visual forecasting, and action generation to improve how embodied agents perform long-horizon manipulation tasks. It is reported to outperform existing methods.

Contribution

The paper introduces BagelVLA, an integrated framework that uses Residual Flow Guidance for efficient multimodal reasoning and action planning in complex manipulation tasks.

Findings

01. Outperforms baselines on simulated benchmarks

02. Effective in multi-stage reasoning tasks

03. Reduces latency in action generation

Abstract

Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. While recent Vision-Language-Action (VLA) models have leveraged pre-trained foundation models, they typically focus on either linguistic planning or visual forecasting in isolation. These methods rarely integrate both capabilities simultaneously to guide action generation, leading to suboptimal performance in complex, long-horizon manipulation tasks. To bridge this gap, we propose BagelVLA, a unified model that integrates linguistic planning, visual forecasting, and action generation within a single framework. Initialized from a pretrained unified understanding and generative model, BagelVLA is trained to interleave textual reasoning and visual prediction directly into the action execution loop. To efficiently couple…
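The interleaving described in the abstract can be pictured as a control loop in which each step produces text, then a visual forecast, then an action. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `run_interleaved_loop`, `plan_text`, `forecast_visual`, `generate_action`, and `apply_action` are hypothetical, the stand-in functions return strings and fixed vectors instead of model outputs, and the coupling mechanism that the truncated abstract begins to describe is not represented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    subtask: str          # textual reasoning output for this step
    predicted_obs: str    # stand-in for a forecast image
    action: List[float]   # stand-in for a low-level action vector


def run_interleaved_loop(
    instruction: str,
    observation: str,
    plan_text: Callable[[str, str], str],
    forecast_visual: Callable[[str, str], str],
    generate_action: Callable[[str, str, str], List[float]],
    apply_action: Callable[[str, List[float]], str],
    max_steps: int = 3,
) -> List[Step]:
    """Hypothetical loop: text plan -> visual forecast -> action, repeated."""
    trace: List[Step] = []
    for _ in range(max_steps):
        subtask = plan_text(instruction, observation)              # linguistic planning
        predicted = forecast_visual(observation, subtask)          # visual forecasting
        action = generate_action(observation, subtask, predicted)  # action generation
        trace.append(Step(subtask, predicted, action))
        observation = apply_action(observation, action)            # environment step
    return trace


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_plan(instruction: str, obs: str) -> str:
    return f"next subtask for '{instruction}' given {obs}"


def toy_forecast(obs: str, subtask: str) -> str:
    return f"predicted({obs})"


def toy_action(obs: str, subtask: str, predicted: str) -> List[float]:
    return [0.0, 0.0, 0.0]


def toy_env(obs: str, action: List[float]) -> str:
    return obs + "+"


trace = run_interleaved_loop(
    "stack the blocks", "obs0", toy_plan, toy_forecast, toy_action, toy_env
)
for step in trace:
    print(step.subtask, "|", step.predicted_obs, "|", step.action)
```

Passing the three stages in as callables keeps the loop structure separate from any particular model; in a unified model such as the one the abstract describes, all three would presumably be served by the same network.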

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning