Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions

Cunxin Fan; Xiaosong Jia; Yihang Sun; Yixiao Wang; Jianglan Wei; Ziyang Gong; Xiangyu Zhao; Masayoshi Tomizuka; Xue Yang; Junchi Yan; Mingyu Ding

arXiv:2505.02152 · cs.RO · October 9, 2025

TL;DR

Interleave-VLA introduces a novel robot learning paradigm that uses interleaved image-text instructions, significantly improving zero-shot generalization and handling diverse, unseen tasks in real-world manipulation scenarios.

Contribution

Interleave-VLA is the first framework that enables robots to understand interleaved image-text instructions and generate continuous actions. It extends existing vision-language-action models with minimal modifications.

Findings

01. Doubles out-of-domain generalization to unseen objects.
02. Supports flexible, zero-shot task instructions, including sketches.
03. Creates a large-scale real-world interleaved embodied dataset with 210k episodes.

Abstract

The rise of foundation models paves the way for generalist robot policies in the physical world. Existing methods that rely on text-only instructions often struggle to generalize to unseen scenarios. We argue that interleaved image-text inputs offer richer and less biased context, and enable robots to better handle unseen tasks with more versatile human-robot interaction. Building on this insight, we introduce Interleave-VLA, the first robot learning paradigm capable of comprehending interleaved image-text instructions and directly generating continuous action sequences in the physical world. It offers a natural, flexible, and model-agnostic paradigm that extends state-of-the-art vision-language-action (VLA) models with minimal modifications while achieving strong zero-shot generalization. Interleave-VLA also includes an automatic pipeline that converts text instructions from Open…
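To make the idea of an interleaved image-text instruction concrete, here is a minimal sketch of one way such an instruction could be represented and derived from a text-only one. It is not the paper's pipeline, whose details are not given here. The names `Text`, `Image`, and `interleave`, the `crops` mapping, and the plain substring matching are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class Image:
    ref: str  # hypothetical handle, e.g. a path to a cropped object image


Segment = Union[Text, Image]


def interleave(instruction: str, crops: Dict[str, str]) -> List[Segment]:
    """Replace each object phrase found in `instruction` with an image segment.

    `crops` maps an object phrase to an image reference (an assumption of
    this sketch). Phrases are matched as plain substrings, longest first,
    which is an illustrative simplification.
    """
    segments: List[Segment] = [Text(instruction)]
    for phrase in sorted(crops, key=len, reverse=True):
        next_segments: List[Segment] = []
        for seg in segments:
            if isinstance(seg, Image):
                # Image segments are already resolved; keep them untouched.
                next_segments.append(seg)
                continue
            parts = seg.content.split(phrase)
            for i, part in enumerate(parts):
                if part:
                    next_segments.append(Text(part))
                if i < len(parts) - 1:
                    # Each split boundary marks one occurrence of the phrase.
                    next_segments.append(Image(crops[phrase]))
        segments = next_segments
    return segments


example = interleave(
    "put the carrot on the plate",
    {"the carrot": "carrot.png", "the plate": "plate.png"},
)
```

In this sketch, `example` is a four-element sequence: the text "put ", an image segment for the carrot, the text " on ", and an image segment for the plate. A policy that accepts interleaved input would consume such a sequence in order, embedding each segment with the encoder for its modality.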

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems