Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation

Jinkun Liu; Haohan Chi; Lingfeng Zhang; Yifan Xie; YuAn Wang; Long Chen; Hangjun Ye; Xiaoshuai Hao; Wenbo Ding

arXiv:2605.00438·cs.AI·May 4, 2026

TL;DR

This paper introduces IVLR, a novel robot manipulation policy that uses explicit interleaved vision and language reasoning traces to improve long-horizon task planning and execution.

Contribution

It presents a framework in which a single multimodal transformer generates a global semantic-geometric trace, an explicit plan that alternates textual subgoals with visual keyframes, and a closed-loop action decoder conditions on that trace during execution.

Findings

01

Achieves 95.5% success on the LIBERO benchmark, including 92.4% on LIBERO-Long.

02

Both visual and textual modalities are essential for high performance.

03

The trace-based approach shows robustness to execution perturbations and moderate drift.

Abstract

Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision–Language Reasoning (IVLR), a policy framework built around an explicit intermediate representation, the trace, which alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation.…
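
The test-time procedure described in the abstract (generate the trace once, cache it, then decode actions in a closed loop) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`TraceStep`, `run_episode`, `generate_trace`, `decode_action`, `step_env`) is hypothetical, the two models are replaced by toy callables, and observations are stand-in tuples of floats rather than images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: the abstract does not specify any of these types.
Observation = Sequence[float]  # placeholder for a camera image
Action = Sequence[float]       # placeholder for a robot command


@dataclass(frozen=True)
class TraceStep:
    """One element of the interleaved trace: a textual subgoal plus a visual keyframe."""
    subgoal: str
    keyframe: Observation


Trace = Tuple[TraceStep, ...]


def run_episode(
    generate_trace: Callable[[Observation, str], Trace],
    decode_action: Callable[[Trace, str, Observation], Action],
    step_env: Callable[[Action], Tuple[Observation, bool]],
    initial_obs: Observation,
    instruction: str,
    max_steps: int,
) -> Tuple[Trace, List[Action]]:
    # The trace is produced once from the initial observation and the
    # instruction, then reused (cached) for the whole episode.
    trace = generate_trace(initial_obs, instruction)
    obs = initial_obs
    actions: List[Action] = []
    for _ in range(max_steps):
        # Closed loop: each action sees the cached trace, the original
        # instruction, and the current observation.
        action = decode_action(trace, instruction, obs)
        actions.append(action)
        obs, done = step_env(action)
        if done:
            break
    return trace, actions


def demo() -> Tuple[int, int, int]:
    """Toy 1-D task: move a point from 0.0 to 1.0 in steps of at most 0.25."""
    calls = {"generate": 0}
    state = {"x": 0.0}

    def generate_trace(obs: Observation, instruction: str) -> Trace:
        calls["generate"] += 1
        return (
            TraceStep("move halfway", (0.5,)),
            TraceStep("reach the goal", (1.0,)),
        )

    def decode_action(trace: Trace, instruction: str, obs: Observation) -> Action:
        target = trace[-1].keyframe[0]  # toy rule: head for the final keyframe
        return (min(0.25, target - obs[0]),)

    def step_env(action: Action) -> Tuple[Observation, bool]:
        state["x"] += action[0]
        return (state["x"],), state["x"] >= 1.0

    trace, actions = run_episode(
        generate_trace, decode_action, step_env, (0.0,), "move to the goal", max_steps=10
    )
    return calls["generate"], len(trace), len(actions)
```

In this toy run the trace generator is called exactly once, while the decoder is called at every control step with a fresh observation. That separation between a one-shot global plan and per-step closed-loop control is what the sketch is meant to show.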
