Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yi Yang; Xueqi Li; Yiyang Chen; Jin Song; Yihan Wang; Zipeng Xiao; Jiadi Su; You Qiaoben; Pengfei Liu; Zhijie Deng

arXiv:2511.16175 · cs.CV · February 24, 2026

TL;DR

Mantis is a vision-language-action (VLA) model that decouples visual foresight prediction from the backbone, using meta queries and a diffusion Transformer head. The decoupled design, together with training on diverse datasets, aims to improve comprehension, reasoning, and action prediction.

Contribution

The paper proposes a novel Disentangled Visual Foresight (DVF) framework with meta queries and a diffusion Transformer head, enabling better visual state prediction and reasoning in VLA models.

Findings

1. Achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning.

2. Outperforms existing models in instruction-following and generalization.

3. Converges quickly and shows improved reasoning capabilities.

Abstract

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state…
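The decoupling described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the function names, dimensions, mean-pooling "attention", and linear update are all assumptions made for illustration, and the diffusion process of the DiT head is replaced by a single deterministic step. It shows only the data flow that the text names: the backbone emits a small set of meta-query outputs instead of a full visual state, and a separate head combines those outputs with the current visual state through a skip path.

```python
import random


def make_meta_queries(num_queries, dim, seed=0):
    # Hypothetical learnable query vectors; randomly initialised here.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_queries)]


def toy_backbone(tokens, meta_queries):
    # Stand-in for the VLA backbone (assumption): each query "attends" to the
    # input tokens by adding their mean. Only the query positions are returned,
    # so the backbone never has to emit a high-dimensional visual state.
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [[q[i] + mean[i] for i in range(dim)] for q in meta_queries]


def toy_foresight_head(current_state, query_outputs, scale=0.1):
    # Stand-in for the DiT head (assumption, no diffusion here): pool the
    # query outputs into a conditioning vector and add a scaled copy to the
    # current state, so the current state reaches the output via a skip path.
    dim = len(current_state)
    cond = [sum(q[i] for q in query_outputs) / len(query_outputs) for i in range(dim)]
    return [current_state[i] + scale * cond[i] for i in range(dim)]


tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # toy vision/language tokens
current_state = [0.5, 0.5, 0.5, 0.5]                    # toy current visual state
queries = make_meta_queries(num_queries=3, dim=4)
query_outputs = toy_backbone(tokens, queries)
next_state = toy_foresight_head(current_state, query_outputs)
print(len(query_outputs), len(next_state))  # prints "3 4"
```

In this toy setup, setting `scale=0.0` returns the current state unchanged, which illustrates why a skip path leaves the head responsible only for what differs between the current and the predicted state.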

Code & Models

Datasets

Yysrc/mantis_libero_lerobot (dataset, 108 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition