MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models

Zheyuan Zhou; Liang Du; Zixun Sun; Xiaoyu Zhou; Ruimin Ye; Qihao Chen; Yinda Chen; Lemiao Qiu

arXiv:2602.02212·cs.CV·February 3, 2026

Open Access

TL;DR

MAIN-VLA introduces a novel framework that models intention and environment abstractions to improve decision-making efficiency and generalization in complex vision-language-action tasks within open-world environments.

Contribution

It proposes explicit intention and environment abstractions for better semantic grounding and introduces a parameter-free token-pruning method to reduce perceptual redundancy.
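To make the idea of parameter-free token pruning concrete, here is a minimal sketch. It is not the paper's method: the summary does not say how tokens are scored, so the scoring rule (cosine similarity between each visual token and an intention embedding), the function names `cosine` and `prune_tokens`, and the `keep_ratio` parameter are all assumptions made for illustration. The only property it shares with the description above is that it drops visual tokens without introducing any learned weights.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_tokens(
    visual_tokens: Sequence[Sequence[float]],
    intention: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    a token's relevance is its cosine similarity to the intention vector.
    No trainable parameters are involved.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(visual_tokens)
    if n == 0:
        return []
    # Always keep at least one token.
    k = max(1, math.ceil(n * keep_ratio))
    scores = [cosine(tok, intention) for tok in visual_tokens]
    # Rank by descending score, then restore the original token order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(prune_tokens(tokens, intention=[1.0, 0.0], keep_ratio=0.5))
```

With the toy input in the `__main__` block, the two tokens most aligned with the intention vector are kept and the orthogonal and opposite tokens are dropped. Returning indices rather than the tokens themselves lets a caller keep positional information for whatever is retained.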

Findings

01. Achieves state-of-the-art performance in Minecraft and PvP environments.

02. Demonstrates improved decision quality and generalization.

03. Reduces inference complexity without performance loss.

Abstract

Despite significant progress in Vision-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex and dynamic environments that involve unpredictable real-time interactions (such as 3D open worlds and large-scale PvP games). To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Social Robot Interaction and HRI