PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

Wenqi Liang; Gan Sun; Yao He; Jiahua Dong; Suyan Dai; Ivan Laptev; Salman Khan; Yang Cong

arXiv:2511.01571·cs.CV·March 24, 2026

Open Access · 3 Reviews

TL;DR

PixelVLA introduces a novel vision-language-action model that enhances pixel-level scene understanding and multimodal prompting, significantly improving robot manipulation success rates with reduced training costs.

Contribution

It is the first VLA model supporting pixel-level reasoning and multimodal prompts, built on a new instruction tuning framework and a large-scale pixel annotation dataset.

Findings

01

Improves manipulation success rates by up to 28.7%.

02

Requires only 1.5% of the pretraining cost of previous models.

03

Demonstrates enhanced accuracy and versatility in complex environments.

Abstract

Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompt-aware encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper is well organized and easy to follow.
2. According to the results in Tables 2, 3, and 4, adding pixel-level prompting and integrating it as a new modality appears to help increase action control accuracy.

Weaknesses

1. I am confused about the design of the overall architecture. The authors propose a two-stage automated pipeline to obtain pixel-level segmentations, but how are these segmentations used? It is not clear which part of the model uses the segmentation masks.
2. If the segmentation is used as an input to learn the pixel-aware embedding, I am not sure whether the final optimization loss for this visual encoder is action accuracy, which does not seem very relevant. What is the motivation o

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper is clearly written and well-structured, making it easy to follow. The experiments conducted on SimplerEnv and LIBERO are appropriate and demonstrate the effectiveness of the proposed approach. While introducing an additional pixel-level encoder could intuitively downgrade the pretrained VLM, the authors successfully solve this issue by curating a large 160K dataset and applying LoRA fine-tuning.

Weaknesses

The introduction of pixel-level annotations can be viewed as a relatively straightforward extension of prior work on visual prompting and image-level feature adaptation (e.g., TraceVLA, LLaRA, and related approaches). As a result, the paper’s novelty is somewhat limited. Nonetheless, the work offers useful insights and has potential value for the research community, particularly as a good practice in bridging pixel-level understanding with pretrained VLMs for VLAs. Meanwhile, the authors are h

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Meaningful technical contribution in the Pixel-160K dataset and in constructing a VLA that takes advantage of multimodal prompts.
- Experiments test two SOTA VLA architectures, showing that PixelVLA can be built on top of multiple types of VLAs.

Weaknesses

- It is unclear whether the method can transfer to novel environments, given the lack of real-world experiments.
- Analysis of the results is lacking and leaves out failure cases. For example, why does PixelVLA perform so well on LIBERO Long but struggle on the Object/Goal splits (Table 3)? Why would the pixel-level understanding training damage performance on the open/close drawer task (Table 4)?

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics