Pixelis: Reasoning in Pixels, from Seeing to Acting

Yunpeng Zhou

arXiv:2603.25091·cs.CV·March 27, 2026

TL;DR

Pixelis is a pixel-space agent that learns to reason and act directly on images and videos through executable operations, enabling physically grounded visual intelligence and adaptive behavior without external feedback.

Contribution

The paper introduces Pixelis, a pixel-space reasoning framework trained in three phases to improve visual reasoning and action on images and videos.

Findings

01. Achieves +4.08% average gain over baseline on six benchmarks.

02. Produces shorter, more auditable toolchains.

03. Maintains in-corridor KL during test-time learning.

Abstract

Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and…
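To make the phase (1) objective more concrete, the following is a minimal sketch of a token-weighted imitation loss in which operation/argument tokens count more than ordinary reasoning tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `tool_weight` value of 2.0, and the normalization by total weight are all hypothetical, and the auxiliary heads mentioned in the abstract are left out.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def weighted_imitation_loss(logits_seq, targets, is_tool_token, tool_weight=2.0):
    """Weighted negative log-likelihood over a token sequence.

    logits_seq    : per-position logits over the vocabulary
    targets       : target token id at each position
    is_tool_token : True where the target is an operation/argument token
    tool_weight   : hypothetical upweighting factor (assumed, not from the paper)
    """
    total, weight_sum = 0.0, 0.0
    for logits, target, is_tool in zip(logits_seq, targets, is_tool_token):
        weight = tool_weight if is_tool else 1.0
        total += -weight * log_softmax(logits)[target]
        weight_sum += weight
    # Normalizing by the total weight is an assumption of this sketch.
    return total / weight_sum


# Toy trace: position 0 is a plain reasoning token, positions 1-2 are tool tokens.
logits_seq = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 1, 2]
is_tool_token = [False, True, True]

plain = weighted_imitation_loss(logits_seq, targets, is_tool_token, tool_weight=1.0)
weighted = weighted_imitation_loss(logits_seq, targets, is_tool_token, tool_weight=2.0)
print(f"plain NLL: {plain:.4f}, tool-weighted NLL: {weighted:.4f}")
```

In this toy trace the tool tokens are the poorly predicted ones, so upweighting them raises the loss relative to the unweighted average, which is the pressure such a loss would place on getting operations and arguments right.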

Taxonomy

Topics: Multimodal Machine Learning Applications · Embodied and Extended Cognition · Face Recognition and Perception