Pixelis: Reasoning in Pixels, from Seeing to Acting
Yunpeng Zhou

TL;DR
Pixelis is a pixel-space agent that learns to reason and act directly on images and videos through executable operations, enabling physically grounded visual intelligence and adaptive behavior without external feedback.
Contribution
The paper introduces Pixelis, a pixel-space reasoning framework trained in three phases to improve visual reasoning and action on images and videos.
Findings
Achieves +4.08% average gain over baseline on six benchmarks.
Produces shorter, more auditable toolchains.
Keeps KL divergence inside its corridor during test-time learning.
Abstract
Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and…
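The masked imitation loss in phase (1) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_imitation_loss`, the boolean mask `is_op_token`, and the weight `op_weight=2.0` are hypothetical, and the only idea taken from the abstract is that operation/argument tokens count more than ordinary text tokens in the imitation objective.

```python
import math


def masked_imitation_loss(token_log_probs, is_op_token, op_weight=2.0):
    """Weighted negative log-likelihood over one reference trace.

    token_log_probs: log-probability the model assigns to each reference token.
    is_op_token:     True where the token belongs to an operation or its arguments.
    op_weight:       hypothetical upweighting factor for those tokens (assumption).
    """
    if len(token_log_probs) != len(is_op_token):
        raise ValueError("log-probs and mask must have the same length")
    if not token_log_probs:
        raise ValueError("empty trace")

    # Operation/argument tokens get op_weight, all other tokens get weight 1.
    weights = [op_weight if is_op else 1.0 for is_op in is_op_token]
    weighted_nll = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    # Normalize by the total weight so the loss stays on a per-token scale.
    return weighted_nll / sum(weights)


# Toy trace: one plain-text token (p = 0.5) and one operation token (p = 0.25).
log_probs = [math.log(0.5), math.log(0.25)]
mask = [False, True]

uniform = masked_imitation_loss(log_probs, mask, op_weight=1.0)
upweighted = masked_imitation_loss(log_probs, mask, op_weight=2.0)
```

In this toy trace the poorly predicted operation token dominates once it is upweighted, so `upweighted` exceeds `uniform`. The auxiliary heads and the phase (2) reward terms mentioned in the abstract are not modeled here.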
Taxonomy
Topics: Multimodal Machine Learning Applications · Embodied and Extended Cognition · Face Recognition and Perception
