PixelWorld: How Far Are We from Perceiving Everything as Pixels?

Zhiheng Lyu; Xueguang Ma; Wenhu Chen

arXiv:2501.19339 · cs.CV · October 23, 2025

TL;DR

PixelWorld is a unified pixel-based perception benchmark for vision-language models. Pixel input performs comparably to token-based input on semantic understanding tasks, but degrades notably on reasoning-intensive tasks such as mathematics and code.

Contribution

The paper presents PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space, enabling evaluation of unified perception models across multiple tasks.

Findings

01. Vision transformers can partially capture textual semantics from pixels.

02. Performance drops on reasoning-intensive tasks; Chain-of-Thought prompting mitigates the drop.

03. Representing all modalities as pixels simplifies preprocessing and reduces misalignment.

Abstract

Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation,…
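
To make the idea of rendering a textual input into pixel space concrete, here is a minimal sketch in pure Python. It is not the benchmark's rendering pipeline, which this page does not describe. The 3x5 bitmap font, the function names `render_text` and `to_pgm`, and the `scale` and `pad` parameters are all hypothetical and chosen for illustration only.

```python
# Toy 3x5 bitmap font (hypothetical, illustration only); '#' marks an ink pixel.
GLYPHS = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "2": ["###", "..#", "###", "#..", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "4": ["#.#", "#.#", "###", "..#", "..#"],
    "+": ["...", ".#.", "###", ".#.", "..."],
    "=": ["...", "###", "...", "###", "..."],
    " ": ["...", "...", "...", "...", "..."],
}
GLYPH_W, GLYPH_H = 3, 5
INK, PAPER = 0, 255  # black text on a white background


def render_text(text, scale=2, pad=1):
    """Rasterize `text` into a grayscale grid (list of rows of 0..255 ints)."""
    cell_w = GLYPH_W + 1  # one blank column between neighbouring glyphs
    width = (len(text) * cell_w - 1 + 2 * pad) * scale
    height = (GLYPH_H + 2 * pad) * scale
    pixels = [[PAPER] * width for _ in range(height)]
    for i, ch in enumerate(text):
        glyph = GLYPHS[ch]  # raises KeyError for characters outside the toy font
        for gy, row in enumerate(glyph):
            for gx, mark in enumerate(row):
                if mark != "#":
                    continue
                # Top-left corner of this font pixel after padding and scaling.
                x0 = (pad + i * cell_w + gx) * scale
                y0 = (pad + gy) * scale
                for dy in range(scale):
                    for dx in range(scale):
                        pixels[y0 + dy][x0 + dx] = INK
    return pixels


def to_pgm(pixels):
    """Serialize the grid as a binary PGM (P5) image."""
    height, width = len(pixels), len(pixels[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(v for row in pixels for v in row)


# A mathematical input rendered as pixels instead of being tokenized.
image = render_text("2+2=4")
pgm_bytes = to_pgm(image)
```

Under this setup, a model would receive `image` (or the bytes from `to_pgm`) as its only view of the expression, with no tokenized text alongside it. A realistic renderer would use a proper font and layout engine; the toy font here only keeps the sketch dependency-free.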

Code & Models

Datasets

TIGER-Lab/PixelWorld (dataset, 737 downloads)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer