PixelWorld: How Far Are We from Perceiving Everything as Pixels?
Zhiheng Lyu, Xueguang Ma, Wenhu Chen

TL;DR
PixelWorld is a unified pixel-based perception benchmark for vision-language models. On it, pixel input performs comparably to token input on semantic understanding tasks but degrades on reasoning-intensive tasks such as mathematics and code.
Contribution
The paper presents PixelWorld, a novel benchmark that renders diverse modalities into a shared pixel space, enabling evaluation of unified perception models across multiple tasks.
Findings
Vision transformers can partially capture textual semantics from pixels.
Performance drops on reasoning-intensive tasks such as mathematics and code; Chain-of-Thought prompting mitigates the drop.
Representing all modalities as pixels simplifies preprocessing and reduces misalignment.
Abstract
Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation,…
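To make "rendering inputs into a shared pixel space" concrete, here is a minimal sketch of turning a text prompt into an image that a vision-language model could consume instead of tokens. It assumes the Pillow library; the helper name `render_text_to_image`, the canvas width, the wrap width, and the default font are illustrative choices, not the rendering pipeline used by PixelWorld.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_to_image(text, width=448, chars_per_line=60, line_height=14, margin=8):
    """Render plain text as black-on-white pixels.

    Hypothetical helper for illustration; all sizes are assumptions.
    """
    # Wrap each source line separately so existing newlines (e.g. table rows) survive.
    lines = []
    for paragraph in text.splitlines() or [""]:
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])

    # Canvas height grows with the number of wrapped lines.
    height = 2 * margin + line_height * len(lines)
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()  # built-in font; a real setup would likely load a TrueType font

    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return image


# A question and a small table, both rendered into the same image.
prompt = "Q: What is 17 * 6?\n| a | b |\n| 1 | 2 |"
image = render_text_to_image(prompt)
```

The resulting `image` can be passed to a model's image input in place of the text prompt. Tables, formulas, and diagrams would need their own renderers, which this sketch does not attempt.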
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer
