PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang; Sizhe Cheng; Chenqi Yi; Yong Wang

arXiv:2512.16303 · cs.CV · January 12, 2026

TL;DR

PixelArena introduces a benchmark using semantic segmentation to objectively evaluate the fine-grained visual intelligence of multimodal image generation models, revealing emergent capabilities in the latest Gemini 3 Pro Image.

Contribution

The paper proposes a novel benchmark for pixel-precision evaluation of multimodal image generation models using semantic segmentation tasks, addressing limitations of aesthetic-focused benchmarks.

Findings

01 Gemini 3 Pro Image exhibits high-fidelity semantic mask generation in zero-shot settings

02 Benchmark reveals emergent visual intelligence capabilities in recent models

03 Provides insights into model generalization and future research directions

Abstract

Omni-modal models that have multimodal input and output are emerging. However, benchmarking their multimodal generation, especially in image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics instead of the fine-grained generation capabilities of these models, failing to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. With our benchmark and experiments, we find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results,…
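
To make "objective metrics with pixel precision" concrete, here is a minimal sketch of how a generated semantic mask could be scored against a ground-truth mask. It assumes mean intersection-over-union (mIoU), a common segmentation metric; the truncated abstract above does not say which metric PixelArena actually uses. The function name `mean_iou` and the integer-label mask format are illustrative choices, and the sketch further assumes that a generated image has already been converted from colors to class labels.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two label masks.

    pred, target: 2D lists of integer class labels with the same shape.
    num_classes: number of classes; labels outside [0, num_classes) are ignored.
    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    flat_pred = [label for row in pred for label in row]
    flat_target = [label for row in target for label in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("pred and target must contain the same number of pixels")

    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(flat_pred, flat_target):
        if p == t:
            # Pixel agrees: counts toward both intersection and union of that class.
            if 0 <= p < num_classes:
                intersection[p] += 1
                union[p] += 1
        else:
            # Pixel disagrees: counts toward the union of each class involved.
            if 0 <= p < num_classes:
                union[p] += 1
            if 0 <= t < num_classes:
                union[t] += 1

    # Average only over classes that appear in at least one of the masks.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x2 example: one pixel of class 0 is mislabeled as class 1.
target_mask = [[0, 0],
               [1, 1]]
generated_mask = [[0, 1],
                  [1, 1]]
score = mean_iou(generated_mask, target_mask, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12. Unlike an aesthetic rating, a score of this kind depends only on per-pixel agreement with the reference mask.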

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI