PixelArena: A benchmark for Pixel-Precision Visual Intelligence
Feng Liang, Sizhe Cheng, Chenqi Yi, Yong Wang

TL;DR
PixelArena introduces a benchmark using semantic segmentation to objectively evaluate the fine-grained visual intelligence of multimodal image generation models, revealing emergent capabilities in the latest Gemini 3 Pro Image.
Contribution
The paper proposes a novel benchmark for pixel-precision evaluation of multimodal image generation models using semantic segmentation tasks, addressing limitations of aesthetic-focused benchmarks.
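To make "pixel-precision evaluation" concrete, here is a minimal sketch of how a generated semantic mask could be scored against a ground-truth mask. It uses mean intersection-over-union (mIoU), a standard segmentation metric. This summary does not say which metrics PixelArena actually reports, so the choice of mIoU is an assumption. The function names `per_class_iou` and `mean_iou` are hypothetical. The sketch also assumes the model's output image has already been decoded into integer class labels, one per pixel.

```python
def per_class_iou(pred, target, num_classes):
    """Return one IoU per class; None for classes absent from both masks.

    pred and target are 2-D lists of integer class labels with the same shape.
    Labels outside [0, num_classes) are ignored (an assumption of this sketch).
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("pred and target must have the same shape")

    inter = [0] * num_classes  # pixels where both masks agree on class c
    union = [0] * num_classes  # pixels where either mask says class c

    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                if 0 <= p < num_classes:
                    inter[p] += 1
                    union[p] += 1
            else:
                if 0 <= p < num_classes:
                    union[p] += 1
                if 0 <= t < num_classes:
                    union[t] += 1

    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in at least one mask."""
    scores = [s for s in per_class_iou(pred, target, num_classes) if s is not None]
    return sum(scores) / len(scores) if scores else None


# Toy 2x3 example: the prediction mislabels one class-0 pixel as class 1.
target = [[0, 0, 1],
          [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

Unlike an aesthetic preference score, this number depends only on pixel-level agreement with the reference mask. In practice the per-pixel loop would usually be vectorized with an array library.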
Findings
Gemini 3 Pro Image generates high-fidelity semantic masks in zero-shot settings.
The benchmark reveals emergent visual intelligence capabilities in recent models.
The results offer insights into model generalization and directions for future research.
Abstract
Omni-modal models that have multimodal input and output are emerging. However, benchmarking their multimodal generation, especially in image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics instead of the fine-grained generation capabilities of these models, failing to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. With our benchmark and experiments, we find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI
