In Pursuit of Pixel Supervision for Visual Pre-training

Lihe Yang; Shang-Wen Li; Yang Li; Xinjie Lei; Dong Wang; Abdelrahman Mohamed; Hengshuang Zhao; Hu Xu

arXiv:2512.15715·cs.CV·December 18, 2025

Open Access · 5 Models · 1 Dataset

TL;DR

Pixio, an enhanced masked autoencoder trained on 2B web-crawled images, demonstrates that pixel-level self-supervised learning remains competitive across diverse downstream visual tasks.

Contribution

The paper introduces Pixio, an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures, and shows that it performs competitively across a wide range of downstream vision tasks.

Findings

01 Pixio outperforms or matches DINOv3 on various tasks.

02 Pixel-space self-supervised learning is a viable alternative to latent-space methods.

03 Pixio is trained on 2 billion images with minimal human curation.

Abstract

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular…
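
For readers unfamiliar with the masked autoencoder objective the abstract builds on, the following is a minimal sketch of the generic MAE recipe: hide a random subset of image patches and score the reconstruction only on the hidden ones. It is not Pixio's implementation. The function names, the flat-list patch representation, and the 0.75 mask ratio in the usage note are illustrative assumptions; the listing above does not state how Pixio makes its pre-training tasks more challenging.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper: `mask_ratio` is an illustrative parameter,
    not a value taken from the Pixio paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_pixel_mse(target_patches, predicted_patches, masked):
    """Mean squared error over the pixels of masked patches only.

    Each patch is a flat list of pixel values. Visible patches are
    excluded from the loss, as in the generic MAE objective.
    """
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target_patches[i], predicted_patches[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0
```

As a usage example, `random_patch_mask(16, 0.75)` hides 12 of 16 patches, and `masked_pixel_mse` then compares a decoder's predictions against the original pixels for those 12 patches only. In a real model, only the visible patches would be passed to the encoder.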

Code & Models

Datasets

leandrehonore/Pixio_base_blind_spots
dataset · 4.2k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications