V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Han Lin; Xichen Pan; Zun Wang; Yue Zhang; Chu Wang; Jaemin Cho; Mohit Bansal

arXiv:2603.16792·cs.CV·March 18, 2026

TL;DR

V-Co systematically studies visual co-denoising in pixel-space diffusion models. It identifies four key ingredients that improve semantic alignment, and the resulting model outperforms baseline pixel-space diffusion models on ImageNet-256.

Contribution

This paper introduces a unified framework for visual co-denoising, clarifies essential design choices, and provides a practical recipe for enhancing diffusion models with visual feature alignment.

Findings

01. V-Co outperforms baseline pixel-space diffusion models on ImageNet-256.

02. Four key ingredients are identified for effective visual co-denoising.

03. V-Co achieves better results with fewer training epochs.

Abstract

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. Existing co-denoising approaches, however, often entangle multiple design choices, making it unclear which of them are truly essential. We therefore present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Image Enhancement Techniques