Masked Diffusion Captioning for Visual Feature Learning

Chao Feng; Zihao Wei; Andrew Owens

arXiv:2510.26799·cs.CV·October 31, 2025

TL;DR

This paper introduces masked diffusion captioning (MDC), which learns visual features by training an image-conditioned masked diffusion language model to generate image captions. In linear probing, the resulting features are competitive with those from autoregressive and contrastive approaches.

Contribution

The paper proposes masked diffusion captioning, in which the strength of the visual learning signal does not depend on each token's position in the sequence. Compared with autoregressive captioning, this reduces the need for auxiliary objectives.

Findings

01

In linear probing, the learned features are competitive with those from autoregressive and contrastive approaches.

02

In MDC, the strength of the visual learning signal does not depend on a token's position in the sequence.

03

The comparison covers a variety of academic-scale models and datasets.

Abstract

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
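The training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `MASK_ID`, `mask_caption`, and `reconstruction_loss` are hypothetical names, the masking ratio is assumed to be drawn uniformly from [0, 1), and the decoder is stood in for by a plain table of logits. Some masked diffusion language model formulations also reweight the loss by the masking ratio; that is omitted here because the abstract does not specify it.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_caption(tokens, rng):
    """Mask caption tokens at a randomly chosen ratio.

    Returns the corrupted sequence, a per-token masked flag,
    and the sampled ratio.
    """
    ratio = rng.random()  # assumption: ratio ~ Uniform[0, 1), one per caption
    is_masked = [rng.random() < ratio for _ in tokens]
    corrupted = [MASK_ID if m else t for t, m in zip(tokens, is_masked)]
    return corrupted, is_masked, ratio


def reconstruction_loss(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits: one list of vocabulary scores per position (a stand-in for
    the output of a decoder conditioned on visual features).
    targets: the original, unmasked token ids.
    """
    losses = []
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked tokens are visible, so nothing to reconstruct
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
caption = [5, 9, 3, 7]  # toy token ids for one caption
corrupted, is_masked, ratio = mask_caption(caption, rng)
dummy_logits = [[0.0] * 16 for _ in caption]  # placeholder decoder output
loss = reconstruction_loss(dummy_logits, caption, is_masked)
```

Because every position is masked with the same probability, each token is equally likely to contribute to the loss regardless of where it sits in the caption. This is the position independence that the abstract contrasts with autoregressive captioning.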
