TL;DR
This paper introduces masked diffusion captioning (MDC), a novel method for learning visual features by training a masked diffusion language model to generate image captions, which performs competitively on downstream tasks.
Contribution
The paper proposes a new masked diffusion captioning approach that reduces reliance on token position and auxiliary objectives, improving visual feature learning.
Findings
Learned features are competitive with autoregressive methods.
MDC reduces dependence on token position in training.
Features perform well across various datasets and models.
Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
