Text-image Alignment for Diffusion-based Perception
Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogerio Guimaraes, Pietro Perona

TL;DR
This paper explores how to effectively utilize diffusion models for various vision tasks by leveraging automatically generated captions to improve text-image alignment, leading to state-of-the-art results across multiple benchmarks.
Contribution
The paper introduces a method that uses automatically generated captions to improve text-image alignment in diffusion backbones, achieving new state-of-the-art performance in semantic segmentation, depth estimation, object detection, and domain adaptation.
Findings
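Automatically generated captions improve text-image alignment and the model's cross-attention maps, which leads to better perceptual performance.
Achieved SOTA among diffusion-based methods for semantic segmentation on ADE20K and overall SOTA for depth estimation on NYUv2.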
Demonstrated effective cross-domain generalization and adaptation.
Abstract
Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to…
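To make the notion of a cross-attention map concrete, the following is a minimal sketch in plain Python of standard scaled dot-product cross-attention between image-location queries and caption-token keys. It is not the paper's implementation: the function names, the toy vectors, and the token labels in the comments are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(image_queries, text_keys):
    """Return a P x T attention map.

    image_queries: P query vectors, one per image location.
    text_keys:     T key vectors, one per caption token.
    Row p holds the attention weights of image location p over the
    caption tokens; each row sums to 1.
    """
    d = len(text_keys[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for q in image_queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_keys]
        attn.append(softmax(scores))
    return attn


# Toy example (hypothetical values): two image locations, two caption tokens.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0]]  # e.g. embeddings for tokens "dog", "grass"
attn = cross_attention_map(image_queries, text_keys)
```

In this toy setup each image location attends mostly to the token whose key points in the same direction as its query. A caption that names what is actually in the image gives each location a matching token to attend to, which is the intuition behind better-aligned prompts producing cleaner maps.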
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
Methods: Convolution · 1x1 Convolution · Feature Pyramid Network · Diffusion
