Text-image Alignment for Diffusion-based Perception

Neehar Kondapaneni; Markus Marks; Manuel Knott; Rogerio Guimaraes; Pietro Perona

arXiv:2310.00031·cs.CV·April 2, 2024·2 cites

Open Access · 2 Repos

TL;DR

This paper explores how to effectively utilize diffusion models for various vision tasks by leveraging automatically generated captions to improve text-image alignment, leading to state-of-the-art results across multiple benchmarks.

Contribution

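It introduces a method that uses automatically generated captions to improve text-image alignment when a diffusion model serves as the backbone for vision tasks. The method improves on the SOTA in diffusion-based semantic segmentation on ADE20K and on the overall SOTA in depth estimation on NYUv2; it also reports SOTA results in object detection and domain adaptation.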

Findings

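01. Automatically generated captions improve text-image alignment and the model's cross-attention maps, which leads to better perceptual performance.

02. Achieved SOTA in diffusion-based semantic segmentation on ADE20K and overall SOTA in depth estimation on NYUv2.

03. The method generalizes to the cross-domain setting, where model personalization and caption modifications are used.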

Abstract

Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to…
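As a rough illustration of why prompt alignment can matter for cross-attention maps, the toy sketch below computes scaled dot-product attention from image patches to text tokens. It is not the authors' pipeline: the one-hot "embeddings", the concept names, and the function `cross_attention_map` are all assumptions made up for this example, and no diffusion model or captioner is involved. It only shows that a prompt whose tokens match the image content yields peaked attention, while an unrelated prompt yields a flat map.

```python
import numpy as np


def cross_attention_map(image_feats, text_embeds):
    """Scaled dot-product attention from image patches to text tokens.

    image_feats: (num_patches, d) array of patch features.
    text_embeds: (num_tokens, d) array of token embeddings.
    Returns a (num_patches, num_tokens) array whose rows sum to 1.
    """
    d = image_feats.shape[1]
    scores = image_feats @ text_embeds.T / np.sqrt(d)
    # Subtract the row-wise max for a numerically stable softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical 4-dimensional one-hot concept embeddings (toy assumption).
concepts = {name: np.eye(4)[i] for i, name in enumerate(["dog", "grass", "car", "sky"])}

# A toy "image" with two dog patches and two grass patches.
image_feats = 4.0 * np.stack(
    [concepts["dog"], concepts["dog"], concepts["grass"], concepts["grass"]]
)

# A caption that matches the image content versus an unrelated prompt.
aligned_prompt = np.stack([concepts["dog"], concepts["grass"]])
unrelated_prompt = np.stack([concepts["car"], concepts["sky"]])

aligned = cross_attention_map(image_feats, aligned_prompt)
misaligned = cross_attention_map(image_feats, unrelated_prompt)

print("aligned prompt attention:\n", aligned.round(3))
print("unrelated prompt attention:\n", misaligned.round(3))
```

With the matching prompt, each patch places most of its attention on the token for its own concept; with the unrelated prompt, every patch splits its attention evenly, so the map carries no spatial information about the image.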

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques

Methods: Convolution · 1x1 Convolution · Feature Pyramid Network · Diffusion