Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor

Vatsal Agarwal; Matthew Gwilliam; Gefen Kohavi; Eshan Verma; Daniel Ulbricht; Abhinav Shrivastava

arXiv:2507.07106 · cs.CV · July 10, 2025

Open Access

TL;DR

This paper explores using pre-trained text-to-image diffusion models as visual encoders for multimodal tasks, showing they provide rich, instruction-aware features that improve visual question answering and reasoning capabilities.

Contribution

The paper leverages diffusion model features for multimodal understanding, addressing limitations of existing encoders such as CLIP, and proposes methods to align and fuse these features with language models.

Findings

01

Diffusion features are rich in semantics and encode strong image-text alignment.

02

Text conditioning in diffusion models helps focus on relevant image regions.

03

Fusion of CLIP and diffusion features improves performance on VQA and reasoning tasks.

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it often misses fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find that diffusion features are both rich in semantics and encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning