An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Zizhao Hu; Shaochong Jia; Mohammad Rostami

arXiv:2403.16530 · cs.CV · March 26, 2024 · 2 cites

TL;DR

This paper introduces an intermediate fusion strategy in Vision Transformer models that enhances text-image alignment and generation quality in diffusion models while also improving computational efficiency.

Contribution

The paper proposes an intermediate fusion approach for text-conditioned diffusion models that outperforms the widely used early fusion in text-image alignment, generation quality, and efficiency.

Findings

01 Higher CLIP Score and lower FID with intermediate fusion

02 20% reduction in FLOPs compared to early fusion

03 50% increase in training speed with intermediate fusion

Abstract

Diffusion models have been widely used for conditional cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics expressed in language, such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We discover that, compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments using a text-to-image generation task on the MS-COCO dataset. We compare our…
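The efficiency argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It assumes, as a simplification not confirmed by the text above, that early fusion concatenates text tokens with image tokens in every transformer layer, while intermediate fusion adds them only in a block of middle layers. The layer count, token counts, hidden size, and fused-layer range are all hypothetical, and only the two attention matrix products are counted, so the result is not expected to reproduce the reported 20% figure.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs for Q @ K^T plus attn @ V in one self-attention layer.

    Each product costs about 2 * N * N * d multiply-adds, so 4 * N^2 * d total.
    Projections, softmax, and MLP blocks are deliberately ignored.
    """
    return 4 * num_tokens * num_tokens * dim


def stack_attention_flops(num_layers, image_tokens, text_tokens, dim, fused_layers):
    """Sum attention FLOPs over a stack where text tokens join only in fused_layers."""
    total = 0
    for layer in range(num_layers):
        tokens = image_tokens + (text_tokens if layer in fused_layers else 0)
        total += attention_flops(tokens, dim)
    return total


# Hypothetical configuration, chosen only for illustration.
NUM_LAYERS = 12
IMAGE_TOKENS = 256
TEXT_TOKENS = 77
DIM = 512

# Assumed early fusion: text tokens are present in every layer.
early = stack_attention_flops(
    NUM_LAYERS, IMAGE_TOKENS, TEXT_TOKENS, DIM, set(range(NUM_LAYERS))
)
# Assumed intermediate fusion: text tokens are present only in layers 4 to 7.
intermediate = stack_attention_flops(
    NUM_LAYERS, IMAGE_TOKENS, TEXT_TOKENS, DIM, set(range(4, 8))
)

saving = 1 - intermediate / early
print(f"early fusion attention FLOPs:        {early:,}")
print(f"intermediate fusion attention FLOPs: {intermediate:,}")
print(f"relative saving in this toy setup:   {saving:.1%}")
```

Because attention cost grows quadratically with sequence length, dropping the text tokens from the outer layers shrinks the per-layer cost there; the size of the saving depends entirely on the assumed configuration.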

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN · Contrastive Language-Image Pre-training