Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

Rohan Pandey; Rulin Shao; Paul Pu Liang; Ruslan Salakhutdinov; Louis-Philippe Morency

arXiv:2212.10549 · cs.CL · July 6, 2023

TL;DR

This paper introduces a novel regularization technique called Cross-modal Attention Congruence Regularization (CACR) that enforces relation-level alignment between vision and language attention, improving compositional generalization in vision-language models.

Contribution

The paper proposes a new relation alignment regularization method that aligns directed attention between text and images, enhancing model understanding of semantic relations.

Findings

01 Improved Winoground benchmark performance

02 Enhanced relation-level alignment in vision-language models

03 Proven equivalence to attention matrix congruence under a change of basis

Abstract

Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence…
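The abstract's central idea, matching directed language attention to directed visual attention with cross-modal attention as the soft token-to-object mapping, can be illustrated with a toy calculation. What follows is a minimal sketch under stated assumptions, not the paper's actual loss: the names `congruence_penalty`, `a_ll`, `a_vv`, and `a_lv` are hypothetical, the projection `a_lv @ a_vv @ a_lv.T` is one plausible reading of "congruence under a change of basis", and the squared-error distance is an assumption chosen for simplicity.

```python
import numpy as np


def congruence_penalty(a_ll, a_vv, a_lv):
    """Toy relation-alignment penalty (illustrative, not the paper's exact loss).

    a_ll: (n_l, n_l) language self-attention, row i = attention from token i.
    a_vv: (n_v, n_v) visual self-attention, row i = attention from region i.
    a_lv: (n_l, n_v) cross-modal attention from language tokens to regions,
          used here as a soft change of basis (an assumption of this sketch).
    """
    # Express the visual attention in the language basis: P A P^T.
    projected = a_lv @ a_vv @ a_lv.T
    # Assumed distance: mean squared difference between the two matrices.
    return float(np.mean((a_ll - projected) ** 2))


# Two tokens ("mug", "grass") and two regions; 'mug' attends strongly to 'grass'.
a_ll = np.array([[0.1, 0.9],
                 [0.5, 0.5]])
a_lv = np.eye(2)  # token i maps exactly to region i in this toy case

# Visual attention with the same directed pattern as the text.
aligned = congruence_penalty(a_ll, a_ll.copy(), a_lv)

# Visual attention with the direction of the relation flipped.
a_vv_flipped = a_ll[::-1, ::-1]
mismatched = congruence_penalty(a_ll, a_vv_flipped, a_lv)

print(aligned, mismatched)
```

In this toy setup the penalty vanishes when the visual attention carries the same directed pattern as the language attention and grows when the direction is reversed, which is the behavior a relation-alignment regularizer is meant to reward.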

Code & Models

Repositories

lancopku/IAIS
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: UNiversal Image-TExt Representation Learning