Equivariant Similarity for Vision-Language Foundation Models

Tan Wang; Kevin Lin; Linjie Li; Chung-Ching Lin; Zhengyuan Yang; Hanwang Zhang; Zicheng Liu; Lijuan Wang

arXiv:2303.14465·cs.CV·October 10, 2023·1 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces EqSim, a regularization loss to improve equivariance in vision-language models, and presents EqBen, a benchmark for evaluating visual-minimal change equivariance, revealing current models' limitations.

Contribution

It proposes a novel regularization loss for enhancing equivariance in VLMs and introduces a new benchmark for assessing visual-minimal change equivariance.

Findings

01. Current VLMs lack sufficient equivariance.

02. EqSim effectively improves equivariance in models.

03. EqBen provides a new challenging evaluation for visual-minimal changes.
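
As a rough illustration of what evaluating a minimally different pair can look like, the sketch below scores one pair of two images and two captions from a 2x2 similarity matrix. The function name `pair_scores` and the text/image/group scoring rule are assumptions made for this example (a common convention for paired benchmarks); the page does not state which metrics EqBen uses.

```python
import numpy as np

def pair_scores(sim):
    """Score one minimally different pair (hypothetical helper).

    sim[i][j] is the model's similarity between image i and caption j;
    the matched image-caption pairs sit on the diagonal.
    """
    sim = np.asarray(sim, dtype=float)
    # Text check: each image must prefer its own caption.
    text_ok = sim[0, 0] > sim[0, 1] and sim[1, 1] > sim[1, 0]
    # Image check: each caption must prefer its own image.
    image_ok = sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1]
    return {
        "text": bool(text_ok),
        "image": bool(image_ok),
        "group": bool(text_ok and image_ok),  # both checks must pass
    }

# Made-up similarity values, not real model outputs.
result = pair_scores([[0.5, 0.4], [0.6, 0.7]])
```

Here the captions are ranked correctly for each image, but caption 0 prefers the wrong image, so the pair fails the stricter combined check.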

Abstract

This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks. Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging as the ground truth of semantic change is difficult to collect. For example, given an image-text pair about a dog, it is unclear to what extent the similarity changes when the pixel is changed from dog to cat. To this end, we propose EqSim, a regularization loss that can be efficiently…
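
To make the idea of similarity "varying faithfully" more concrete, here is a minimal sketch of one way to quantify how consistently a similarity function reacts to a semantic swap. It is not the EqSim loss, whose definition is cut off in the abstract above; the name `equivariance_gap`, the use of cosine similarity, and the squared-difference form are all assumptions chosen for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equivariance_gap(img1, txt1, img2, txt2):
    """Hypothetical consistency measure for two matched pairs.

    d1: similarity drop when image 1 is paired with caption 2 instead of caption 1.
    d2: similarity drop when image 2 is paired with caption 1 instead of caption 2.
    If the same semantic change separates the two pairs, one might expect the
    two drops to be close, so the squared difference is returned as a penalty.
    """
    d1 = cosine_sim(img1, txt1) - cosine_sim(img1, txt2)
    d2 = cosine_sim(img2, txt2) - cosine_sim(img2, txt1)
    return (d1 - d2) ** 2

# Toy 2-D embeddings, not real model outputs.
gap = equivariance_gap([1, 0], [1, 0], [1, 1], [0, 1])
```

In the toy example the first image's similarity drops sharply when its caption is swapped while the second image's similarity does not change at all, so the gap is large. A plain matched-versus-unmatched objective would not penalize that asymmetry.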

Code & Models

Repositories

wangt-cn/eqben
PyTorch · Official

Datasets

ytaek-oh/eqben-images
dataset · 9 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning