Equivariant Similarity for Vision-Language Foundation Models
Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

TL;DR
This paper introduces EqSim, a regularization loss that improves equivariance in vision-language models, and EqBen, a benchmark that evaluates equivariance under visually minimal changes and exposes the limitations of current models.
Contribution
The paper proposes a regularization loss (EqSim) for enhancing equivariance in VLMs and a new benchmark (EqBen) for assessing equivariance under visually minimal changes.
Findings
Current VLMs lack sufficient equivariance.
EqSim effectively improves equivariance in models.
EqBen provides a new, challenging evaluation based on visually minimal changes.
Abstract
This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function, which is not only the major training objective but also the core delivery to support downstream tasks. Unlike the existing image-text similarity objective, which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging because the ground truth of semantic change is difficult to collect. For example, given an image-text pair about a dog, it is unclear to what extent the similarity changes when the pixel is changed from dog to cat. To this end, we propose EqSim, a regularization loss that can be efficiently…
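The idea that similarity should vary consistently with semantic change can be illustrated with a small sketch. The abstract above is truncated before it defines EqSim, so the following is not the paper's loss. It is a hypothetical consistency check, assumed here for illustration only: given two matched image-text pairs, it compares how much the similarity drops when the image is swapped in each direction. The function names, the use of cosine similarity, and the toy embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_change_gap(img1, txt1, img2, txt2):
    """Hypothetical check over two matched pairs (img1, txt1) and (img2, txt2).

    drop_1: how much similarity to txt1 falls when img1 is replaced by img2.
    drop_2: how much similarity to txt2 falls when img2 is replaced by img1.
    Returns |drop_1 - drop_2|; 0 means both swaps change similarity equally.
    """
    drop_1 = cosine(img1, txt1) - cosine(img2, txt1)
    drop_2 = cosine(img2, txt2) - cosine(img1, txt2)
    return abs(drop_1 - drop_2)


# Toy 2-D embeddings, made up for illustration (not model outputs).
img_dog, txt_dog = [1.0, 0.0], [1.0, 0.0]
img_cat, txt_cat = [0.0, 1.0], [0.0, 1.0]

gap = similarity_change_gap(img_dog, txt_dog, img_cat, txt_cat)
print(gap)
```

In this toy case both swaps reduce the similarity by the same amount, so the gap is zero. A nonzero gap would indicate that the similarity function responds unevenly to the same semantic change seen from the two sides, which is the kind of behavior the abstract describes as a lack of equivariance.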
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
