Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Jian Xiao, Zijie Song, Jialong Hu, Hao Cheng, Jia Li, Zhenzhen Hu, Richang Hong

TL;DR
This paper introduces GARE, a novel framework for text-video retrieval that mitigates optimization tension caused by modality gaps and noisy negatives, leading to improved alignment accuracy and robustness.
Contribution
GARE employs a learnable, pair-specific increment guided by a Taylor expansion and a semantic gap-conditioned neural module, enhancing contrastive learning in text-video retrieval.
Findings
Consistently improves alignment accuracy across four benchmarks.
Enhances robustness against noisy negatives and modality gaps.
Demonstrates effectiveness of gap-aware tension mitigation.
Abstract
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment between text and video, redistributing gradients to relieve optimization tension and absorb noise. We derive this increment via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples…
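To make the idea of a pair-specific increment concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes plain dot-product similarities, a text-to-video InfoNCE loss with positives on the diagonal, and an increment added to the text embedding of each (text, video) pair. The function names, `tau`, and `eps` are hypothetical. Instead of GARE's learned neural module, the sketch uses the generic closed-form step that minimizes a first-order Taylor approximation of the loss inside a norm ball (a simple trust region), which is steepest descent scaled to radius `eps`.

```python
import numpy as np

def info_nce(logits):
    """Text-to-video InfoNCE; positives sit on the diagonal of `logits`."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def pair_logits(text, video, delta, tau):
    """logits[i, j] = (text[i] + delta[i, j]) . video[j] / tau."""
    shifted = text[:, None, :] + delta  # shape (B, B, D)
    return np.einsum("ijd,jd->ij", shifted, video) / tau

def first_order_increment(text, video, tau, eps):
    """Assumed stand-in for the learned increment: the minimizer of the
    first-order Taylor expansion of the loss at delta = 0, subject to
    ||delta[i, j]|| <= eps for every pair (i, j)."""
    batch = text.shape[0]
    logits = text @ video.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Gradient of the mean loss w.r.t. delta[i, j] at delta = 0:
    # (softmax probability - one-hot label) / (batch * tau) * video[j]
    coeff = (prob - np.eye(batch)) / (batch * tau)
    grad = coeff[:, :, None] * video[None, :, :]
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Steepest-descent direction scaled to the trust-region radius.
    return -eps * grad / np.maximum(norm, 1e-12)

# Toy data: 4 text/video pairs with 16-dimensional unit-norm embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
video = rng.normal(size=(4, 16))
text /= np.linalg.norm(text, axis=1, keepdims=True)
video /= np.linalg.norm(video, axis=1, keepdims=True)

tau, eps = 0.07, 0.1
delta = first_order_increment(text, video, tau, eps)
base_loss = info_nce(text @ video.T / tau)
gap_aware_loss = info_nce(pair_logits(text, video, delta, tau))
print(base_loss, gap_aware_loss)
```

In this toy setting the increment moves each positive pair toward its video and each negative pair away from it, so the loss falls without changing the anchor embeddings themselves. That is the intuition behind offloading optimization tension onto the increment, under the assumptions stated above.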
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Contrastive Learning · InfoNCE
