TL;DR
This paper introduces a new benchmark for evaluating video-text retrieval models under query shifts, analyzes the hubness problem, and proposes HAT-VTR, a test-time adaptation method that significantly improves robustness.
Contribution
The paper presents a comprehensive benchmark for query shifts in video-text retrieval and proposes HAT-VTR, a novel test-time adaptation framework addressing hubness and improving robustness.
Findings
HAT-VTR outperforms prior methods across various query shift scenarios.
Query shifts significantly increase hubness in video-text retrieval.
The benchmark covers 12 types of video perturbation at five severity levels, showing how the type and severity of a shift affect retrieval performance.
Abstract
Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline…
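To make the hubness phenomenon described in the abstract concrete, here is a minimal sketch of a common diagnostic from the general hubness literature: count how often each gallery item appears in the queries' top-k lists (its k-occurrence), then take the skewness of those counts. This is not HAT-VTR, and it is not necessarily the measure the paper uses. The function names `k_occurrence` and `hubness_skewness`, the choice of k, and the toy similarity matrix are all assumptions made for illustration.

```python
import numpy as np


def k_occurrence(sim: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how often each gallery item lands in a query's top-k list.

    sim is a (num_queries, num_gallery) similarity matrix, higher = closer.
    Hypothetical helper for illustration; not taken from the paper.
    """
    num_gallery = sim.shape[1]
    # Indices of the k most similar gallery items per query (order within
    # the top-k does not matter for counting).
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=num_gallery)


def hubness_skewness(counts: np.ndarray) -> float:
    """Skewness of the k-occurrence distribution.

    Values near 0 mean retrievals are spread evenly over the gallery;
    large positive values mean a few items (hubs) absorb most retrievals.
    """
    counts = counts.astype(float)
    std = counts.std()
    if std == 0.0:
        return 0.0  # perfectly uniform: no hubs
    centered = counts - counts.mean()
    return float((centered ** 3).mean() / std ** 3)


# Toy example (assumed data): every query's nearest item is gallery item 0.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.6],
])
counts = k_occurrence(sim, k=1)
print(counts, hubness_skewness(counts))
```

In the toy matrix, gallery item 0 is the top match for all three queries, so it acts as a hub and the skewness is positive. Computing the same statistic on clean and on shifted queries is one way to check whether a shift concentrates retrievals on fewer gallery items.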