TL;DR
ReTrack is a dual-stream network that calibrates directional bias in composed video retrieval, improving multi-modal query understanding and achieving state-of-the-art results on three benchmark datasets.
Contribution
ReTrack is presented as the first framework to explicitly calibrate directional bias in composed features for multi-modal video retrieval.
Findings
ReTrack achieves state-of-the-art performance on three benchmark datasets.
It effectively disentangles semantic contributions of video and text modalities.
ReTrack generalizes well to composed image retrieval tasks.
Abstract
With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a new paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval, CVR takes a multi-modal query as input: a reference video and a piece of modification text. The modification text conveys the user's intended alterations to the reference video, and the model aims to retrieve the target video that best matches the modified query. In CVR, the video and text modalities differ substantially in information density. Traditional composition methods therefore tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. Three core challenges make this limitation significant: (1) modal contribution entanglement, (2) explicit optimization of…
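The bias described in the abstract can be illustrated with a toy example. The sketch below is not ReTrack's method: it assumes a plain weighted-sum composition, hand-picked 2-D features, and hypothetical names (`compose`, `rank`, `video_weight`) purely to show how a composed feature that leans toward the reference video can retrieve a near copy of that video instead of the intended target.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def compose(video_feat, text_feat, video_weight=0.5):
    """Hypothetical weighted-sum composition (an assumption, not ReTrack)."""
    v, t = normalize(video_feat), normalize(text_feat)
    mixed = [video_weight * a + (1.0 - video_weight) * b for a, b in zip(v, t)]
    return normalize(mixed)

def rank(query, gallery):
    """Return gallery keys sorted by cosine similarity to the query, best first."""
    def cos(a, b):
        return sum(x * y for x, y in zip(a, normalize(b)))
    return sorted(gallery, key=lambda k: cos(query, gallery[k]), reverse=True)

# Toy features: axis 0 ~ content of the reference video, axis 1 ~ requested change.
reference_video = [1.0, 0.0]
modification_text = [0.0, 1.0]
gallery = {
    "near_copy_of_reference": [1.0, 0.1],
    "intended_target": [0.6, 0.8],
}

# A composed feature dominated by the video stream vs. a balanced one.
biased = rank(compose(reference_video, modification_text, video_weight=0.9), gallery)
balanced = rank(compose(reference_video, modification_text, video_weight=0.5), gallery)
print(biased[0], balanced[0])
```

With these toy values, the video-dominated query ranks the near copy first, while the balanced query ranks the intended target first. Real CVR features are high-dimensional and learned, so this only conveys the intuition behind directional bias.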
