Disentangled Representation Learning for Text-Video Retrieval
Qiang Wang, Yanhao Zhang, Yun Zheng, Pan Pan, Xian-Sheng Hua

TL;DR
This paper introduces a disentangled framework for Text-Video Retrieval (TVR) that separates the interaction content from the matching function used in cross-modality interaction, and reports gains on multiple benchmarks, including up to +7.9% R@1 on MSVD.
Contribution
The paper proposes a Weighted Token-wise Interaction (WTI) module and a Channel DeCorrelation Regularization, aimed at learning sequential and hierarchical representations for TVR.
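The text above names a Channel DeCorrelation Regularization but does not give its formula. One common way to decorrelate feature channels across two modalities is a redundancy-reduction loss on the cross-correlation matrix of matched text and video features; the sketch below assumes that form. The function names, the `off_diag_weight` parameter, and its default value are hypothetical and are not taken from the paper.

```python
import math


def standardize_columns(rows):
    """Zero-mean, unit-variance normalization of each channel over the batch."""
    n, d = len(rows), len(rows[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant channel to avoid division by zero.
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (rows[i][j] - mean) / std
    return out


def channel_decorrelation_loss(text_feats, video_feats, off_diag_weight=0.1):
    """Assumed form: pull diagonal cross-correlations toward 1 and
    off-diagonal ones toward 0. Row k of each input is a matched pair."""
    t = standardize_columns(text_feats)
    v = standardize_columns(video_feats)
    n, d = len(t), len(t[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(t[k][i] * v[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss


# Two fully redundant channels: only the off-diagonal terms are penalized.
redundant = [[1.0, 1.0], [-1.0, -1.0]]
redundant_loss = channel_decorrelation_loss(redundant, redundant)

# Two uncorrelated channels that agree across modalities: loss is zero.
independent = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
independent_loss = channel_decorrelation_loss(independent, independent)
```

In this toy setting the redundant features are penalized while the independent ones are not, which is the behavior a decorrelation regularizer is meant to encourage.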
Findings
Outperforms existing methods on multiple benchmarks.
Achieves up to +7.9% R@1 improvement on MSVD.
Demonstrates the effectiveness of disentangled representations in cross-modal retrieval.
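For readers unfamiliar with the R@1 metric cited above, here is a minimal sketch of how Recall@K is typically computed for text-to-video retrieval. It assumes a square similarity matrix in which the correct video for query i sits at index i; that layout and the function name are illustrative assumptions, not details from the paper.

```python
def recall_at_k(sim, k):
    """Fraction of text queries whose correct video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j;
    the ground-truth match for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: number of videos scoring strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.1],  # query 1: correct video ranked second
    [0.2, 0.1, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(sim, 1)  # 2 of 3 queries hit
r_at_2 = recall_at_k(sim, 2)  # all 3 queries hit
```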
Abstract
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance. This paper first studies the interaction paradigm in depth, where we find that its computation can be split into two terms, the interaction contents at different granularity and the matching function to distinguish pairs with the same semantics. We also observe that the single-vector representation and implicit intensive function substantially hinder the optimization. Based on these findings, we propose a disentangled framework to capture a sequential and hierarchical representation. Firstly, considering the natural sequential structure in both text and video inputs, a Weighted Token-wise Interaction (WTI) module is performed to decouple the content and adaptively exploit the pair-wise…
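The abstract is cut off before it describes WTI in detail, so the following is only a sketch of one plausible token-wise interaction. It assumes that each text token is matched to its most similar video frame (and each frame to its most similar token), and that the per-token maxima are combined with softmax-normalized importance weights. In a trained model those weights would come from a learned network; here they are passed in as hypothetical logits.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_token_wise_interaction(text_tokens, video_frames,
                                    text_logits, video_logits):
    """Assumed WTI-style score between one sentence and one video.

    text_tokens / video_frames: lists of embedding vectors.
    text_logits / video_logits: unnormalized importance scores
    (hypothetical stand-ins for learned weights).
    """
    text = [l2_normalize(t) for t in text_tokens]
    video = [l2_normalize(v) for v in video_frames]
    # Cosine similarity between every token and every frame.
    sim = [[dot(t, v) for v in video] for t in text]
    text_w = softmax(text_logits)
    video_w = softmax(video_logits)
    # Text-to-video: best-matching frame per token, weighted by token importance.
    t2v = sum(w * max(row) for w, row in zip(text_w, sim))
    # Video-to-text: best-matching token per frame, weighted by frame importance.
    v2t = sum(w * max(sim[i][j] for i in range(len(text)))
              for j, w in enumerate(video_w))
    return 0.5 * (t2v + v2t)


# Two tokens, two frames, each token perfectly aligned with one frame.
aligned_score = weighted_token_wise_interaction(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]],
    [0.0, 0.0], [0.0, 0.0])

# Same tokens, but the video only covers the first one.
partial_score = weighted_token_wise_interaction(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]],
    [0.0, 0.0], [0.0])
```

Compared with pooling each modality into a single vector before taking one dot product, a token-wise score of this kind keeps the pairwise token-frame correlations visible, which fits the abstract's concern that single-vector representations hinder optimization.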
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
