Disentangled Representation Learning for Text-Video Retrieval

Qiang Wang; Yanhao Zhang; Yun Zheng; Pan Pan; Xian-Sheng Hua

arXiv:2203.07111 · cs.CV · March 15, 2022 · 41 cites

TL;DR

This paper introduces a disentangled framework for Text-Video Retrieval (TVR) that improves cross-modality interaction by separating the interaction content from the matching function, with gains of up to +7.9% R@1 on MSVD.

Contribution

The paper proposes a Weighted Token-wise Interaction (WTI) module and a Channel DeCorrelation Regularization to learn sequential and hierarchical representations for TVR.

Findings

01 Outperforms existing methods on multiple benchmarks.

02 Achieves up to +7.9% R@1 improvement on MSVD.

03 Demonstrates the effectiveness of disentangled representations in cross-modal retrieval.
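
For readers unfamiliar with the R@1 figure quoted above: R@K (recall at K) is the fraction of text queries whose correct video appears among the top K retrieved results. The sketch below shows one common way to compute it from a text-to-video similarity matrix. The function name `recall_at_k`, the toy scores, and the assumption that query i matches video i are all illustrative and do not come from the paper.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity of text query i to video j.
    Assumption for this sketch: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix (made-up numbers, 3 queries x 3 videos).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.5, 0.1, 0.7],  # query 2: correct video ranked first
]

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy case two of three queries retrieve the correct video first, so R@1 is 2/3, and all three have it within the top two, so R@2 is 1.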

Abstract

Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance. This paper first studies the interaction paradigm in depth, where we find that its computation can be split into two terms, the interaction contents at different granularity and the matching function to distinguish pairs with the same semantics. We also observe that the single-vector representation and implicit intensive function substantially hinder the optimization. Based on these findings, we propose a disentangled framework to capture a sequential and hierarchical representation. Firstly, considering the natural sequential structure in both text and video inputs, a Weighted Token-wise Interaction (WTI) module is performed to decouple the content and adaptively exploit the pair-wise…
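
The abstract is cut off before it describes how WTI combines pair-wise correlations, so the following is only a rough sketch of what a weighted token-wise interaction could look like, not the authors' implementation. It assumes cosine similarity between every text token and every video frame, a max over the other modality for each token or frame, and softmax-normalized importance weights. The function name and the `text_logits` / `video_logits` inputs (standing in for whatever learned weighting the model uses) are hypothetical.

```python
import math


def _normalize(vec):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_token_wise_interaction(text_tokens, video_frames,
                                    text_logits, video_logits):
    """Score one text-video pair from token-level and frame-level features.

    text_logits / video_logits are hypothetical per-token importance scores;
    in a trained model they would be predicted, here they are plain inputs.
    """
    t = [_normalize(tok) for tok in text_tokens]
    v = [_normalize(frm) for frm in video_frames]
    # Pair-wise cosine similarity: sim[i][j] for text token i and video frame j.
    sim = [[sum(a * b for a, b in zip(ti, vj)) for vj in v] for ti in t]
    w_text = _softmax(text_logits)
    w_video = _softmax(video_logits)
    # Text-to-video: each token keeps its best-matching frame, then weighted sum.
    t2v = sum(w * max(row) for w, row in zip(w_text, sim))
    # Video-to-text: each frame keeps its best-matching token, then weighted sum.
    v2t = sum(w * max(sim[i][j] for i in range(len(t)))
              for j, w in enumerate(w_video))
    return 0.5 * (t2v + v2t)


# Toy features (made up): identical token/frame sets versus orthogonal ones.
score_match = weighted_token_wise_interaction(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0])
score_mismatch = weighted_token_wise_interaction(
    [[1.0, 0.0]], [[0.0, 1.0]], [0.0], [0.0])
print(score_match, score_mismatch)
```

Keeping per-token and per-frame features, rather than pooling each side into a single vector, is the point of contrast with the single-vector representation the abstract identifies as a hindrance to optimization.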

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization