TL;DR
This paper introduces RIVRL, a novel video representation learning method inspired by human reading strategies, which improves text-to-video retrieval by capturing both overview and detailed video features, achieving state-of-the-art results.
Contribution
Proposes a reading-strategy inspired dual-branch framework for video representation learning that enhances cross-modal retrieval performance.
Findings
Achieves a new state of the art on the TGIF and VATEX datasets.
Performs comparably to or better than models trained on large-scale datasets.
Effectively captures both overview and detailed video information.
Abstract
This paper addresses text-to-video retrieval: given a query in the form of a natural-language sentence, the task is to retrieve videos that are semantically relevant to the query from a large collection of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sentences into common spaces for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component of text-to-video retrieval. Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos, which consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch is designed to briefly capture the overview information of videos, while the intensive-reading branch is designed to…
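The retrieval step described in the abstract (embedding sentences and videos in a common space, then ranking by semantic similarity) can be illustrated with a minimal sketch. This is not the RIVRL model: the embeddings are toy vectors, cosine similarity is an assumed choice of similarity measure, and the names `cosine_similarity` and `rank_videos` are hypothetical helpers introduced only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return video ids ordered from most to least similar to the query."""
    scores = {
        video_id: cosine_similarity(query_embedding, embedding)
        for video_id, embedding in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for the outputs of a sentence encoder and a
# video encoder that share one common space (hypothetical values).
query = [1.0, 0.0, 1.0]
videos = {
    "video_a": [0.9, 0.1, 0.8],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.5, 0.5, 0.5],
}

ranking = rank_videos(query, videos)
print(ranking)
```

In a real system the video embedding would come from the learned video encoder and the query embedding from a sentence encoder; only the ranking step is shown here.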
Taxonomy
Methods: Attentive Walk-Aggregating Graph Neural Network
