HierVL: Learning Hierarchical Video-Language Embeddings

Kumar Ashutosh; Rohit Girdhar; Lorenzo Torresani; Kristen Grauman

arXiv:2301.02311 · cs.CV · June 9, 2023

TL;DR

HierVL introduces a hierarchical video-language embedding that captures both short-term (clip-level) and long-term (video-level) associations between video and text, yielding representations that serve both short-term and long-term video tasks.

Contribution

The paper presents HierVL, a novel hierarchical contrastive training method that aligns text and visual data at both the clip level and the video level, enhancing long-term video understanding.

Findings

01. Outperforms single-level embeddings in short-term video tasks

02. Achieves state-of-the-art results on long-term video modeling benchmarks

03. Successfully transfers to multiple downstream tasks in zero-shot and fine-tuned settings

Abstract

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is…
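
To make the two-level objective described in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the loss is a generic symmetric InfoNCE stand-in, the video-level embedding is obtained by mean-pooling clip embeddings, and the names `info_nce`, `hierarchical_loss`, `temperature`, and `video_weight` are hypothetical. The released code may aggregate clips and weight the two terms differently.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed aggregation scheme)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE: visual[i] and text[i] form the positive pair."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    # Temperature-scaled cosine similarity between every visual/text pair.
    sims = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(s - m) for s in row))
            total += lse - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sims)]  # text-to-visual direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))


def hierarchical_loss(clip_vis, clip_txt, summary_txt, video_weight=1.0):
    """Clip-level loss plus video-level loss.

    clip_vis[k] / clip_txt[k]: clip and narration embeddings of video k.
    summary_txt[k]: embedding of the summary text of video k.
    """
    # Clip level: each clip should match its own timestamped description.
    flat_vis = [c for video in clip_vis for c in video]
    flat_txt = [c for video in clip_txt for c in video]
    clip_loss = info_nce(flat_vis, flat_txt)
    # Video level: pooled clips should match the video's summary text.
    video_emb = [mean_pool(video) for video in clip_vis]
    video_loss = info_nce(video_emb, summary_txt)
    return clip_loss + video_weight * video_loss


if __name__ == "__main__":
    # Toy batch: 2 videos, 2 clips each, 4-dimensional embeddings.
    clips = [[[1, 0, 0, 0], [0, 1, 0, 0]], [[0, 0, 1, 0], [0, 0, 0, 1]]]
    summaries = [[1, 1, 0, 0], [0, 0, 1, 1]]
    print(hierarchical_loss(clips, clips, summaries))
```

In this toy setup the loss is small when each summary points in the same direction as its video's pooled clips, and grows when the summaries are swapped between videos, which is the behaviour a video-level alignment term is meant to encourage.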

Code & Models

Repositories

facebookresearch/hiervl (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning