Contrastive Language Video Time Pre-training

Hengyue Liu; Kyle Min; Hector A. Valdez; Subarna Tripathi

arXiv:2406.02631·cs.CV·June 6, 2024

Open Access

TL;DR

LAVITI is a contrastive learning approach that aligns language, video, and temporal features in long-form videos, enabling efficient training and state-of-the-art action recognition results.

Contribution

LAVITI introduces learnable moment queries and relative temporal embeddings for long-form video understanding, in contrast to existing methods that are designed mainly for short videos.

Findings

01 Achieves state-of-the-art results on CharadesEgo

02 Efficient training on Ego4D with 8 GPUs in one day

03 Effectively models temporal dynamics in long videos

Abstract

We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. Significantly different from traditional approaches, the prediction of a particular timestamp is transformed by computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to high…
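
The abstract describes timestamp prediction as a similarity search: a predicted TE is scored against all TEs, and the best match gives the timestamp. The sketch below illustrates that idea only and is not the authors' implementation. The sinusoidal embedding table, the function names (`temporal_embeddings`, `predict_timestamp`, `time_contrastive_loss`), the embedding size, and the temperature value are all assumptions made for illustration; the paper's TEs may be built and trained differently.

```python
import math

def temporal_embeddings(num_steps, dim):
    """Hypothetical TE table: one fixed sinusoidal vector per relative timestamp.
    (Assumption: the source does not specify how the TEs are constructed.)"""
    table = []
    for t in range(num_steps):
        pos = t / max(num_steps - 1, 1)  # relative position in [0, 1]
        row = []
        for k in range(dim // 2):
            angle = 0.5 * math.pi * (k + 1) * pos
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_timestamp(predicted_te, table):
    """Score the predicted TE against every TE and return the best index."""
    scores = [cosine(predicted_te, row) for row in table]
    return max(range(len(scores)), key=scores.__getitem__)

def time_contrastive_loss(predicted_te, table, target, temperature=0.1):
    """InfoNCE-style loss: the target TE is the positive, all others are negatives.
    (Assumption: the temperature and exact loss form are illustrative.)"""
    logits = [cosine(predicted_te, row) / temperature for row in table]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]

# Usage: a prediction close to the TE of step 5 should be matched to step 5.
table = temporal_embeddings(num_steps=8, dim=8)
noisy_prediction = [x + 0.01 for x in table[5]]
print(predict_timestamp(noisy_prediction, table))
print(time_contrastive_loss(noisy_prediction, table, target=5))
```

Framing the prediction as a similarity over a shared embedding table means the same contrastive objective used for vision and language alignment can also be applied to time, which is the design point the abstract emphasizes.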

Taxonomy

Topics: EFL/ESL Teaching and Learning · Subtitles and Audiovisual Media · Second Language Learning and Teaching

Methods: Sparse Evolutionary Training · ALIGN · Contrastive Learning