Contrastive Language Video Time Pre-training
Hengyue Liu, Kyle Min, Hector A. Valdez, Subarna Tripathi

TL;DR
LAVITI is a contrastive learning approach that aligns language, video, and temporal features in long-form videos. It trains efficiently and achieves state-of-the-art action recognition results on CharadesEgo.
Contribution
LAVITI introduces learnable moment queries and relative temporal embeddings for long-form video understanding, in contrast to existing methods, which are designed mainly for short videos.
Findings
Achieves state-of-the-art results on CharadesEgo
Trains efficiently on Ego4D: one day on 8 GPUs
Effectively models temporal dynamics in long videos
Abstract
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. Significantly different from traditional approaches, the prediction of a particular timestamp is transformed by computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to high…
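The timestamp-prediction step described in the abstract (score a predicted TE against all TEs, then pick the best match) can be sketched as below. This is a minimal illustration, not the authors' implementation. The sinusoidal form of the embeddings, the temperature value, and every function name here are assumptions made for the example; the abstract does not specify them.

```python
import math


def temporal_embeddings(num_steps, dim):
    """Hypothetical table of embeddings, one per relative timestamp in [0, 1].

    Assumption: a sinusoidal encoding. The paper may use a different form.
    """
    table = []
    for t in range(num_steps):
        pos = t / max(num_steps - 1, 1)  # position relative to video length
        row = []
        for i in range(dim // 2):
            freq = math.pi * (2 ** i)
            row.extend([math.sin(freq * pos), math.cos(freq * pos)])
        table.append(row)
    return table


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_timestamp(pred_te, table):
    """Return the index of the TE most similar to pred_te, plus all scores."""
    sims = [cosine(pred_te, te) for te in table]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


def time_contrastive_loss(pred_te, table, target, temperature=0.07):
    """InfoNCE-style loss: the target TE is the positive, all others negatives.

    Assumption: a softmax cross-entropy over similarity scores.
    """
    _, sims = predict_timestamp(pred_te, table)
    logits = [s / temperature for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


if __name__ == "__main__":
    table = temporal_embeddings(num_steps=16, dim=8)
    # Stand-in for a model output: the TE of step 5 with a small offset.
    pred = [x + 0.01 for x in table[5]]
    index, _ = predict_timestamp(pred, table)
    print("predicted step:", index)
    print("loss:", time_contrastive_loss(pred, table, target=5))
```

Framing the prediction as a similarity lookup means the same contrastive objective used for vision-language alignment can also supervise time, rather than requiring a separate regression head for timestamps.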
Taxonomy
Topics: EFL/ESL Teaching and Learning · Subtitles and Audiovisual Media · Second Language Learning and Teaching
Methods: Sparse Evolutionary Training · ALIGN · Contrastive Learning
