Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu, Erhan Gundogdu, Maksim Lapin, Bernard Ghanem, Michael Donoser, Loris Bazzani

TL;DR
This paper introduces a novel contrastive pre-training method that leverages language to improve temporal localization in videos, addressing limitations of existing approaches by enabling the video encoder to learn temporal boundaries and relations.
Contribution
It proposes a masked contrastive learning approach that captures visio-linguistic relations without freezing the video encoder, enhancing generalization and performance in temporal localization tasks.
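As a rough illustration of the kind of objective contrastive video-language pre-training builds on, the sketch below computes a generic symmetric (CLIP-style) InfoNCE loss between a batch of video-clip embeddings and their paired text embeddings. This is an assumption made for illustration, not the paper's loss: the summary does not specify the masking scheme, so masking is not reproduced here, and the function names and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's method).

    video_emb[i] and text_emb[i] are assumed to be a matching pair; every
    other text in the batch acts as a negative for video i, and vice versa.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)

    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text direction: the correct text for row i sits at column i.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy batch: two clips and two captions in a 2-D embedding space.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])
```

In the toy batch, the correctly paired embeddings give a loss near zero while the swapped pairs give a large one. Because the loss depends on the video embeddings, its gradient can update the video encoder, which is what fine-tuning without freezing relies on.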
Findings
Improves state-of-the-art in temporal action localization
Enhances few-shot temporal action localization
Boosts performance in video language grounding
Abstract
Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by compute device memory constraints and the lack of temporal annotations at large scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
