Contrastive Language-Action Pre-training for Temporal Localization

Mengmeng Xu; Erhan Gundogdu; Maksim Lapin; Bernard Ghanem; Michael Donoser; Loris Bazzani

arXiv:2204.12293·cs.CV·April 27, 2022·5 cites

TL;DR

This paper introduces a novel contrastive pre-training method that leverages language to improve temporal localization in videos, addressing limitations of existing approaches by enabling the video encoder to learn temporal boundaries and relations.

Contribution

It proposes a masked contrastive learning approach that captures visio-linguistic relations without freezing the video encoder, enhancing generalization and performance in temporal localization tasks.
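As a rough illustration of what a contrastive objective between video and language features can look like, here is a minimal sketch of a generic symmetric InfoNCE-style loss in plain Python. It is not the paper's implementation: the function names, the `temperature` default, and the optional `mask` argument (used here only to drop selected negative pairs from the softmax) are assumptions made for this example, and the page does not describe how the paper's own masking works.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_rows(logits, mask):
    """Mean cross-entropy where row i's positive is column i.

    Columns with mask[i][j] == False are left out of row i's softmax.
    The positive (diagonal) entry is always kept.
    """
    losses = []
    for i, row in enumerate(logits):
        kept = [row[j] for j in range(len(row)) if mask[i][j] or i == j]
        top = max(kept)
        # Numerically stable log-sum-exp over the kept entries.
        log_denom = top + math.log(sum(math.exp(v - top) for v in kept))
        losses.append(log_denom - row[i])
    return sum(losses) / len(losses)


def contrastive_loss(video_feats, text_feats, temperature=0.07, mask=None):
    """Symmetric video-to-text and text-to-video contrastive loss.

    video_feats[i] and text_feats[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(video_feats)
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    if mask is None:
        mask = [[True] * n for _ in range(n)]

    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Transposes give the text-to-video direction.
    logits_t = [[logits[j][i] for j in range(n)] for i in range(n)]
    mask_t = [[mask[j][i] for j in range(n)] for i in range(n)]

    return 0.5 * (
        cross_entropy_rows(logits, mask) + cross_entropy_rows(logits_t, mask_t)
    )
```

With toy features, matched pairs should score a lower loss than mismatched ones, for example `contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` against the same call with the text features swapped. In a real training setup the features would come from trainable video and text encoders and the loss would be computed with an autodiff framework.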

Findings

01 Improves state-of-the-art in temporal action localization

02 Enhances few-shot temporal action localization

03 Boosts performance in video language grounding

Abstract

Long-form video understanding requires approaches that can temporally localize activities or language. End-to-end training for such tasks is limited by compute device memory constraints and by the lack of large-scale temporal annotations. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents the encoder from capturing the relations between different action categories and the background context in a video clip, which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training · Contrastive Learning