FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Mona Ahmadian; Frank Guerin; Andrew Gilbert

arXiv:2406.03447·cs.CV·June 6, 2024·1 cites

Open Access

TL;DR

FILS introduces a self-supervised method for learning semantic video representations by predicting masked features in language space, enhancing transferability to action recognition tasks with less computation.

Contribution

The paper proposes a novel self-supervised video feature prediction method in semantic language space, improving transferability and efficiency over previous approaches.

Findings

01. Achieves state-of-the-art results on egocentric datasets
02. Uses less computation and smaller batches
03. Demonstrates strong transferability to downstream tasks

Abstract

This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that masking strategies for vision and natural language supervision have both contributed to developing transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised method for video Feature prediction In semantic Language Space. The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy, in which the text representations act as prototypes for transforming vision features into a language space, which are then used as targets for semantically meaningful…
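The idea of using text representations as prototypes can be illustrated with a small sketch. This is not the authors' implementation: the function names, the temperature value, the softmax-over-similarities projection, and the cross-entropy loss are all assumptions chosen to give one plausible reading of "predicting masked features in language space". Random arrays stand in for real video patch features and text embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def to_language_space(patch_feats, text_protos, temperature=0.07):
    """Hypothetical projection: describe each patch as a distribution
    over text prototypes via temperature-scaled cosine similarity."""
    sims = l2_normalize(patch_feats) @ l2_normalize(text_protos).T
    return softmax(sims / temperature, axis=-1)

def masked_prediction_loss(pred_feats, target_feats, text_protos, mask,
                           temperature=0.07):
    """Assumed objective: cross-entropy between the language-space view of
    the predicted features and that of the target features, averaged over
    masked patches only."""
    target = to_language_space(target_feats, text_protos, temperature)
    pred = to_language_space(pred_feats, text_protos, temperature)
    ce = -(target * np.log(pred + 1e-12)).sum(axis=-1)
    return float(ce[mask].mean())

# Toy data standing in for encoder outputs (all sizes are illustrative).
rng = np.random.default_rng(0)
num_patches, dim, num_texts = 16, 32, 4
text_protos = rng.normal(size=(num_texts, dim))      # text embeddings as prototypes
target_feats = rng.normal(size=(num_patches, dim))   # features of the unmasked view
pred_feats = target_feats + 0.1 * rng.normal(size=(num_patches, dim))  # predictor output
mask = np.zeros(num_patches, dtype=bool)
mask[::2] = True                                     # pretend every other patch is masked

loss = masked_prediction_loss(pred_feats, target_feats, text_protos, mask)
print(f"masked language-space prediction loss: {loss:.4f}")
```

In this sketch the loss is computed only at masked positions, so the predictor is rewarded for recovering the semantics of patches it could not see rather than for copying visible ones.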

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques