A Straightforward Framework For Video Retrieval Using CLIP

Jesús Andrés Portillo-Quintero; José Carlos Ortiz-Bayliss; Hugo Terashima-Marín

arXiv:2102.12443·cs.CV·March 1, 2021


TL;DR

This paper presents a simple framework for video retrieval that uses the CLIP language-image model to build video representations without user annotations, reporting state-of-the-art results on the MSR-VTT and MSVD benchmarks.

Contribution

The work extends CLIP, a model trained to compare images and text in a common space, to videos, enabling annotation-free video retrieval that outperforms existing methods on the evaluated benchmarks.

Findings

01 Achieved state-of-the-art results on the MSR-VTT and MSVD benchmarks.

02 Demonstrated the effectiveness of CLIP-based representations for video retrieval.

03 Eliminated the need for user annotations in video retrieval tasks.

Abstract

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
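
The abstract does not say which techniques extend CLIP from images to videos, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes that each video frame and each text query has already been embedded into the shared space by some encoder (the toy vectors below stand in for real CLIP embeddings), and it assumes mean pooling of frame embeddings as the aggregation step. The function names `l2_normalize`, `mean_pool`, and `rank_videos` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Collapse per-frame embeddings into one video embedding (assumed aggregation)."""
    if not frame_embeddings:
        raise ValueError("a video needs at least one frame embedding")
    normalized = [l2_normalize(f) for f in frame_embeddings]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def rank_videos(text_embedding, videos):
    """Return (video_name, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embedding)
    scored = []
    for name, frames in videos.items():
        video_vec = mean_pool(frames)
        score = sum(q * v for q, v in zip(query, video_vec))
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real frame and text embeddings.
videos = {
    "cooking": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "soccer": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
query = [1.0, 0.0, 0.0]  # pretend this is the embedding of a cooking-related caption

for name, score in rank_videos(query, videos):
    print(f"{name}: {score:.3f}")
```

Because both sides are unit vectors, ranking by dot product is the same as ranking by cosine similarity. Swapping `videos` and the query roles gives the video-to-text direction mentioned in the abstract.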


Code & Models

Repositories

Deferf/CLIP_Video_Representation
PyTorch · Official
