A CLIP-Hitchhiker's Guide to Long Video Retrieval

Max Bain; Arsha Nagrani; Gül Varol; Andrew Zisserman

arXiv:2205.08508·cs.CV·May 18, 2022·25 cites

Open Access · 1 Repo

TL;DR

This paper adapts CLIP for long video retrieval by introducing a weighted-mean temporal aggregation method based on query-scoring, achieving state-of-the-art results with a simple yet effective approach.

Contribution

It proposes a novel weighted-mean temporal aggregation method for CLIP-based video retrieval, outperforming previous temporal modeling techniques.

Findings

01 Weighted-mean aggregation of frame embeddings via query-scoring significantly improves retrieval performance over mean-pooling.

02 The simple baseline outperforms all prior, more complex temporal modelling methods.

03 Achieves state-of-the-art results on a suite of long video retrieval benchmarks.

Abstract

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning a temporal aggregation that outperforms mean-pooling of the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and over mean-pooling. In doing so, we provide an improved baseline for others to compare against and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
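To make the "weighted mean of frame embeddings via query-scoring" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes the weights come from a temperature-scaled softmax over query-frame cosine similarities, and the function names (`query_scored_video_embedding`, `retrieval_score`) and the default `temperature` value are hypothetical. Embeddings are assumed to be non-zero vectors, as would be produced by CLIP's text and image encoders.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_scored_video_embedding(query, frames, temperature=0.1):
    """Pool per-frame embeddings into one video embedding, weighted by the query.

    Assumption: weights = softmax(cosine(query, frame) / temperature).
    A very large temperature approaches plain mean-pooling; a very small
    one approaches picking the single best-matching frame.
    """
    q = l2_normalize(query)
    fs = [l2_normalize(f) for f in frames]
    scores = [dot(q, f) / temperature for f in fs]
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[i] for w, f in zip(weights, fs)) for i in range(len(q))]
    return pooled, weights


def retrieval_score(query, frames, temperature=0.1):
    """Cosine similarity between the query and the query-conditioned video embedding."""
    pooled, _ = query_scored_video_embedding(query, frames, temperature)
    return dot(l2_normalize(query), l2_normalize(pooled))
```

Because the pooled video embedding depends on the query, it has to be recomputed for every query-video pair at retrieval time, whereas a mean-pooled embedding can be computed once per video. The per-frame embeddings themselves can still be cached.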

Code & Models

Repositories

m-bain/clip-hitchhiker (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training