CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Shuai Zhao; Linchao Zhu; Xiaohan Wang; Yi Yang

arXiv:2205.00823·cs.CV·May 3, 2022

TL;DR

CenterCLIP clusters visual tokens to remove redundancy in text-video retrieval, cutting training memory cost by 35% and speeding up inference by 14% while improving semantic alignment and reaching state-of-the-art performance.

Contribution

The paper proposes a multi-segment token clustering algorithm for efficient visual token reduction in CLIP-based video retrieval models, enhancing speed and accuracy.

Findings

1. Reduces training memory cost by 35%
2. Speeds up inference by 14%
3. Achieves state-of-the-art results on text-video benchmarks

Abstract

Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens because consecutive, similar frames in videos are redundant. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering.…
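The segment-level clustering idea described in the abstract can be illustrated with a small sketch. The abstract shown here is truncated and does not name the clustering algorithm, so the k-medoids routine below is an assumption chosen for illustration, not the authors' method; the function names `k_medoids` and `cluster_video_tokens`, the evenly spaced initialization, and the parameters `num_segments` and `k` are all hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_medoids(tokens, k, iters=10):
    """Return sorted indices of up to k representative (medoid) tokens.

    Assumption: plain k-medoids with evenly spaced, deterministic init.
    """
    n = len(tokens)
    k = min(k, n)
    medoids = [i * n // k for i in range(k)]
    for _ in range(iters):
        # Assignment step: each token joins its nearest medoid.
        clusters = [[] for _ in range(k)]
        for i, t in enumerate(tokens):
            j = min(range(k), key=lambda c: sq_dist(t, tokens[medoids[c]]))
            clusters[j].append(i)
        # Update step: the new medoid minimizes total distance in its cluster.
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda m: sum(sq_dist(tokens[m], tokens[o]) for o in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return sorted(set(medoids))

def cluster_video_tokens(frames, num_segments, k):
    """Split frames into consecutive segments and keep k medoid tokens per segment.

    frames: list of frames, each a list of token vectors (lists of floats).
    Returns one reduced token list per segment.
    """
    seg_len = math.ceil(len(frames) / num_segments)
    reduced = []
    for start in range(0, len(frames), seg_len):
        # Pool the tokens of all frames in this segment before clustering.
        pooled = [tok for frame in frames[start:start + seg_len] for tok in frame]
        keep = k_medoids(pooled, k)
        reduced.append([pooled[i] for i in keep])
    return reduced
```

With four frames of two tokens each, `cluster_video_tokens(frames, num_segments=2, k=2)` would keep four of the eight tokens, two per segment. Clustering within segments rather than over the whole video follows the abstract's observation that redundancy is concentrated in consecutive frames.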

Code & Models

Repositories

mzhaoshuai/CenterCLIP
PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Multi-Head Attention · Layer Normalization · Contrastive Language-Image Pre-training · Dense Connections · Vision Transformer