TL;DR
CenterCLIP introduces a token clustering method to reduce redundancy in visual tokens for text-video retrieval, significantly lowering computation costs and improving semantic alignment, leading to state-of-the-art performance.
Contribution
The paper proposes a multi-segment token clustering algorithm for efficient visual token reduction in CLIP-based video retrieval models, enhancing speed and accuracy.
Findings
Reduces training memory cost by up to 35%
Speeds up inference by up to 14%
Achieves state-of-the-art results on text-video retrieval benchmarks
Abstract
Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As the frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering.…
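The segment-level clustering idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the truncated abstract does not name the clustering algorithm, so the code below assumes a plain k-means-style loop and then keeps the real token nearest to each centroid. The function names (`cluster_segment`, `multi_segment_reduce`), the `(frames, tokens_per_frame, dim)` input layout, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def cluster_segment(tokens, k, iters=10, seed=0):
    """Return indices of at most k representative tokens in `tokens` (n, d).

    Assumption: k-means-style refinement, then pick the actual token
    closest to each centroid. The paper's own algorithm may differ.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = min(k, n)
    # Fancy indexing copies, so updating `centers` leaves `tokens` untouched.
    centers = tokens[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every center: (n, k).
        dist = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    dist = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    # Nearest real token per center; duplicates collapse, hence "at most k".
    return np.unique(dist.argmin(axis=0))


def multi_segment_reduce(video_tokens, num_segments, k):
    """Split frames into temporal segments and keep representative tokens.

    video_tokens: array of shape (frames, tokens_per_frame, dim).
    Returns an array of shape (kept, dim) with kept <= num_segments * k.
    """
    dim = video_tokens.shape[-1]
    kept = []
    for segment in np.array_split(video_tokens, num_segments, axis=0):
        flat = segment.reshape(-1, dim)  # pool tokens of consecutive frames
        if flat.shape[0] == 0:
            continue  # more segments than frames: nothing to cluster
        kept.append(flat[cluster_segment(flat, k)])
    return np.concatenate(kept, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    video = rng.normal(size=(12, 49, 8))  # 12 frames, 49 patch tokens, dim 8
    reduced = multi_segment_reduce(video, num_segments=3, k=20)
    print(video.reshape(-1, 8).shape, "->", reduced.shape)
```

Clustering within each segment, rather than over the whole video, follows the abstract's observation that redundancy occurs mostly in consecutive frames. It also keeps each distance matrix small, since only the tokens of one segment are compared at a time.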
Taxonomy
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Multi-Head Attention · Layer Normalization · Contrastive Language-Image Pre-training · Dense Connections · Vision Transformer
