TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding

TL;DR
TempMe is an efficient text-video retrieval method that merges redundant temporal tokens across video frames, cutting computational cost and speeding up retrieval while improving accuracy.
Contribution
The paper proposes TempMe, a parameter-efficient and computationally efficient architecture that leverages progressive multi-granularity token merging for enhanced temporal modeling in video retrieval.
Findings
Reduces output tokens by 95%
Achieves 1.8X speedup in retrieval
Uses 51% fewer GFLOPs
Abstract
Most text-video retrieval methods utilize the text-image pre-trained models like CLIP as a backbone. These methods process each sampled frame independently by the image encoder, resulting in high computational overhead and limiting practical deployment. Addressing this, we focus on efficient text-video retrieval by tackling two key challenges: 1. From the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; 2. From the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training-inference efficient text-video retrieval architecture that minimizes trainable parameters and model complexity.…
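The idea of merging temporally redundant tokens can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple bipartite matching between the tokens of two consecutive frames, in which each token of the first frame is matched to its most similar token in the second frame and the `r` most similar pairs are averaged. The function names `cosine` and `merge_frames`, the parameter `r`, and the choice of partitioning tokens by frame are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_frames(frame_a, frame_b, r):
    """Merge up to r tokens of frame_a into their best matches in frame_b.

    Hypothetical sketch: the partition (frame_a vs. frame_b) and the plain
    averaging rule are assumptions, not the paper's exact procedure.
    """
    if not frame_b:
        return [list(tok) for tok in frame_a]
    r = max(0, min(r, len(frame_a)))

    # For every token in frame_a, find its most similar token in frame_b.
    matches = []
    for i, tok_a in enumerate(frame_a):
        sims = [cosine(tok_a, tok_b) for tok_b in frame_b]
        j = max(range(len(frame_b)), key=sims.__getitem__)
        matches.append((sims[j], i, j))

    # Keep only the r most similar (most redundant) pairs for merging.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged = {i: j for _, i, j in matches[:r]}

    # Accumulate merged tokens into frame_b and average them.
    sums = [list(tok) for tok in frame_b]
    counts = [1] * len(frame_b)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], frame_a[i])]
        counts[j] += 1

    kept_a = [list(tok) for i, tok in enumerate(frame_a) if i not in merged]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return kept_a + merged_b
```

Calling `merge_frames(frame_a, frame_b, r)` on two identical frames with `r` equal to the number of tokens per frame halves the combined token count, which is the kind of temporal redundancy the abstract describes. Applying such a step repeatedly across layers would shrink the token sequence progressively, although the schedule for doing so is left open here.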
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper addresses the issue of spatial-temporal redundancy in videos for text-video retrieval and introduces an efficient method that achieves faster training times and inference. 2. Extensive experiments and analyses of TempMe demonstrate its efficiency, effectiveness, and generalization capabilities. 3. The ClipMe Block mainly involves two steps, "Intra-clip Merging" and "Cross-clip Merging", each employing distinct methods for token grouping. This design effectively aids in information m
1. Although the paper addresses spatial-temporal redundancy in text-video retrieval, there is already substantial work on token merging and pruning in video processing. This overlap may affect the perceived uniqueness of the proposed approach. 2. A few symbols lack adequate definitions, which may hinder readability. For instance, "R_c" in Section 3.2, while defined in Figure 3 as the ratio of kept tokens, should also be described in the text when first introduced for clarity.
1. **Efficient Training/Inference Acceleration**: TempMe achieves significant improvements in training efficiency by implementing image merging followed by progressive video clip merging. This approach leads to substantial reductions in GFLOPS and training time compared to previous methods. 2. **Comprehensive Ablation Studies**: Extensive ablation experiments validate the effectiveness of clip merging at different layers and intervals, as well as its applicability on stronger video-pretrained b
1. **Limited Improvement in R@1 for T2V Retrieval**: The R@1 improvement over previous methods is relatively small, especially given the advances in MLLM models. In 2023, T2V retrieval methods like HBI and Cap4video on CLIP-ViT-32/16, R@1 have reached around 48/50 (e.g., [1][2]). I suspect the limited improvement may stem from TempMe still focusing on video merging within the encoder, without further optimization after obtaining the clip-level representation. 2. **Lack of Memory Usage Compariso
The paper addresses an important task, namely text-video retrieval and proposes a parameter-efficient method for this. One strength of the paper is that the proposed method was extended to video foundation methods such as UMT and various backbones showcasing the extensibility of the method. Also, the method is based on a well-known fact, namely the video has a lot of redundant information from frame to frame and the method is built upon that observation and compresses the redundant information.
- Abstract: the abstract is hard to follow and is not clear if the method addressed inference speed-ups or training time speed-ups or both. - overall the quality of the writing can be improved because it's not straightforward to follow the paper, for example in the introduction the transitions between paragraphs are abrupt. - the choice of sampling 12 frames for MSRVTT and 64 frames for ActivityNet seems arbitrary.
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Focus
