Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Ni Wang, Dongliang Liao, Xing Xu

TL;DR
This paper introduces the Multi-Scale Temporal Difference Transformer (MSTDT), a novel model that enhances video-text retrieval by capturing local and global temporal information through difference features and multi-scale transformers.
Contribution
The paper proposes MSTDT, a transformer variant that effectively models local and global temporal dynamics in videos, improving retrieval performance over existing methods.
Findings
Achieves state-of-the-art results when MSTDT is combined with a CLIP backbone.
Effectively models both local and global temporal information.
Introduces a new loss to improve sample similarity.
Abstract
Video-text retrieval currently has many transformer-based methods. Most of them stack frame features, regard frames as tokens, and then use transformers for video temporal modeling. However, they commonly overlook the transformer's weak ability to model local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT). MSTDT mainly addresses the traditional transformer's limited ability to capture local temporal information. In addition, to better model detailed dynamic information, we use the difference feature between frames, which reflects the dynamic movement of a video. We extract the inter-frame difference feature and integrate the difference and frame features with the multi-scale temporal transformer. In general,…
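The inter-frame difference feature mentioned in the abstract can be illustrated with a minimal sketch. It assumes the difference is a plain element-wise subtraction of consecutive per-frame feature vectors; the abstract does not spell out the exact formulation, and the function name `frame_differences` and the toy values are hypothetical, not taken from the paper.

```python
from typing import List

Vector = List[float]


def frame_differences(frame_features: List[Vector]) -> List[Vector]:
    """Return element-wise differences between consecutive frame features.

    Assumption (not confirmed by the source): the difference feature is
    frame_features[t + 1] - frame_features[t]. For T frames this yields
    T - 1 difference vectors.
    """
    if len(frame_features) < 2:
        return []
    return [
        [cur - prev for prev, cur in zip(prev_frame, cur_frame)]
        for prev_frame, cur_frame in zip(frame_features, frame_features[1:])
    ]


# Toy example: 3 frames with 2-dimensional features (made-up values).
frames = [[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]]
diffs = frame_differences(frames)
print(diffs)
```

In a real pipeline the per-frame vectors would come from a visual encoder such as CLIP, and the resulting differences would be fed, together with the frame features, into the temporal transformer; how the two are combined is not described in the visible part of the abstract.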
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
