ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Adriano Fragomeni, Michael Wray, Dima Damen

TL;DR
ConTra introduces a context-aware transformer architecture that leverages local temporal context to significantly improve cross-modal video retrieval accuracy, especially for short or ambiguous clips.
Contribution
The paper proposes a novel Context Transformer (ConTra) that models interactions between video clips and their local context using contrastive supervision, enhancing retrieval performance.
Findings
Improved retrieval accuracy on YouCook2, EPIC-KITCHENS, and ActivityNet datasets.
Effective modeling of local temporal context boosts performance.
Ablation studies confirm the importance of context modeling.
Abstract
In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
MethodsAttention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Softmax · Label Smoothing · Multi-Head Attention · Adam · Dense Connections
