ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Adriano Fragomeni; Michael Wray; Dima Damen

arXiv:2210.04341·cs.CV·October 11, 2022

TL;DR

ConTra is a transformer encoder that uses a clip's local temporal context (the surrounding video segments) to improve cross-modal clip-sentence retrieval, particularly when the clip is short or visually ambiguous.

Contribution

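The paper proposes the Context Transformer (ConTra), an encoder that models the interaction between a clip and its local temporal context to enhance the clip's embedded representation. The encoder is supervised with contrastive losses in the cross-modal embedding space, and context transformers are explored for both the video and text modalities.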
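To make the idea of a clip interacting with its local temporal context concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the window size `m`, the helper names `context_window`, `attend` and `enrich_clip`, and the use of a single attention head without learned projections are all assumptions made for illustration.

```python
import math


def context_window(num_clips, index, m):
    """Indices of the clip at `index` plus up to `m` neighbours per side.

    The window is clamped to the video boundaries (an assumption; padding
    would be another reasonable choice).
    """
    lo = max(0, index - m)
    hi = min(num_clips - 1, index + m)
    return list(range(lo, hi + 1))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Single-head scaled dot-product attention; values are the keys themselves."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    output = [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(dim)]
    return weights, output


def enrich_clip(features, index, m):
    """Mix a clip's feature with its local context and add a residual connection.

    Hypothetical helper: a real encoder would add learned projections,
    positional information and several layers.
    """
    indices = context_window(len(features), index, m)
    window = [features[i] for i in indices]
    weights, context = attend(features[index], window)
    enriched = [f + c for f, c in zip(features[index], context)]
    return enriched, dict(zip(indices, weights))


# Toy per-clip features for one untrimmed video (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
enriched, weights = enrich_clip(features, index=1, m=1)
```

The returned `weights` dictionary maps each context clip index to its attention weight, which is one simple way to inspect how much each neighbouring segment contributes to the enriched clip representation.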

Findings

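01. Retrieval performance improves on all three datasets evaluated: YouCook2, EPIC-KITCHENS, and a clip-sentence version of ActivityNet Captions.

02. Modeling a clip's local temporal context (its surrounding video segments) improves retrieval performance.

03. Ablation studies and context analysis confirm the importance of context modeling.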

Abstract

In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the…
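The abstract states that the context transformer is supervised with contrastive losses in the cross-modal embedding space but, as quoted here, does not give their exact form. The sketch below shows one common choice, a symmetric softmax-based contrastive loss over cosine similarities, purely as an illustration. The function names, the temperature value and the loss formulation itself are assumptions, not the paper's definition.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def similarity_matrix(clips, sentences):
    """Cosine similarity between every clip and every sentence embedding."""
    clips = [l2_normalize(c) for c in clips]
    sentences = [l2_normalize(s) for s in sentences]
    return [[sum(a * b for a, b in zip(c, s)) for s in sentences] for c in clips]


def log_softmax_at(row, target, temperature):
    """Log-probability of `target` under a temperature-scaled softmax of `row`."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return scaled[target] - log_sum_exp


def symmetric_contrastive_loss(clips, sentences, temperature=0.07):
    """Average of clip-to-sentence and sentence-to-clip contrastive terms.

    Assumes clips[i] and sentences[i] form the matching pair and every other
    pairing in the batch is a negative. The temperature is an assumed value.
    """
    sim = similarity_matrix(clips, sentences)
    n = len(sim)
    clip_to_sentence = -sum(log_softmax_at(sim[i], i, temperature) for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    sentence_to_clip = -sum(log_softmax_at(columns[j], j, temperature) for j in range(n)) / n
    return 0.5 * (clip_to_sentence + sentence_to_clip)


# Toy embeddings (made-up numbers): aligned pairs versus swapped pairs.
clip_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(clip_embeddings, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(clip_embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is lower when each clip embedding points in the same direction as its own sentence embedding than when the pairs are swapped, which is the behaviour a cross-modal contrastive objective is meant to reward.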

Code & Models

Repositories

adrianofragomeni/contra (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Softmax · Label Smoothing · Multi-Head Attention · Adam · Dense Connections