TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection
Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie

TL;DR
TR-DETR is a DETR-based task-reciprocal transformer that exploits the relationship between video moment retrieval and highlight detection: it aligns multi-modal features in a shared latent space and lets each task inform the other.
Contribution
The paper proposes a task-reciprocal transformer that explicitly models the mutual influence between moment retrieval (MR) and highlight detection (HD), in contrast to prior DETR-based methods that simply attach two separate task heads after feature extraction and interaction.
Findings
Outperforms state-of-the-art methods on multiple datasets
Effectively aligns multi-modal features into a shared space
Utilizes reciprocity to refine retrieval and highlight prediction
Abstract
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores of each video clip. Recently, several methods have been devoted to building DETR-based networks to solve both MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement is designed…
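As a rough illustration of the two task outputs the abstract describes, the sketch below projects clip and query features into a shared space, scores each clip against the query (an HD-style output), and picks a time span from those scores (an MR-style output). This is not TR-DETR's alignment module or its DETR-style moment decoder: the function names, the plain linear projections, the cosine scoring, the thresholded-run heuristic, and the 2-second clip length are all assumptions made for illustration.

```python
import math


def project(vec, weight):
    """Linear projection; `weight` is a list of rows, one per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def cosine(a, b):
    """Cosine similarity, guarding against zero-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def highlight_scores(clip_feats, query_feat, w_video, w_text):
    """HD-style output: one score per clip.

    Hypothetical stand-in: cosine similarity between each projected clip
    and the projected query in the shared space.
    """
    q = project(query_feat, w_text)
    return [cosine(project(c, w_video), q) for c in clip_feats]


def best_moment(scores, clip_len_s=2.0, threshold=0.5):
    """MR-style output: (start_s, end_s) of one moment, or None.

    Simplification (not the DETR decoder): take the contiguous run of
    clips scoring at or above `threshold` with the largest total score.
    """
    best, best_sum, start = None, float("-inf"), None
    # The trailing -inf sentinel closes a run that reaches the last clip.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            run_sum = sum(scores[start:i])
            if run_sum > best_sum:
                best, best_sum = (start * clip_len_s, i * clip_len_s), run_sum
            start = None
    return best


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projection weights
    clips = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.1, 0.9]]
    query = [0.0, 1.0]
    scores = highlight_scores(clips, query, identity, identity)
    print(scores, best_moment(scores))
```

Deriving the moment directly from the per-clip scores is one simple way to see why the two tasks are related: clips inside a relevant moment should tend to receive high highlight scores, and vice versa.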
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam · Dropout · Feedforward Network · Absolute Position Encodings
