TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

Hao Sun; Mingyao Zhou; Wenjing Chen; Wei Xie

arXiv:2401.02309 · cs.CV · January 8, 2024 · 1 citation

TL;DR

TR-DETR introduces a novel task-reciprocal transformer that leverages the inherent relationship between video moment retrieval and highlight detection, improving performance through shared feature alignment and task cooperation.

Contribution

The paper proposes a task-reciprocal transformer that explicitly models the mutual influence between MR and HD, enhancing joint video analysis beyond existing separate or loosely coupled methods.

Findings

01. Outperforms state-of-the-art methods on multiple datasets
02. Effectively aligns multi-modal features into a shared space
03. Utilizes reciprocity to refine retrieval and highlight prediction

Abstract

Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain the relevant moments within a video and a highlight score for each video clip. Recently, several methods have built DETR-based networks to solve MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, and achieve good performance. Nevertheless, these approaches underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement is designed…
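The abstract describes MR as returning moments (time spans) and HD as assigning a score to every clip. The sketch below only illustrates how those two output formats relate to each other; it is not TR-DETR's mechanism, which is learned inside the network. The function names, the fixed clip length, and the score threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def moments_to_clip_mask(moments: List[Moment], num_clips: int, clip_len: float) -> List[int]:
    """MR -> HD direction: flag each clip whose center lies inside any moment."""
    mask = []
    for i in range(num_clips):
        center = (i + 0.5) * clip_len
        inside = any(start <= center <= end for start, end in moments)
        mask.append(int(inside))
    return mask


def scores_to_moments(scores: List[float], clip_len: float, threshold: float) -> List[Moment]:
    """HD -> MR direction: merge runs of clips scoring >= threshold into spans."""
    moments: List[Moment] = []
    run_start = None  # index of the first clip in the current high-score run
    for i, score in enumerate(scores):
        if score >= threshold and run_start is None:
            run_start = i
        elif score < threshold and run_start is not None:
            moments.append((run_start * clip_len, i * clip_len))
            run_start = None
    if run_start is not None:  # close a run that reaches the end of the video
        moments.append((run_start * clip_len, len(scores) * clip_len))
    return moments


# Toy example: five 2-second clips with made-up highlight scores.
toy_scores = [0.1, 0.8, 0.9, 0.2, 0.7]
toy_moments = scores_to_moments(toy_scores, clip_len=2.0, threshold=0.5)
toy_mask = moments_to_clip_mask(toy_moments, num_clips=5, clip_len=2.0)
```

In this toy setting, the clips flagged by the retrieved moments are the same clips that carry high highlight scores, which is the kind of mutual consistency between the two tasks that the abstract refers to as reciprocity.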

Code & Models

Repositories

mingyao1120/tr-detr (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam · Dropout · Feedforward Network · Absolute Position Encodings