LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection

Pengcheng Zhao; Zhixian He; Fuwei Zhang; Shujin Lin; Fan Zhou

arXiv:2501.10787 · cs.CV · January 22, 2025

TL;DR

LD-DETR introduces a novel loop decoder transformer that enhances video moment retrieval and highlight detection by addressing semantic overlap, local feature extraction, and decoding limitations, achieving superior results on multiple benchmarks.

Contribution

The paper proposes LD-DETR, a new model that improves multimodal feature alignment, local feature extraction, and decoding efficiency for video understanding tasks.

Findings

01. Outperforms state-of-the-art methods on the QVHighlights, Charades-STA, and TACoS datasets.

02. Mitigates the semantic overlap between samples in these datasets.

03. Improves local feature extraction and decoding in multimodal video analysis.

Abstract

Video Moment Retrieval and Highlight Detection aim to find the content in a video that corresponds to a text query. Existing models usually first use contrastive learning to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode that information. However, existing methods face several issues: (1) overlapping semantic information between different samples in the dataset hinders the model's multimodal alignment; (2) existing models cannot efficiently extract local features of the video; (3) the Transformer Decoder used by existing models cannot adequately decode multimodal features. To address these issues, we propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distill the similarity matrix into the identity matrix to…
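Issue (1) can be made concrete with a small sketch of the generic contrastive alignment step the abstract attributes to existing models. This is not LD-DETR's own method, whose description is truncated above. The function names, the toy feature vectors, and the temperature of 0.07 are all assumptions chosen for illustration. The sketch compares the batch similarity matrix against the identity matrix, so two samples with the same semantics are still treated as negatives for each other.

```python
import math


def cosine_similarity_matrix(video_feats, text_feats):
    """Return S where S[i][j] is the cosine similarity of video i and text j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    v_norm = [normalize(v) for v in video_feats]
    t_norm = [normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(v, t)) for t in t_norm] for v in v_norm]


def cross_entropy_to_identity(sim, temperature):
    """Mean cross-entropy between row-wise softmax(sim / temperature) and the identity matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # only the diagonal entry is a positive
    return total / len(sim)


def contrastive_alignment_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric (video-to-text and text-to-video) contrastive loss; temperature is an assumed value."""
    sim = cosine_similarity_matrix(video_feats, text_feats)
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        cross_entropy_to_identity(sim, temperature)
        + cross_entropy_to_identity(sim_transposed, temperature)
    )


# Toy batch with two semantically distinct samples: the loss is close to zero.
distinct_loss = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[1.0, 0.0], [0.0, 1.0]])

# Toy batch whose two samples share the same semantics: the identity target
# still treats them as negatives, so the loss cannot fall below log(2).
overlap_loss = contrastive_alignment_loss([[1.0, 0.0], [1.0, 0.0]],
                                          [[1.0, 0.0], [1.0, 0.0]])
```

In the second batch the features are already perfectly aligned, yet the loss stays at log(2) because the identity target penalizes the off-diagonal matches. That residual penalty is one way to read the semantic-overlap problem the abstract describes.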

Code & Models

Repositories

qingchen239/ld-detr · PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer