LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou

TL;DR
LD-DETR introduces a novel loop decoder transformer that enhances video moment retrieval and highlight detection by addressing semantic overlap, local feature extraction, and decoding limitations, achieving superior results on multiple benchmarks.
Contribution
The paper proposes LD-DETR, a new model that improves multimodal feature alignment, local feature extraction, and decoding efficiency for video understanding tasks.
Findings
Outperforms state-of-the-art models on the QVHighlights, Charades-STA, and TACoS datasets.
Effectively mitigates semantic overlap issues in datasets.
Enhances local feature extraction and decoding in multimodal video analysis.
Abstract
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model's multimodal alignment performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by existing models cannot adequately decode multimodal features. To address the above issues, we propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distill the similarity matrix into the identity matrix to…
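Issue (1) in the abstract can be illustrated with a small sketch. The code below is not the paper's method (the abstract is truncated before its distillation step is described); it only shows, under assumed toy features, why an identity-matrix target for a video-text similarity matrix is problematic when two samples in a batch share the same semantics. The function names, the temperature value, and the feature construction are all hypothetical choices made for illustration.

```python
import numpy as np

def similarity_matrix(video_feats, text_feats):
    """Cosine similarity between every video/text pair in a batch."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return v @ t.T  # shape (batch, batch)

def identity_target_loss(sim, temperature=0.07):
    """Row-wise cross-entropy between softmax(sim) and an identity target.

    The temperature is an assumed, commonly used value, not one taken
    from the paper.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = np.eye(sim.shape[0])  # only the paired sample counts as positive
    return float(-(target * log_probs).sum(axis=1).mean())

# Toy batch: four text queries with orthogonal features, and paired video
# features that sit close to their own query (hypothetical data).
text = np.eye(4, 8)
video = text + 0.05

loss_distinct = identity_target_loss(similarity_matrix(video, text))

# Make sample 1 a semantic duplicate of sample 0 (same query, same clip).
text_dup, video_dup = text.copy(), video.copy()
text_dup[1], video_dup[1] = text[0], video[0]

loss_overlap = identity_target_loss(similarity_matrix(video_dup, text_dup))

print(f"loss without overlap: {loss_distinct:.6f}")
print(f"loss with overlap:    {loss_overlap:.6f}")
```

In the overlapping batch, rows 0 and 1 each see two equally similar candidates, but the identity target marks only one of them as correct, so each of those rows contributes a loss of about log 2 even though the features are perfectly aligned. That residual penalty is the kind of alignment interference the abstract attributes to overlapping semantic information.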
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
