MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction
Jing Wang, Aixin Sun, Hao Zhang, and Xiaoli Li

TL;DR
MS-DETR is a proposal-based model for natural language video localization. It adapts the DETR framework to sample a subset of candidate moments and model moment-moment interactions among them efficiently, and the authors report superior results on three public datasets.
Contribution
The paper proposes MS-DETR, a novel sampling-based DETR framework that models moment-moment interactions for improved natural language video localization.
Findings
Reported to outperform existing methods on three public datasets.
Sampling a subset of candidate moments, instead of relating all candidates to each other, keeps moment-moment relation modeling efficient.
Effective cross-modal interaction modeling enhances localization accuracy.
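To make the efficiency finding concrete, here is a rough counting sketch. It is not the paper's own analysis. It assumes a video split into `num_clips` clips, a dense scheme that enumerates every (start, end) pair, and full pairwise attention among moments. The clip count of 64 and the sample size of 10 are illustrative values, not figures from the paper.

```python
def count_dense_moments(num_clips: int) -> int:
    """Count candidate moments (start <= end) over num_clips clips."""
    return num_clips * (num_clips + 1) // 2


def count_pairwise_interactions(num_moments: int) -> int:
    """Count ordered moment-moment pairs under full pairwise attention."""
    return num_moments * num_moments


if __name__ == "__main__":
    num_clips = 64    # assumed clip count, for illustration only
    num_sampled = 10  # assumed number of sampled moments, for illustration only

    dense = count_dense_moments(num_clips)
    print("dense moments:", dense)
    print("dense pairwise interactions:", count_pairwise_interactions(dense))
    print("sampled pairwise interactions:", count_pairwise_interactions(num_sampled))
```

Under these assumptions, dense enumeration yields 2,080 moments and about 4.3 million pairwise interactions, while 10 sampled moments need only 100.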
Abstract
Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query. In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best matching proposal. On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling. The core idea is to sample a subset of moments guided by the learnable templates with an adopted DETR (DEtection TRansformer) framework. To achieve this, we design a multi-scale visual-linguistic encoder, and an anchor-guided moment decoder paired with a set of learnable templates. Experimental results on three public datasets demonstrate the superior performance of MS-DETR.
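The generate-then-select pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the (center, width) moment parameterization, the function names `to_span` and `select_best`, and the hand-written scores are all assumptions. In the actual model, the scores would come from the learned cross-modal and moment-moment interaction.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end), normalized to [0, 1]


def to_span(center: float, width: float) -> Moment:
    """Convert an assumed (center, width) parameterization to a clamped (start, end) span."""
    start = max(0.0, center - width / 2)
    end = min(1.0, center + width / 2)
    return (start, end)


def select_best(moments: List[Moment], scores: List[float]) -> Moment:
    """Return the candidate moment with the highest matching score."""
    if not moments or len(moments) != len(scores):
        raise ValueError("moments and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return moments[best_index]


if __name__ == "__main__":
    # Hypothetical sampled moments and placeholder scores (not model outputs).
    anchors = [(0.25, 0.5), (0.5, 0.25), (0.75, 0.5)]
    moments = [to_span(center, width) for center, width in anchors]
    scores = [0.1, 0.7, 0.2]
    print(select_best(moments, scores))
```

With these placeholder scores, the second candidate is selected, giving the span (0.375, 0.625).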
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Absolute Position Encodings
