MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction

Jing Wang, Aixin Sun, Hao Zhang, and Xiaoli Li

arXiv:2305.18969 · cs.CV · August 22, 2023 · 1 citation

TL;DR

MS-DETR introduces a proposal-based, sampling moment-moment interaction model using DETR for efficient and accurate natural language video localization, achieving superior results on multiple datasets.

Contribution

The paper proposes MS-DETR, a novel sampling-based DETR framework that models moment-moment interactions for improved natural language video localization.

Findings

01. Outperforms existing methods on three public datasets.

02. Efficient sampling reduces computational complexity.

03. Effective cross-modal interaction modeling enhances localization accuracy.

Abstract

Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query. In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best matching proposal. On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling. The core idea is to sample a subset of moments guided by the learnable templates with an adopted DETR (DEtection TRansformer) framework. To achieve this, we design a multi-scale visual-linguistic encoder, and an anchor-guided moment decoder paired with a set of learnable templates. Experimental results on three public datasets demonstrate the superior performance of MS-DETR.
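
The proposal-then-select pattern described in the abstract can be illustrated with a toy sketch. This is not the MS-DETR implementation: the fixed `anchors` list only stands in for what the abstract calls learnable templates, and `toy_score` is a hypothetical placeholder for a learned query-moment matching score. All names here (`sample_moments`, `select_best`, `toy_score`) are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def sample_moments(duration: float,
                   anchors: List[Tuple[float, float]]) -> List[Moment]:
    """Turn normalized (center, width) anchors into moments in seconds.

    The anchors are hand-picked here; in a trained model they would be
    learned rather than fixed (assumption for this sketch).
    """
    moments = []
    for center, width in anchors:
        start = max(0.0, (center - width / 2) * duration)
        end = min(duration, (center + width / 2) * duration)
        if end > start:  # drop degenerate moments
            moments.append((start, end))
    return moments


def select_best(moments: List[Moment],
                score_fn: Callable[[Moment], float]) -> Moment:
    """Return the candidate moment with the highest matching score."""
    return max(moments, key=score_fn)


def toy_score(moment: Moment) -> float:
    """Hypothetical score: prefer moments whose midpoint is near 45 s."""
    midpoint = (moment[0] + moment[1]) / 2
    return -abs(midpoint - 45.0)


# Three hand-picked anchors over a 100-second video.
anchors = [(0.25, 0.5), (0.5, 0.25), (0.75, 0.5)]
candidates = sample_moments(100.0, anchors)
best = select_best(candidates, toy_score)
print(candidates)
print(best)
```

Scoring only a small sampled subset of moments, rather than every possible (start, end) pair, is what keeps the number of candidates to compare small in a sketch like this.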

Code & Models

Repositories

k-nick/ms-detr (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Absolute Position Encodings