MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

Da Zhang; Xiyang Dai; Xin Wang; Yuan-Fang Wang; Larry S. Davis

arXiv:1812.00087·cs.CV·May 21, 2019·34 cites

TL;DR

This paper introduces MAN, a network for natural language moment retrieval in videos that explicitly models temporal relations between candidate moments through iterative graph adjustment and outperforms the state of the art on the DiDeMo and Charades-STA benchmarks.

Contribution

The paper proposes a unified framework that combines candidate encoding and temporal reasoning, explicitly models moment relations as a graph, and introduces an iterative adjustment mechanism.
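
The idea of treating candidate moments as graph nodes and repeatedly adjusting the edges can be sketched roughly as follows. This is a minimal illustration, not the paper's formulation: the function names, the scaled dot-product similarity, the row-softmax normalization, and the residual feature update are all assumptions chosen to keep the example small.

```python
import numpy as np


def row_softmax(scores):
    """Normalize each row of a score matrix into a probability distribution."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def iterative_graph_adjustment(moments, weights):
    """Hypothetical sketch of iterative graph adjustment over candidate moments.

    moments: (N, D) array, one feature vector per candidate moment.
    weights: list of (D, D) arrays, one per adjustment step (assumed learned).
    Returns the updated moment features and the final (N, N) adjacency.
    """
    num_moments, dim = moments.shape
    features = moments
    # Assumption: start from a graph where each moment is linked only to itself.
    adjacency = np.eye(num_moments)
    for weight in weights:
        # Assumption: edges are adjusted by the current pairwise similarity.
        similarity = features @ features.T / np.sqrt(dim)
        adjacency = row_softmax(adjacency + similarity)
        # Graph-convolution-style update with a residual connection.
        message = np.maximum(adjacency @ features @ weight, 0.0)
        features = features + message
    return features, adjacency


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidate_moments = rng.normal(size=(5, 4))  # 5 moments, 4-dim features
    step_weights = [rng.normal(size=(4, 4)) * 0.1 for _ in range(3)]
    updated, graph = iterative_graph_adjustment(candidate_moments, step_weights)
    print(updated.shape, graph.shape)
```

Each pass lets every candidate moment aggregate information from related moments, and the relations themselves are re-estimated from the updated features, which is the sense in which the graph is adjusted iteratively in this sketch.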

Findings

01. MAN outperforms the state of the art on the DiDeMo and Charades-STA benchmarks.

02. Explicit graph modeling improves temporal relation understanding.

03. End-to-end training enhances overall retrieval accuracy.

Abstract

This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization