MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment
Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, Larry S. Davis

TL;DR
This paper introduces MAN, a network for natural language moment retrieval in videos that explicitly models temporal relations between candidate moments through iterative graph adjustment and reports improvements over prior methods on DiDeMo and Charades-STA.
Contribution
The paper proposes a unified framework that combines candidate encoding and temporal reasoning, explicitly models moment relations as a graph, and introduces an iterative adjustment mechanism.
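To make the idea of iteratively adjusting a moment graph more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `iterative_graph_adjustment`, the uniform initial graph, the scaled dot-product similarity, and the tanh and ReLU nonlinearities are all assumptions chosen for illustration. Each candidate moment is treated as a graph node, edge weights are updated with a residual computed from the current node features, and a graph convolution then refines the features.

```python
import numpy as np


def iterative_graph_adjustment(moments, weights, adjacency=None):
    """Illustrative sketch only (hypothetical helper, not the paper's code).

    moments:   (n, d) array, one feature vector per candidate moment.
    weights:   list of (d, d) projection matrices, one per layer.
    adjacency: optional (n, n) initial edge weights between moments.
    """
    n, d = moments.shape
    if adjacency is None:
        # Assumption: start from a fully connected graph with uniform edges.
        adjacency = np.full((n, n), 1.0 / n)

    for w in weights:
        # Adjust the graph: add a residual derived from pairwise feature
        # similarity, then squash the edge weights into [-1, 1].
        similarity = moments @ moments.T / np.sqrt(d)
        adjacency = np.tanh(adjacency + similarity)

        # Graph convolution: aggregate neighbouring moments along the
        # adjusted edges, project, and apply ReLU.
        moments = np.maximum(adjacency @ moments @ w, 0.0)

    return moments, adjacency


# Toy usage with random features: 6 candidate moments, 8-dim features, 3 layers.
rng = np.random.default_rng(0)
toy_moments = rng.normal(size=(6, 8))
toy_weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(3)]
refined, graph = iterative_graph_adjustment(toy_moments, toy_weights)
```

In this sketch the graph is recomputed at every layer rather than fixed in advance, so relations between moments can change as their representations are refined. In a trained model the projection matrices would be learned end to end together with the moment encoder.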
Findings
MAN outperforms the prior state of the art on the DiDeMo and Charades-STA benchmarks.
Explicit graph modeling improves temporal relation understanding.
End-to-end training enhances overall retrieval accuracy.
Abstract
This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
