Language Guided Networks for Cross-modal Moment Retrieval
Kun Liu, Huadong Ma, and Chuang Gan

TL;DR
This paper introduces Language Guided Networks (LGN), a framework for cross-modal moment retrieval that uses the sentence embedding to guide both visual feature extraction and localization, and reports improved retrieval accuracy on the Charades-STA and TACoS datasets.
Contribution
The paper proposes a new LGN framework that leverages sentence embeddings throughout the retrieval process, including feature modulation and localization, which enhances semantic alignment between vision and language.
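To make "feature modulation" concrete, here is a minimal sketch of one common way to condition visual features on a sentence embedding: a per-channel scale and shift predicted from the sentence (often called FiLM-style modulation). This is an illustrative assumption, not the paper's modulation unit, whose design is not described on this page. The names `linear` and `modulate` and all weight parameters are hypothetical.

```python
def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def modulate(clip_features, sentence_embedding,
             w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each clip feature using the sentence embedding.

    clip_features: list of T clip vectors, each with C channels.
    sentence_embedding: list of D floats.
    w_gamma, w_beta: C x D matrices; b_gamma, b_beta: length-C vectors.
    Returns a new list of T vectors with C channels each.
    (Assumed FiLM-style conditioning, not the authors' exact unit.)
    """
    gamma = linear(w_gamma, b_gamma, sentence_embedding)  # per-channel scale
    beta = linear(w_beta, b_beta, sentence_embedding)     # per-channel shift
    return [[g * v + b for g, v, b in zip(gamma, clip, beta)]
            for clip in clip_features]


if __name__ == "__main__":
    # Toy example: 2 clips, 2 channels, 2-dimensional sentence embedding.
    clips = [[1.0, 1.0], [2.0, 3.0]]
    sentence = [1.0, 2.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(modulate(clips, sentence, identity, [0.0, 0.0], zeros, [1.0, 1.0]))
```

In a trained model the weight matrices would be learned jointly with the visual backbone, so the same clip can be emphasized or suppressed depending on the query.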
Findings
Improved retrieval accuracy on Charades-STA and TACoS datasets.
Effective use of sentence guidance in feature extraction and localization.
Demonstrated superiority over existing methods.
Abstract
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between the visual and linguistic domains. Existing methods independently extract the features of videos and sentences and utilize the sentence embedding only in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval. In the first feature extraction stage, we propose to jointly learn visual and language features to capture the powerful visual information which can cover the complex semantics in the sentence query. Specifically, the early modulation unit is designed to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
