Visual Answer Localization with Cross-modal Mutual Knowledge Transfer
Yixuan Weng, Bin Li

TL;DR
This paper introduces MutualSL, a cross-modal mutual knowledge transfer method for visual answer localization in videos, reducing cross-modal knowledge deviations and improving accuracy over state-of-the-art approaches.
Contribution
The paper proposes a novel MutualSL framework with dynamic loss to enhance cross-modal knowledge transfer and consistency in visual answer localization.
Findings
Outperforms state-of-the-art methods on three datasets
Reduces cross-modal knowledge deviation effectively
Improves semantic understanding between video and text modalities
Abstract
The goal of visual answer localization (VAL) in video is to obtain a relevant and concise time clip from a video as the answer to a given natural language question. Early methods model the interaction between video and text and predict the visual answer with a visual predictor. Later work showed that using a textual predictor with subtitles for VAL is more precise. However, these existing methods still suffer from cross-modal knowledge deviations arising from visual frames or textual subtitles. In this paper, we propose a cross-modal mutual knowledge transfer span localization (MutualSL) method to reduce the knowledge deviation. MutualSL has both a visual predictor and a textual predictor, and we expect the predictions of the two to be consistent, so as to promote semantic knowledge understanding across modalities. On this basis, we design a one-way dynamic loss…
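The consistency idea in the abstract (two predictors whose outputs should agree) can be illustrated with a minimal sketch. It is not the paper's loss function. It assumes, purely for illustration, that the visual and textual predictors each score the same set of time bins for the answer's start boundary, and it uses a symmetric KL divergence as a stand-in agreement measure. The names softmax, kl_divergence, and consistency_penalty, and all the scores, are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_penalty(visual_scores, textual_scores):
    """Symmetric KL between the two predictors' boundary distributions.

    Assumption: both score lists are aligned to the same time bins.
    The value is 0 when the distributions match and grows as they diverge.
    """
    p = softmax(visual_scores)
    q = softmax(textual_scores)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical start-boundary scores over five time bins.
visual_start = [0.1, 2.0, 0.3, 0.0, -1.0]
textual_start = [0.0, 1.8, 0.5, 0.1, -0.9]   # peaks at the same bin
shifted_start = [-1.0, 0.0, 0.3, 2.0, 0.1]   # peaks at a different bin

agree = consistency_penalty(visual_start, textual_start)
disagree = consistency_penalty(visual_start, shifted_start)
print(f"agreeing predictors:    {agree:.4f}")
print(f"disagreeing predictors: {disagree:.4f}")
```

In a training setup of this kind, such a penalty would typically be added to each predictor's own localization loss, so that disagreement between the modalities is discouraged alongside errors against the ground-truth span.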
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
