Visual Answer Localization with Cross-modal Mutual Knowledge Transfer

Yixuan Weng; Bin Li

arXiv:2210.14823 · cs.CV · October 31, 2022

Open Access · 1 Repo

TL;DR

This paper introduces MutualSL, a cross-modal mutual knowledge transfer method for visual answer localization in videos, reducing cross-modal knowledge deviations and improving accuracy over state-of-the-art approaches.

Contribution

The paper proposes MutualSL, a framework that pairs a visual predictor with a textual predictor and uses a one-way dynamic loss to strengthen cross-modal knowledge transfer and keep the two predictions consistent in visual answer localization.

Findings

01 Outperforms state-of-the-art methods on three datasets

02 Reduces cross-modal knowledge deviation effectively

03 Improves semantic understanding between video and text modalities

Abstract

The goal of visual answer localization (VAL) in video is to extract a relevant and concise time clip from a video as the answer to a given natural language question. Early methods model the interaction between video and text and predict the visual answer with a visual predictor. Later work showed that a textual predictor operating on subtitles is more precise for VAL. However, these existing methods still suffer from cross-modal knowledge deviations arising from visual frames or textual subtitles. In this paper, we propose a cross-modal mutual knowledge transfer span localization (MutualSL) method to reduce this knowledge deviation. MutualSL has both a visual predictor and a textual predictor, and we expect the predictions of the two to be consistent, so as to promote semantic knowledge understanding across modalities. On this basis, we design a one-way dynamic loss…
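As a rough illustration of what it means for two predictors to agree on a time clip, the sketch below computes the temporal intersection-over-union (IoU) between two predicted spans. This is a generic overlap measure, not the loss or evaluation code from the paper; the function name `temporal_iou` and the example span values are assumptions made only for this example.

```python
def temporal_iou(span_a, span_b):
    """Intersection-over-union of two (start, end) time spans, in seconds."""
    start_a, end_a = span_a
    start_b, end_b = span_b
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical spans: one from a visual predictor, one from a textual predictor.
visual_span = (12.0, 30.0)
textual_span = (15.0, 33.0)
agreement = temporal_iou(visual_span, textual_span)
print(f"agreement (temporal IoU): {agreement:.3f}")
```

A value near 1 means the two predictors select almost the same clip, while a value near 0 means they disagree about where the answer lies.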

Code & Models

Repositories

wengsyx/mutualsl
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training