Learning to Locate Visual Answer in Video Corpus Using Question
Bin Li, Yixuan Weng, Bin Sun, Shutao Li

TL;DR
This paper introduces VCVAL, a new task for locating visual answers in large collections of untrimmed instructional videos using natural language questions, and proposes a cross-modal contrastive global-span (CCGS) method that jointly trains video corpus retrieval and visual answer localization.
Contribution
The paper presents a new task, VCVAL; a CCGS method that addresses it; and a reconstructed dataset, MedVidCQA, on which the task is benchmarked.
Findings
The proposed CCGS method outperforms other competitive methods on both the video corpus retrieval and visual answer localization subtasks.
The MedVidCQA dataset provides a benchmark for VCVAL.
Extensive experiments validate the effectiveness of the approach.
Abstract
We introduce a new task, video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed instructional videos using a natural language question. This task requires a range of skills: interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks with the global-span matrix. We have reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods in both the video corpus retrieval and visual answer localization subtasks. Most importantly, we perform detailed analyses on extensive experiments,…
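The abstract names a cross-modal contrastive objective but this excerpt does not give its form. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss between question and video embeddings, a common choice for cross-modal contrastive training. It is an assumption, not the paper's CCGS loss: the function name `symmetric_info_nce`, the temperature of 0.07, and the use of pre-computed embedding vectors are all hypothetical, and the global-span matrix is not modeled here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(question_embs, video_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative assumption, not CCGS).

    question_embs[i] and video_embs[i] form a positive pair; every other
    pairing in the batch is treated as a negative.
    """
    q = [l2_normalize(x) for x in question_embs]
    v = [l2_normalize(x) for x in video_embs]
    n = len(q)
    # Temperature-scaled cosine similarity between every question and video.
    sims = [[dot(q[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    # Question-to-video direction: each row should peak on its diagonal entry.
    q2v = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Video-to-question direction: same computation on the transposed matrix.
    sims_t = [list(col) for col in zip(*sims)]
    v2q = -sum(log_softmax(sims_t[j])[j] for j in range(n)) / n
    return 0.5 * (q2v + v2q)
```

With toy inputs, a batch whose questions and videos are correctly paired yields a lower loss than the same batch with the videos shuffled, which is the property a retrieval model trained this way relies on.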
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
