Learning to Answer Visual Questions from Web Videos

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

arXiv:2205.05019 · cs.CV · May 12, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a scalable method for video question answering (VideoQA). It automatically generates a large training dataset from narrated web videos, using cross-modal supervision and a question generation transformer, which enables zero-shot VideoQA and handles a diverse, open vocabulary of answers.

Contribution

The paper presents an automatic dataset creation approach for VideoQA: a question generation transformer trained on text data produces question-answer pairs from transcribed video narrations, avoiding manual annotation.

Findings

01

Generated the HowToVQA69M dataset with 69 million video-question-answer triplets.

02

Achieved state-of-the-art results on zero-shot VideoQA tasks.

03

Demonstrated the method's generalization to other web video datasets.

Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA…
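The contrastive training idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that a contrastive loss is applied between a video-question multi-modal transformer and an answer transformer. Everything else here is an assumption made for illustration: the dot-product scoring, the softmax cross-entropy over the other answers in the batch, the function names `contrastive_loss` and `predict_answer`, and the toy two-dimensional embeddings that stand in for the two transformers' outputs. It is not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over a batch (assumed form of the loss).

    Row i of vq_embeddings is assumed to match row i of answer_embeddings;
    all other answers in the batch act as negatives.
    """
    total = 0.0
    for i, vq in enumerate(vq_embeddings):
        scores = [dot(vq, a) for a in answer_embeddings]
        # Numerically stable log-sum-exp of the scores.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(vq_embeddings)


def predict_answer(vq, answer_embeddings):
    """Return the index of the candidate answer with the highest score."""
    scores = [dot(vq, a) for a in answer_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings standing in for the outputs of the two transformers.
vq_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[2.0, 0.0], [0.0, 2.0]]    # answer i matches video-question i
shuffled_answers = [[0.0, 2.0], [2.0, 0.0]]   # answers paired incorrectly

loss_aligned = contrastive_loss(vq_batch, aligned_answers)
loss_shuffled = contrastive_loss(vq_batch, shuffled_answers)
```

Because answers are scored through their embeddings rather than through a fixed classification layer, any answer that can be embedded can be ranked at inference time, which is one way to accommodate an open answer vocabulary. In the toy data, the correctly paired batch yields a lower loss than the shuffled one.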

Code & Models

Repositories

antoyang/just-ask
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling