On the hidden treasure of dialog in video question answering

Deniz Engin; François Schnitzler; Ngoc Q. K. Duong; Yannis Avrithis

arXiv:2103.14517·cs.CV·August 20, 2021

TL;DR

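This paper introduces a VideoQA approach that treats dialog as a noisy source and converts it into text descriptions via dialog summarization, allowing it to understand stories without external human-made sources. It outperforms the state of the art on the KnowIT VQA dataset and even surpasses human evaluators unfamiliar with the episodes.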

Contribution

It presents a new method that converts dialog into text descriptions for story understanding in VideoQA, eliminating the need for external plot summaries or annotations.

Findings

01 Outperforms the state of the art on the KnowIT VQA dataset

02 Surpasses human evaluators unfamiliar with the episodes

03 Uses transformer-based encoding and simple fusion for multimodal inputs

Abstract

High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources like plot synopses, scripts, video descriptions or knowledge bases. In this work, we present a new approach to understand the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into text description via dialog summarization, much like recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made…
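
To make the last architectural step of the abstract more concrete, the sketch below shows one plausible reading of "soft temporal attention" followed by "simple fusion". It is not the authors' implementation. It assumes that a long input is split into temporal windows, that each window yields one score per candidate answer plus a relevance logit, and that fusion is a plain sum of per-modality answer scores. All function names, the relevance logits, and the sum-based fusion are hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_temporal_attention(window_scores, window_relevance):
    """Pool per-window answer scores into one score per answer.

    window_scores[w][a] is the score of candidate answer a from window w.
    window_relevance[w] is a (hypothetical) relevance logit for window w.
    Windows are weighted by a softmax over relevance, so no hard choice of
    a single window is made.
    """
    weights = softmax(window_relevance)
    num_answers = len(window_scores[0])
    return [
        sum(w * scores[a] for w, scores in zip(weights, window_scores))
        for a in range(num_answers)
    ]


def fuse(modality_scores):
    """Assumed 'simple fusion': sum answer scores across modalities."""
    num_answers = len(modality_scores[0])
    return [sum(m[a] for m in modality_scores) for a in range(num_answers)]


def predict(modality_scores):
    """Index of the highest-scoring candidate answer after fusion."""
    fused = fuse(modality_scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy usage: a long dialog-summary stream split into two windows,
# combined with a second modality that was already pooled.
dialog_scores = soft_temporal_attention(
    window_scores=[[0.0, 1.0], [2.0, 0.0]],
    window_relevance=[0.0, 0.0],
)
video_scores = [0.0, 2.0]
answer_index = predict([dialog_scores, video_scores])
```

Because the window weights come from a softmax, every window contributes to the pooled score, which is the sense in which the attention is "soft" in this sketch. A learned model would produce the relevance logits and answer scores from its transformer encoders rather than taking them as fixed inputs.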

Code & Models

Repositories

InterDigitalInc/DialogSummary-VideoQA (PyTorch, official)
