TL;DR
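This paper introduces a VideoQA approach that treats dialog as a noisy source and converts it into text descriptions via dialog summarization, so the story can be understood without external human-made sources. It outperforms the state of the art on the KnowIT VQA dataset and even human evaluators who are unfamiliar with the episodes.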
Contribution
It presents a new method that converts dialog into text descriptions for story understanding in VideoQA, eliminating the need for external plot summaries or annotations.
Findings
Outperforms the state of the art on the KnowIT VQA dataset by a large margin
Surpasses human evaluators who are unfamiliar with the episodes
Encodes each modality independently with transformers, then combines them with a simple fusion method and soft temporal attention over long inputs
Abstract
High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources like plot synopses, scripts, video descriptions or knowledge bases. In this work, we present a new approach to understand the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into text description via dialog summarization, much like recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made…
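The pooling and fusion steps mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each modality has already produced one score per candidate answer, that a long dialog summary is split into chunks with a relevance logit per chunk, and that fusion is a plain weighted sum. The function names (`soft_temporal_attention`, `fuse`) and all numbers are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_temporal_attention(chunk_scores: Sequence[Sequence[float]],
                            chunk_relevance: Sequence[float]) -> List[float]:
    """Pool per-chunk answer scores using softmax weights over chunks.

    chunk_scores[t][a] is the score of candidate answer a from chunk t.
    chunk_relevance[t] is an (assumed) relevance logit for chunk t.
    """
    weights = softmax(chunk_relevance)
    n_answers = len(chunk_scores[0])
    return [sum(w * row[a] for w, row in zip(weights, chunk_scores))
            for a in range(n_answers)]


def fuse(modality_scores: Sequence[Sequence[float]],
         modality_weights: Sequence[float]) -> List[float]:
    """Hypothetical late fusion: weighted sum of per-modality answer scores."""
    n_answers = len(modality_scores[0])
    return [sum(w * scores[a]
                for w, scores in zip(modality_weights, modality_scores))
            for a in range(n_answers)]


# Toy inputs: four candidate answers, three summary chunks (made-up values).
summary_chunks = [[0.1, 0.2, 0.9, 0.1],
                  [0.3, 0.3, 0.2, 0.2],
                  [0.0, 0.1, 0.8, 0.1]]
relevance = [2.0, -1.0, 1.5]  # chunks 0 and 2 are treated as most relevant

summary_scores = soft_temporal_attention(summary_chunks, relevance)
video_scores = [0.2, 0.1, 0.6, 0.1]
subtitle_scores = [0.4, 0.2, 0.3, 0.1]

fused = fuse([video_scores, subtitle_scores, summary_scores], [1.0, 1.0, 1.0])
predicted = max(range(len(fused)), key=fused.__getitem__)
print("predicted answer index:", predicted)
```

Soft attention keeps every chunk in play with a continuous weight instead of committing to a single chunk, which is one plausible reason to prefer it for localization over long inputs.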
