TL;DR
This paper introduces ROLL, a model for knowledge-based video story question answering that combines dialogue comprehension, unsupervised scene descriptions, and external knowledge, achieving state-of-the-art results on the KnowIT VQA and TVQA+ datasets.
Contribution
The paper presents ROLL, a novel multi-task framework that integrates dialogue, scene descriptions, and external knowledge for improved video QA performance.
Findings
Achieves state-of-the-art results on KnowIT VQA and TVQA+ datasets.
Encodes each information source with Transformers and fuses them through a modality weighting mechanism that balances their contributions.
Demonstrates the importance of unsupervised scene descriptions and external knowledge in video understanding.
Abstract
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialogue comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognitively inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the…
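The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the modality weighting mechanism is computed, so everything below is an assumption made for illustration: the function name `fuse_branch_scores`, the branch names, the multiple-choice setup with one score per candidate answer, and the choice of a softmax-normalized weighted sum. It is not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branch_scores(branch_scores, branch_weights):
    """Combine per-branch answer scores into one score per candidate answer.

    branch_scores:  dict mapping a branch name to a list of scores,
                    one per candidate answer (hypothetical layout).
    branch_weights: dict mapping a branch name to an unnormalized weight;
                    in a trained model these would be learned (assumption).
    """
    names = sorted(branch_scores)
    # Normalize the weights so the branches' contributions sum to 1.
    weights = softmax([branch_weights[name] for name in names])
    n_answers = len(branch_scores[names[0]])
    fused = [0.0] * n_answers
    for weight, name in zip(weights, names):
        for i, score in enumerate(branch_scores[name]):
            fused[i] += weight * score
    return fused


# Hypothetical scores for three candidate answers from the three branches.
scores = {
    "dialogue": [0.3, 0.9, 0.0],
    "scene": [0.6, 0.3, 0.0],
    "knowledge": [0.0, 0.9, 0.3],
}
equal_weights = {"dialogue": 0.0, "scene": 0.0, "knowledge": 0.0}
fused = fuse_branch_scores(scores, equal_weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)
```

With equal weights the fused score is just the mean across branches. Raising one branch's weight shifts the prediction toward that branch, which is the balancing behaviour the abstract attributes to the weighting mechanism.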
