FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos
Zhengqian Wu, Ruizhe Li, Zijun Xu, Zhongyuan Wang, Chunxia Xiao, Chao Liang

TL;DR
This paper introduces FriendsQA, a large-scale deep video understanding dataset with fine-grained topic categorization for story videos, enabling better assessment of VideoQA models' comprehension of complex storylines.
Contribution
It presents a novel dataset created using a language model-based framework, with detailed topic annotations, to evaluate deep video understanding in story videos.
Findings
State-of-the-art models show varied performance on FriendsQA.
The dataset reveals challenges in deep understanding of complex storylines.
FriendsQA enables more comprehensive evaluation of VideoQA models.
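As a rough illustration of what topic-organized evaluation could look like, the sketch below computes accuracy separately for each story topic. The record layout (`topic`, `predicted`, `answer`) and the sample values are hypothetical, chosen only for this example; they are not the dataset's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_topic(records):
    """Return {topic: accuracy} for an iterable of QA records.

    Each record is a dict with hypothetical keys:
    'topic' (story topic label), 'predicted' (model choice), 'answer' (gold choice).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["topic"]] += 1
        if record["predicted"] == record["answer"]:
            correct[record["topic"]] += 1
    # Every topic in `total` has at least one record, so no division by zero.
    return {topic: correct[topic] / total[topic] for topic in total}


# Made-up records, for illustration only.
sample = [
    {"topic": "character", "predicted": "A", "answer": "A"},
    {"topic": "character", "predicted": "B", "answer": "C"},
    {"topic": "action", "predicted": "B", "answer": "B"},
    {"topic": "location", "predicted": "D", "answer": "A"},
]

per_topic = accuracy_by_topic(sample)
```

Reporting one score per topic rather than a single aggregate makes it visible when a model handles, say, character questions well but struggles with locations.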
Abstract
Video question answering (VideoQA) aims to answer natural language questions about given videos. Although existing models perform well on the factoid VideoQA task, they still face challenges in the deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is their storylines, which are composed of complex interactions and the long-range evolution of core story topics, including characters, actions, and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, making it difficult to use them to comprehensively assess VideoQA models' DVU capability on complex storylines. Additionally, the question quantity and video length of these datasets are limited by the high labor costs of handcrafted dataset-building methods.…
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
