WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christopher Arif Setiadharma, Jingkang Yang, Ziwei Liu

TL;DR
WorldQA is a challenging multimodal video question-answering dataset that emphasizes long-chain reasoning and world knowledge, revealing current models' limitations and guiding future improvements in multimodal understanding.
Contribution
The paper introduces WorldQA, a novel dataset with long-chain reasoning and multimodal inputs, and proposes WorldRetriever, a model that synthesizes expert knowledge for improved reasoning.
Findings
WorldRetriever achieves 70% of human performance on multiple-choice questions.
Model performance degrades as more video frames are provided, whereas human performance improves.
Long-chain reasoning significantly challenges current multimodal models.
Abstract
Multimodal information, together with our knowledge, helps us understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset has an average of 4.45 reasoning steps per question, notably surpassing other videoQA datasets. Furthermore, we introduce…
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · AI-based Problem Solving and Planning
