SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

Andong Wang; Bo Wu; Sunli Chen; Zhenfang Chen; Haotian Guan; Wei-Ning Lee; Li Erran Li; Chuang Gan

arXiv:2405.09713 · cs.CV · May 20, 2024

TL;DR

SOK-Bench is a video reasoning benchmark with 44K questions that tests understanding of situated and open-world knowledge. It was generated using large language models and multimodal models.

Contribution

The paper introduces SOK-Bench, a novel large-scale benchmark for evaluating situated video reasoning with integrated open-world knowledge, generated through an automated LLM/MLLM-based process.

Findings

01

Recent vision-language models show varied performance on SOK-Bench.

02

The benchmark reveals gaps in current models' reasoning abilities.

03

SOK-Bench enables more nuanced evaluation of commonsense reasoning in videos.

Abstract

Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically within dynamic, open-world, and structured context knowledge. We propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset, we propose an automatic and scalable generation method to generate question-answer pairs, knowledge graphs, and rationales by instructing the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)