STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan

TL;DR
The paper introduces STAR, a new benchmark dataset for evaluating situated reasoning in real-world videos, emphasizing situation abstraction and logic-grounded question answering to advance machine intelligence.
Contribution
It presents a novel benchmark with a hyper-graph-based representation of situations, procedural question generation, and a diagnostic neuro-symbolic model for better understanding the challenges of situated reasoning.
Findings
Existing models perform poorly on the benchmark.
The neuro-symbolic model improves reasoning performance.
The dataset highlights the complexity of situated reasoning in videos.
Abstract
Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual…
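The hyper-graph idea in the abstract can be illustrated with a small sketch. This is not the authors' data format or code: the class names (`Hyperedge`, `SituationHypergraph`), the method names, and the example labels (`person1`, `cup`, `take_cup`, and so on) are all hypothetical, chosen only to show how one action could connect several atomic entities and relations at once.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """One hyperedge: an action label connecting any number of nodes."""
    action: str
    nodes: frozenset


@dataclass
class SituationHypergraph:
    """Toy situation graph (hypothetical structure, not the STAR format)."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_action(self, action, triples):
        """Add an action as a hyperedge over (person, relationship, object) triples."""
        members = set()
        for person, relationship, obj in triples:
            members.update({
                ("person", person),
                ("relationship", relationship),
                ("object", obj),
            })
        self.nodes |= members
        self.edges.append(Hyperedge(action, frozenset(members)))

    def actions_involving(self, kind, name):
        """Return the actions, in insertion order, whose hyperedge contains the node."""
        return [e.action for e in self.edges if (kind, name) in e.nodes]


# Hypothetical example situation: a person takes a cup, then drinks from it.
graph = SituationHypergraph()
graph.add_action("take_cup", [("person1", "holding", "cup")])
graph.add_action(
    "drink_from_cup",
    [("person1", "holding", "cup"), ("person1", "in_front_of", "table")],
)

print(graph.actions_involving("object", "cup"))  # ['take_cup', 'drink_from_cup']
```

A query such as `actions_involving` loosely mirrors an interaction-style question ("what did the person do with the cup?"); the benchmark's actual question programs are not reproduced here.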
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
