STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu; Shoubin Yu; Zhenfang Chen; Joshua B. Tenenbaum; Chuang Gan

arXiv:2405.09711 · cs.AI · May 17, 2024 · 22 cites

Open Access

TL;DR

The paper introduces STAR, a new benchmark dataset for evaluating situated reasoning in real-world videos, emphasizing situation abstraction and logic-grounded question answering to advance machine intelligence.

Contribution

It presents a novel benchmark with a hyper-graph-based representation of situations, procedural question generation, and a diagnostic neuro-symbolic model for analyzing the challenges of situated reasoning.

Findings

01 Existing models perform poorly on the benchmark.

02 The neuro-symbolic model improves reasoning performance.

03 The dataset highlights the complexity of situated reasoning in videos.

Abstract

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon the real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual…
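
The hyper-graph idea in the abstract can be illustrated with a small sketch. This is not the STAR data format or the authors' code: the class names, the relation labels ("holding", "put_down", "on"), and the entities are all hypothetical, chosen only to show how a single hyperedge can connect more than two entities at once, which an ordinary graph edge cannot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """One relation connecting any number of entity nodes (hypothetical type)."""
    label: str        # name of an action or relationship
    nodes: frozenset  # the entities this relation connects


@dataclass
class SituationHypergraph:
    """Toy container for one situation; not the STAR annotation schema."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_relation(self, label, *entities):
        # Register the entities and record one hyperedge over all of them.
        self.nodes.update(entities)
        self.edges.append(Hyperedge(label, frozenset(entities)))

    def relations_of(self, entity):
        # Labels of every hyperedge that touches the entity, in insertion order.
        return [edge.label for edge in self.edges if entity in edge.nodes]


# Illustrative situation with made-up entities and relations.
graph = SituationHypergraph()
graph.add_relation("holding", "person", "cup")
graph.add_relation("put_down", "person", "cup", "table")  # three-way relation
graph.add_relation("on", "cup", "table")
```

Calling graph.relations_of("cup") collects every relation the cup takes part in. A lookup of this kind is one plausible building block for answering an interaction-type question, though the benchmark's actual question programs are not described in this excerpt.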

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition