Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

Lili Liang; Guanglu Sun; Jin Qiu; Lizhong Zhang

arXiv:2404.04007·cs.CV·April 8, 2024·2 cites

Open Access

TL;DR

This paper introduces NS-VideoQA, a neural-symbolic framework for compositional spatio-temporal reasoning in real-world VideoQA. It transforms videos into symbolic representations and answers questions through top-down question decomposition and bottom-up compositional reasoning, improving reasoning accuracy on the AGQA Decomp benchmark.

Contribution

It presents a novel neural-symbolic approach with a Scene Parser Network and a Symbolic Reasoning Machine for better reasoning in VideoQA tasks, enabling step-by-step analysis.

Findings

01 Improves reasoning accuracy on the AGQA Decomp benchmark

02 Enables step-by-step error analysis

03 Enhances logical inference capabilities

Abstract

Compositional spatio-temporal reasoning poses a significant challenge in the field of video question answering (VideoQA). Existing approaches struggle to establish effective symbolic reasoning structures, which are crucial for answering compositional spatio-temporal questions. To address this challenge, we propose a neural-symbolic framework called Neural-Symbolic VideoQA (NS-VideoQA), specifically designed for real-world VideoQA tasks. The uniqueness and superiority of NS-VideoQA are two-fold: 1) It proposes a Scene Parser Network (SPN) to transform static-dynamic video scenes into Symbolic Representation (SR), structuralizing persons, objects, relations, and action chronologies. 2) A Symbolic Reasoning Machine (SRM) is designed for top-down question decompositions and bottom-up compositional reasonings. Specifically, a polymorphic program executor is constructed for internally…
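The split between a symbolic representation and a program executor can be illustrated with a minimal sketch. This is not the authors' SPN or SRM: the data classes, the three operations (`localize`, `query_objects`, `exists`), and the toy clip are all hypothetical, chosen only to show how a decomposed question can be executed step by step over relation triples and action intervals while keeping every intermediate result for inspection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical per-frame triple, e.g. (person, holding, cup) at frame 0."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Action:
    """Hypothetical action interval, inclusive of both start and end frames."""
    name: str
    start: int
    end: int


# Toy symbolic representation of one clip (made-up data, not from the paper).
RELATIONS = [
    Relation(0, "person", "holding", "cup"),
    Relation(1, "person", "holding", "cup"),
    Relation(2, "person", "touching", "door"),
]
ACTIONS = [Action("drinking", 0, 1), Action("opening_door", 2, 3)]


def frames_of(action_name, actions):
    """Return the set of frame indices covered by the named action."""
    frames = set()
    for action in actions:
        if action.name == action_name:
            frames.update(range(action.start, action.end + 1))
    return frames


def objects_with(predicate, frames, relations):
    """Return the sorted objects linked by `predicate` inside `frames`."""
    return sorted(
        {r.obj for r in relations if r.predicate == predicate and r.frame in frames}
    )


def execute(program, relations, actions):
    """Run (op, arg) steps in order and record each intermediate result."""
    state = None
    trace = []
    for op, arg in program:
        if op == "localize":
            state = frames_of(arg, actions)
        elif op == "query_objects":
            state = objects_with(arg, state, relations)
        elif op == "exists":
            state = arg in state
        else:
            raise ValueError(f"unknown operation: {op}")
        trace.append((op, arg, state))
    return state, trace


# "Was the person holding a cup while drinking?" as a hand-written program.
PROGRAM = [("localize", "drinking"), ("query_objects", "holding"), ("exists", "cup")]
answer, trace = execute(PROGRAM, RELATIONS, ACTIONS)

for op, arg, result in trace:
    print(op, arg, result)
```

Because each step's result is stored in `trace`, a wrong final answer can be traced back to the sub-question that failed, which is the kind of step-by-step error analysis the findings above refer to.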

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling