Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

TL;DR
This paper introduces HOSTR, a hierarchical object-oriented neural network that models dynamic objects and their interactions over space and time, significantly improving Video QA performance and interpretability.
Contribution
The paper proposes a novel hierarchical object-oriented reasoning framework for Video QA, emphasizing object interactions and hierarchically abstracted representations.
Findings
Achieves new state-of-the-art results on multiple Video QA datasets.
Object-oriented reasoning enhances interpretability and efficiency.
The model effectively captures dynamic object interactions over space and time.
Abstract
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior and interactions. Toward this goal, we propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of a video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR)…
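The abstract describes objects interacting within each stage of the video, with those interactions then interpreted with respect to the query and aggregated over the whole video. A minimal sketch of that general idea follows. It is not the authors' HOSTR units: the function names (attend, reason_over_clip, reason_over_video), the use of scaled dot-product attention, the residual update, and the random features are all assumptions made for illustration, and plain Python lists stand in for learned tensors.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Pool `vectors` into one vector, weighting each by similarity to `query`."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def reason_over_clip(objects, query):
    """Lower level (hypothetical): objects in one clip interact, then are read out by the query."""
    # Interaction step: each object adds an attention-weighted summary of all
    # objects in the clip to its own features (a residual update).
    interacted = [
        [x + c for x, c in zip(obj, attend(obj, objects))]
        for obj in objects
    ]
    # Query step: pool the interacted objects, guided by the question vector.
    return attend(query, interacted)


def reason_over_video(clips, query):
    """Upper level (hypothetical): clip summaries are pooled over time, again guided by the query."""
    clip_summaries = [reason_over_clip(objects, query) for objects in clips]
    return attend(query, clip_summaries)


# Toy input: T clips, N objects per clip, D-dimensional features (all random).
random.seed(0)
T, N, D = 3, 4, 6
clips = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)]
query = [random.gauss(0, 1) for _ in range(D)]

video_repr = reason_over_video(clips, query)
print(len(video_repr))  # 6
```

In this sketch the same query vector conditions both levels, so the hierarchy is only a nesting of one pooling operation. A trained model would use learned projections at each level and feed the final vector to an answer decoder.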
