Object-Centric Representation Learning for Video Question Answering
Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

TL;DR
This paper introduces an object-centric representation framework for video question answering, enabling more systematic reasoning by constructing relational graphs of objects and their interactions over time.
Contribution
It proposes a novel query-guided approach to convert videos into dynamic relational graphs for improved reasoning in Video QA tasks.
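To make "query-guided dynamic relational graph" concrete, here is a minimal sketch, not the authors' architecture. It assumes each frame supplies one feature vector per tracked object and that the question is already encoded as a vector of the same size. The function name `build_dynamic_graph`, the softmax relevance weighting, and the cosine-similarity edges are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    # Cosine similarity; returns 0.0 when either vector has zero norm.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def softmax(scores: List[float]) -> List[float]:
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_dynamic_graph(
    frames: List[List[Vector]], query: Vector
) -> List[List[List[float]]]:
    """Return one weighted adjacency matrix per frame.

    frames[t][i] is the feature of object i at time t (hypothetical layout).
    Edge (i, j) is the objects' feature similarity, scaled by how relevant
    both objects are to the query (an assumed weighting scheme).
    """
    graphs = []
    for objects in frames:
        # Query guidance: score each object against the question vector.
        relevance = softmax([dot(obj, query) for obj in objects])
        n = len(objects)
        adj = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-loops in this sketch
                    sim = cosine(objects[i], objects[j])
                    adj[i][j] = relevance[i] * relevance[j] * sim
        graphs.append(adj)
    return graphs
```

The sequence of per-frame adjacency matrices plays the role of the dynamic graph: the edges change over time as object features change, and objects unrelated to the question contribute weaker edges.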
Findings
Object-centric approach improves reasoning accuracy.
Relational graphs effectively capture spatio-temporal object interactions.
Framework outperforms existing methods on major Video QA datasets.
Abstract
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities to integrate video processing, language understanding, binding abstract linguistic concepts to concrete visual artifacts, and deliberative reasoning over spacetime. Neural networks offer a promising approach to reach this potential through learning from examples rather than handcrafting features and rules. However, neural networks are predominantly feature-based: they map data to unstructured vectorial representations and thus can fall into the trap of exploiting shortcuts through surface statistics instead of the true systematic reasoning seen in symbolic systems. To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level…
