TL;DR
This paper introduces DualVGR, a Dual-Visual Graph Reasoning Unit for video question answering that improves multi-step reasoning by filtering out question-irrelevant visual features and modeling the relations between appearance and motion features. It reports state-of-the-art results on MSVD-QA and SVQA and competitive results on MSRVTT-QA.
Contribution
The paper proposes a dual-visual graph reasoning framework that combines an explainable Query Punishment Module with a multi-view graph attention network to improve VideoQA.
Findings
Achieves state-of-the-art on MSVD-QA and SVQA datasets.
Demonstrates competitive results on MSRVTT-QA.
Effectively filters irrelevant features during reasoning.
Abstract
Video question answering is a challenging task that requires agents to understand rich video content and perform spatial-temporal reasoning. However, existing graph-based methods fail to perform multi-step reasoning well, neglecting two properties of VideoQA: (1) even for the same video, different questions may require different numbers of video clips or objects to infer the answer with relational reasoning; (2) during reasoning, appearance and motion features have a complicated interdependence and are both correlated with and complementary to each other. Based on these observations, we propose a Dual-Visual Graph Reasoning Unit (DualVGR) which reasons over videos in an end-to-end fashion. The first contribution of our DualVGR is the design of an explainable Query Punishment Module, which can filter out irrelevant visual features through multiple cycles of reasoning. The second…
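To make the idea of filtering irrelevant visual features over multiple cycles of reasoning more concrete, here is a minimal sketch of question-conditioned soft gating. It is not the paper's Query Punishment Module: the function name `query_gate`, the sigmoid-of-dot-product relevance score, and the toy two-dimensional vectors are all assumptions chosen for illustration, since the text above does not give the actual formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def query_gate(features, query, steps=2):
    """Hypothetical soft filter: at every step, scale each visual feature
    by a relevance score in (0, 1) computed against the question vector.
    Features that keep scoring low shrink toward zero over the steps."""
    gated = [list(f) for f in features]
    for _ in range(steps):
        scores = [sigmoid(dot(f, query)) for f in gated]
        gated = [[s * x for x in f] for s, f in zip(scores, gated)]
    return gated


# Toy data (assumed): one clip aligned with the question, one opposed to it.
query = [1.0, 0.0]
clips = [[2.0, 0.0], [-2.0, 0.0]]
out = query_gate(clips, query, steps=3)
norms = [norm(v) for v in out]
```

In this toy setting the clip aligned with the question keeps most of its magnitude, while the opposed clip is suppressed a little more at each step. A learned model would replace the fixed dot-product score with trainable projections.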
