Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
Jungin Park, Jiyoung Lee, Kwanghoon Sohn

TL;DR
This paper introduces a structure-aware graph interaction network for video question answering that leverages question-conditioned visual graphs and bridged visual interactions to improve answer accuracy.
Contribution
The paper proposes Bridge to Answer, a framework that learns question-conditioned visual graphs and uses the question graph as an intermediate bridge for interactions between the appearance and motion graphs.
Findings
Outperforms state-of-the-art methods on multiple benchmarks.
Effectively models appearance and motion cues in videos.
Demonstrates the importance of question-conditioned graph interactions.
Abstract
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video by leveraging adequate graph interactions of heterogeneous cross-modal graphs. To realize this, we learn question-conditioned visual graphs by exploiting the relation between video and question, so that, through question-to-visual interactions, each visual node encompasses both visual and linguistic cues. In addition, we propose bridged visual-to-visual interactions to incorporate two complementary kinds of visual information, appearance and motion, by placing the question graph as an intermediate bridge. This bridged architecture allows reliable message passing through compositional semantics of the question to generate an appropriate answer. As a result, our method can learn the question-conditioned visual representations attributed to appearance and motion that show powerful…
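The two interaction types described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes plain dot-product attention with residual updates and no learned weights, and every name in it (`attend`, `bridge_to_answer_sketch`, `random_nodes`) is hypothetical. It only shows the order of message passing: visual nodes first read from the question graph, then each visual graph reads from question nodes that have gathered information from the other visual graph.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(targets, sources):
    """Cross-graph message passing (assumed form): each target node receives
    a softmax-weighted sum of source nodes, added as a residual."""
    dim = len(targets[0])
    updated = []
    for t in targets:
        weights = softmax([dot(t, s) / math.sqrt(dim) for s in sources])
        message = [sum(w * s[k] for w, s in zip(weights, sources))
                   for k in range(dim)]
        updated.append([t[k] + message[k] for k in range(dim)])
    return updated


def bridge_to_answer_sketch(appearance, motion, question):
    # Question-to-visual: condition each visual graph on the question nodes.
    app_q = attend(appearance, question)
    mot_q = attend(motion, question)
    # Bridged visual-to-visual: question nodes collect from one visual graph,
    # then the other visual graph reads from those bridge nodes.
    bridge_from_motion = attend(question, mot_q)
    bridge_from_appearance = attend(question, app_q)
    app_out = attend(app_q, bridge_from_motion)
    mot_out = attend(mot_q, bridge_from_appearance)
    return app_out, mot_out


def random_nodes(count, dim, rng):
    """Hypothetical stand-in for real node features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
appearance = random_nodes(4, 8, rng)  # 4 appearance nodes, 8-dim features
motion = random_nodes(3, 8, rng)      # 3 motion nodes
question = random_nodes(5, 8, rng)    # 5 question (word) nodes
app_out, mot_out = bridge_to_answer_sketch(appearance, motion, question)
print(len(app_out), len(app_out[0]), len(mot_out), len(mot_out[0]))  # 4 8 3 8
```

In this sketch the appearance and motion graphs never attend to each other directly; all cross-visual information flows through the question nodes, which is the sense in which the question graph acts as a bridge.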
