Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Thao Minh Le; Vuong Le; Svetha Venkatesh; Truyen Tran

arXiv:2010.10019·cs.CV·January 5, 2021

TL;DR

This paper introduces Hierarchical Conditional Relation Networks (HCRN) for multimodal video question answering, effectively modeling complex spatio-temporal and multimodal relations to improve performance on benchmark datasets.

Contribution

The paper proposes a reusable neural unit, the Conditional Relation Network (CRN), and a hierarchical architecture built from it (HCRN) that selects query-relevant content and models relations for Video QA.

Findings

01

Achieved consistent improvements over state-of-the-art methods on TGIF-QA and TVQA datasets.

02

Demonstrated the effectiveness of CRN units in flexible multimodal relation encoding.

03

Validated the hierarchical approach for capturing video content and associated information.

Abstract

Video QA challenges modelers on multiple fronts. Modeling video necessitates building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations in response to the query. To address these requirements, we start with two insights: (a) content selection and relation construction can be jointly encapsulated into a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a), this paper introduces a general, reusable neural unit dubbed Conditional Relation Network (CRN), taking as input a set of tensorial objects and translating into a new set of objects that…
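The two insights in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract is truncated here, so the details are assumptions: the function names (`mean_pool`, `condition`, `conditional_relation_unit`), the parameters `k_max` and `num_subsets`, the mean pooling over sampled subsets, and the element-wise product used as the conditioning step are all hypothetical stand-ins for what would be learned sub-networks in a trained model.

```python
import itertools
import random


def mean_pool(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def condition(vector, cond):
    """Hypothetical fusion step: scale each feature by the conditioning vector.

    A trained model would use a learned sub-network here (assumption).
    """
    return [v * c for v, c in zip(vector, cond)]


def conditional_relation_unit(objects, cond, k_max, num_subsets, seed=0):
    """Map a set of feature vectors to a new set, one output per subset size k.

    For each k in 2..k_max: sample subsets of k objects, pool each subset,
    condition the pooled vector on `cond`, then pool across the subsets.
    """
    rng = random.Random(seed)
    outputs = []
    for k in range(2, min(k_max, len(objects)) + 1):
        all_subsets = list(itertools.combinations(objects, k))
        sampled = rng.sample(all_subsets, min(num_subsets, len(all_subsets)))
        related = [condition(mean_pool(subset), cond) for subset in sampled]
        outputs.append(mean_pool(related))
    return outputs


# Toy data (made up): a 3-dim "question" vector and two clips of three frames.
question = [1.0, 0.5, 0.0]
clips = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]],
    [[0.0, 4.0, 8.0], [4.0, 4.0, 4.0], [2.0, 0.0, 6.0]],
]

# Insight (b): apply the same unit at clip level, then again at video level.
clip_summaries = [
    mean_pool(conditional_relation_unit(frames, question, k_max=3, num_subsets=2))
    for frames in clips
]
video_objects = conditional_relation_unit(
    clip_summaries, question, k_max=2, num_subsets=1
)
```

Because the same unit is applied first to frames within a clip and then to the clip summaries, the sketch mirrors the hierarchical composition the abstract describes. Conditioning on the question at both levels is what lets the query influence which content survives each stage.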

Taxonomy

Methods: Conditional Relation Network