Structured Co-reference Graph Attention for Video-grounded Dialogue
Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo

TL;DR
This paper introduces SCGA, a novel video-grounded dialogue system that effectively models co-reference and semantic structure across modalities and time, significantly improving response quality on challenging benchmarks.
Contribution
The paper proposes the Structured Co-reference Graph Attention (SCGA) model, combining a structured co-reference resolver and spatio-temporal reasoning to enhance video-grounded dialogue understanding.
Findings
SCGA outperforms state-of-the-art systems on AVSD@DSTC7, AVSD@DSTC8, and TVQA datasets.
Extensive ablation studies confirm the effectiveness of each component.
Qualitative analysis shows improved interpretability of the model.
Abstract
A video-grounded dialogue system referred to as the Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video while keeping track of the dialogue context. Although recent efforts have made great strides in improving the quality of the response, performance is still far from satisfactory. The two main challenges are (1) how to deduce co-reference among multiple modalities and (2) how to reason on the rich underlying semantic structure of video with complex spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured Co-reference Resolver that performs dereferencing by building a structured graph over multiple modalities and (2) a Spatio-temporal Video Reasoner that captures the local-to-global dynamics of video via gradually neighboring graph attention. SCGA makes use of a pointer network to…
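To make the phrase "gradually neighboring graph attention" more concrete, here is a minimal sketch of one possible reading: each frame node attends only to nodes within a temporal window, and the window widens over successive passes so that context grows from local to global. This is an illustration under stated assumptions, not the authors' implementation. The function names (`windowed_attention`, `local_to_global`), the radius schedule `(1, 2, 4)`, the dot-product scoring, and the toy frame features are all hypothetical; the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def windowed_attention(nodes, radius):
    """One attention pass: node i attends only to nodes j with |i - j| <= radius.

    Assumption: plain scaled dot-product scores with no learned projections.
    """
    dim = len(nodes[0])
    scale = math.sqrt(dim)
    out = []
    for i, query in enumerate(nodes):
        lo, hi = max(0, i - radius), min(len(nodes), i + radius + 1)
        neighbours = nodes[lo:hi]
        weights = softmax([dot(query, key) / scale for key in neighbours])
        # Weighted average of the neighbouring node features.
        out.append(
            [sum(w * key[d] for w, key in zip(weights, neighbours)) for d in range(dim)]
        )
    return out


def local_to_global(nodes, radii=(1, 2, 4)):
    """Apply attention passes with a widening window (hypothetical schedule)."""
    for radius in radii:
        nodes = windowed_attention(nodes, radius)
    return nodes


# Toy per-frame features, ordered in time (made up for illustration).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
refined = local_to_global(frames)
```

Because every pass outputs a convex combination of neighbouring features, early passes mix only adjacent frames, while later passes let information from distant frames reach each node.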
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Softmax · Pointer Network
