BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Hung Le; Doyen Sahoo; Nancy F. Chen; Steven C.H. Hoi

arXiv:2010.10095 · cs.CV · October 21, 2020 · 1 citation

TL;DR

BiST is a bidirectional spatio-temporal reasoning framework for video-grounded dialogue. It combines fine-grained spatial cues with temporal cues from the video, reaching competitive results on AVSD and outperforming prior methods on TGIF-QA.

Contribution

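The paper presents a vision-language neural framework that models spatial and temporal video information jointly, passing information between the two feature spaces in both directions (spatial-to-temporal and temporal-to-spatial) to produce better video-grounded dialogue responses.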

Findings

01 Achieves competitive performance on the AVSD benchmark.

02 Outperforms prior methods on the TGIF-QA dataset.

03 Models dynamic spatio-temporal interactions effectively.

Abstract

Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle…
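The two reasoning directions named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single query vector, plain scaled dot-product attention, and a video given as a nested list `video[t][s]` of feature vectors (time step `t`, spatial region `s`). All function names here are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Scaled dot-product attention: query-weighted average of `vectors`."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def spatial_to_temporal(query, video):
    """Assumed direction 1: pool regions within each time step, then pool over time."""
    per_step = [attend(query, regions) for regions in video]
    return attend(query, per_step)


def temporal_to_spatial(query, video):
    """Assumed direction 2: pool each region over time, then pool over regions."""
    num_regions = len(video[0])
    per_region = [
        attend(query, [step[s] for step in video]) for s in range(num_regions)
    ]
    return attend(query, per_region)


def bidirectional_summary(query, video):
    """Concatenate both directions into one query-conditioned video vector."""
    return spatial_to_temporal(query, video) + temporal_to_spatial(query, video)


if __name__ == "__main__":
    # Toy input: 2 time steps, 3 regions, 2-dimensional features.
    toy_video = [
        [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
        [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]],
    ]
    toy_query = [1.0, -1.0]
    print(bidirectional_summary(toy_query, toy_video))
```

For features of dimension d, the result has length 2d: the first half comes from the spatial-to-temporal pass and the second half from the temporal-to-spatial pass. A real model would likely use token-level queries, learned projections, and multi-head attention; those are omitted here to keep the two directions easy to compare.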

Code & Models

Repositories

salesforce/BiST (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Diffusion