BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues
Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi

TL;DR
BiST introduces a bidirectional spatio-temporal reasoning framework for video-grounded dialogue that combines fine-grained spatial cues with temporal cues, reporting competitive performance on the AVSD benchmark and gains over prior methods on TGIF-QA.
Contribution
The paper presents a novel neural framework that jointly models spatial and temporal information with bidirectional reasoning for better video-grounded dialogue responses.
Findings
Achieves competitive performance on the AVSD benchmark.
Outperforms prior methods on the TGIF-QA dataset.
Effectively models dynamic spatio-temporal interactions.
Abstract
Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle…
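The two reasoning directions named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single query vector standing in for the textual cue, plain dot-product attention, and video features indexed as `video[frame][region]`. The function names (`attend`, `spatial_to_temporal`, `temporal_to_spatial`, `bidirectional`) and the final concatenation are hypothetical choices made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, vectors):
    """Dot-product attention: query-weighted average of `vectors`."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def spatial_to_temporal(query, video):
    """Pool regions within each frame first, then pool across frames."""
    per_frame = [attend(query, frame) for frame in video]
    return attend(query, per_frame)


def temporal_to_spatial(query, video):
    """Pool each region across frames first, then pool across regions."""
    num_regions = len(video[0])
    per_region = [
        attend(query, [frame[s] for frame in video]) for s in range(num_regions)
    ]
    return attend(query, per_region)


def bidirectional(query, video):
    """Concatenate both directions (an assumed fusion, chosen for simplicity)."""
    return spatial_to_temporal(query, video) + temporal_to_spatial(query, video)


# Toy input: 2 frames, 2 regions per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
query = [1.0, 0.0]
fused = bidirectional(query, video)
print(len(fused))  # one vector per direction, concatenated
```

The two directions differ only in the order of pooling: one selects regions inside each frame before selecting frames, the other tracks each region over time before selecting regions, so they can produce different summaries of the same video for the same query.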
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Diffusion
