Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

TL;DR
This paper introduces Future Utterance Prediction, the task of predicting the next utterance in a video from its visual frames and transcribed speech. A model trained on large-scale unlabeled instructional videos outperforms text-only baselines.
Contribution
The paper presents a novel co-attentional multimodal video transformer and a large-scale training approach for visually contextualized utterance prediction that requires no manual annotations.
Findings
Outperforms text-only baselines in multimodal utterance prediction
Achieves state-of-the-art results on multiple VideoQA benchmarks
Demonstrates effectiveness of large-scale unlabeled video data for multimodal learning
Abstract
While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Prediction task. Our task involves predicting the next utterance in a video, using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use…
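To illustrate why this task needs no manual labels, here is a minimal sketch of how training examples might be derived from a video's timestamped transcript alone: each utterance serves as the target for the utterances (and the frames spanning them) that precede it. The names `Utterance`, `Example`, `build_examples`, and the `context_size` parameter are hypothetical, chosen for illustration. The abstract does not specify how the authors segment context or sample frames, so this is an assumption rather than their pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    start: float  # start time in seconds (e.g. from ASR timestamps)
    end: float    # end time in seconds
    text: str     # transcribed speech


@dataclass
class Example:
    context: List[str]                 # preceding utterances (textual context)
    frame_window: Tuple[float, float]  # time span from which to sample frames
    target: str                        # the next utterance to predict


def build_examples(utterances: List[Utterance], context_size: int = 2) -> List[Example]:
    """Pair every utterance with the `context_size` utterances before it.

    Hypothetical helper: the supervision signal is the transcript itself,
    so no human annotation is involved.
    """
    examples = []
    for i in range(context_size, len(utterances)):
        ctx = utterances[i - context_size:i]
        examples.append(
            Example(
                context=[u.text for u in ctx],
                # Visual context covers the same span as the spoken context.
                frame_window=(ctx[0].start, ctx[-1].end),
                target=utterances[i].text,
            )
        )
    return examples
```

With a four-utterance transcript and `context_size=2`, this yields two examples, and the first one targets the third utterance. A model would then consume the frames in `frame_window` together with `context` to predict `target`.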
