Look Before you Speak: Visually Contextualized Utterances

Paul Hongsuck Seo; Arsha Nagrani; Cordelia Schmid

arXiv:2012.05710·cs.CV·March 30, 2021

TL;DR

This paper introduces Future Utterance Prediction, a task in which a model predicts the next utterance in a video from both visual frames and transcribed speech. The model is trained on large-scale unlabeled instructional videos, without manual annotations, and outperforms text-only baselines.

Contribution

The paper presents a novel co-attentional multimodal video transformer and a large-scale training approach for visually contextualized utterance prediction that requires no manual annotations.
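
To give a rough sense of what "co-attentional" means here, the following is a minimal sketch of a generic co-attention step in which text features attend over video features and vice versa. It is not the authors' architecture: the function names (`cross_attend`, `co_attention`), the feature sizes, and the random inputs are all assumptions made for illustration, and learned projections, multiple heads, residual connections, and layer normalization are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query vector attends over all context vectors
    using scaled dot-product attention (no learned weights)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(dim)
            for c in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def co_attention(text_feats, video_feats):
    """Hypothetical co-attention step: each modality queries the other."""
    text_out = cross_attend(text_feats, video_feats)   # text attends to video
    video_out = cross_attend(video_feats, text_feats)  # video attends to text
    return text_out, video_out


# Illustrative inputs: 3 word features and 5 frame features of dimension 4.
rng = random.Random(0)
text_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
video_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

text_out, video_out = co_attention(text_feats, video_feats)
print(len(text_out), len(video_out))
```

Each output keeps the sequence length of its own modality while its content is a weighted average of the other modality's features, which is the basic idea behind letting speech and visual context inform each other.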

Findings

01. Outperforms text-only baselines in multimodal utterance prediction

02. Achieves state-of-the-art results on multiple VideoQA benchmarks

03. Demonstrates effectiveness of large-scale unlabeled video data for multimodal learning

Abstract

While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Prediction task. Our task involves predicting the next utterance in a video, using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use…
