Multi-Modal Open-Domain Dialogue

Kurt Shuster; Eric Michael Smith; Da Ju; Jason Weston

arXiv:2010.01082 · cs.CL · October 5, 2020

TL;DR

This paper builds a multi-modal dialogue agent by combining components from state-of-the-art open-domain dialogue agents and vision models. The best resulting model outperforms strong existing models in multi-modal dialogue while retaining strong text-only conversational ability.

Contribution

The paper studies how to combine open-domain dialogue and vision models, comparing different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and identifies the configuration that performs best in multi-modal dialogue.

Findings

1. Outperforms strong existing multi-modal dialogue models on engagement metrics.

2. Performs comparably to the text-only BlenderBot in conversation quality.

3. Incorporates safety features without reducing engagement performance.

Abstract

Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build agents with human-like abilities, we must expand beyond handling just text. A particularly important topic is the ability to see images and communicate about what is perceived. With the goal of engaging humans in multi-modal dialogue, we investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models. We study incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously…
