Multi-Modal Open-Domain Dialogue
Kurt Shuster, Eric Michael Smith, Da Ju, Jason Weston

TL;DR
This paper develops a multi-modal dialogue agent that integrates vision and language models. The agent outperforms existing models in multi-modal dialogue while maintaining strong text-only conversational abilities.
Contribution
The paper introduces a multi-modal dialogue system that combines components from state-of-the-art open-domain dialogue agents and vision models, and it studies image fusion schemes together with domain-adaptive pre-training and fine-tuning strategies for such a system.
Findings
Outperforms existing multi-modal dialogue models in engagement metrics.
Maintains comparable performance to text-only BlenderBot in conversation quality.
Incorporates safety features without reducing engagement performance.
Abstract
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build agents with human-like abilities, we must expand beyond handling just text. A particularly important topic is the ability to see images and communicate about what is perceived. With the goal of engaging humans in multi-modal dialogue, we investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models. We study incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously…
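The abstract mentions "different image fusion schemes" without describing them here. As a purely illustrative sketch, and not the authors' implementation, the toy code below contrasts two common ways of combining an image feature vector with text token embeddings: fusing before the text encoder ("early") or after it ("late"). Every name in it (`project`, `early_fusion`, `late_fusion`, `toy_encoder`) and the mean-mixing stand-in encoder are assumptions made for this example.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def project(image_features: Vector, weights: List[Vector]) -> Vector:
    """Linearly map image features to the text embedding size.

    `weights` has one row per output dimension (hypothetical parameters).
    """
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def toy_encoder(seq: Sequence) -> Sequence:
    """Stand-in for a Transformer encoder (assumption): adds the sequence
    mean to every position so that positions influence one another."""
    dim = len(seq[0])
    mean = [sum(v[i] for v in seq) / len(seq) for i in range(dim)]
    return [[v[i] + mean[i] for i in range(dim)] for v in seq]


def early_fusion(image_features: Vector, token_embeddings: Sequence,
                 weights: List[Vector],
                 encoder: Callable[[Sequence], Sequence]) -> Sequence:
    """Prepend the projected image vector, then encode image and text jointly."""
    image_token = project(image_features, weights)
    return encoder([image_token] + token_embeddings)


def late_fusion(image_features: Vector, token_embeddings: Sequence,
                weights: List[Vector],
                encoder: Callable[[Sequence], Sequence]) -> Sequence:
    """Encode the text alone, then prepend the projected image vector."""
    encoded_text = encoder(token_embeddings)
    image_token = project(image_features, weights)
    return [image_token] + encoded_text


# Toy inputs: a 3-dimensional image feature and two 2-dimensional token embeddings.
image = [1.0, 2.0, 3.0]
proj_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tokens = [[0.0, 0.0], [2.0, 4.0]]

early = early_fusion(image, tokens, proj_weights, toy_encoder)
late = late_fusion(image, tokens, proj_weights, toy_encoder)
```

In this sketch both schemes return a sequence of the same length. The difference is that under early fusion the image vector passes through the encoder and can influence the text representations, whereas under late fusion the text is encoded without seeing the image.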
