Building Multimodal AI Chatbots
Min Young Lee

TL;DR
This paper presents a multimodal chatbot system that combines image retrieval with image-grounded response generation for open-domain human-AI conversations involving shared images, and reports better automatic and human evaluation results than existing baselines.
Contribution
It introduces a complete multimodal chatbot system with a ViT-BERT image retriever and ViT-GPT-2 response generator trained on PhotoChat, outperforming existing baselines in automatic and human evaluations.
Findings
Image retriever outperforms VSE++ and SCAN baselines.
Response generator surpasses Divter baseline in PPL and BLEU scores.
Human evaluation shows higher image-groundedness and engagingness.
Abstract
This work aims to create a multimodal AI system that chats with humans and shares relevant photos. While earlier works were limited to dialogues about specific objects or scenes within images, recent works have incorporated images into open-domain dialogues. However, their response generators are unimodal, accepting text input but no image input, thus prone to generating responses contradictory to the images shared in the dialogue. Therefore, this work proposes a complete chatbot system using two multimodal deep learning models: an image retriever that understands texts and a response generator that understands images. The image retriever, implemented by ViT and BERT, selects the most relevant image given the dialogue history and a database of images. The response generator, implemented by ViT and GPT-2/DialoGPT, generates an appropriate response given the dialogue history and the most…
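The retrieval step described in the abstract (pick the most relevant image for a dialogue history from a database of images) can be illustrated with a minimal sketch. It assumes the dialogue history and each image have already been encoded into fixed-length vectors and that relevance is scored by cosine similarity. Both are assumptions made for illustration: the text above does not state how the paper scores dialogue-image pairs, and the names `cosine` and `retrieve` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(dialogue_vec, image_vecs):
    """Return (index of best-scoring image, list of all scores).

    Assumption: a similarity score stands in for the retriever's relevance
    score; the real system's scoring function is not given in the source.
    """
    scores = [cosine(dialogue_vec, v) for v in image_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings (hypothetical): in a real system these would come from a
# text encoder (dialogue history) and an image encoder (database images).
dialogue_vec = [0.9, 0.1, 0.0]
image_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best, scores = retrieve(dialogue_vec, image_vecs)
print(best, [round(s, 3) for s in scores])
```

In the full system the selected image would then be passed, together with the dialogue history, to the response generator.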
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · WordPiece · Dropout · Weight Decay · Multi-Head Attention · Attention Dropout
