Building Multimodal AI Chatbots

Min Young Lee

arXiv:2305.03512 · cs.CL · May 8, 2023 · 1 citation

Open Access · 1 repository

TL;DR

This paper presents a multimodal chatbot system that combines image retrieval with image-grounded response generation for open-domain human-AI conversations involving shared images. The system outperforms existing baselines in both automatic and human evaluations.

Contribution

The paper introduces a complete multimodal chatbot system with a ViT-BERT image retriever and a ViT-GPT-2 response generator, both trained on PhotoChat, that outperforms existing baselines in automatic and human evaluations.

Findings

01. The image retriever outperforms the VSE++ and SCAN baselines.

02. The response generator surpasses the Divter baseline in perplexity (PPL) and BLEU scores.

03. Human evaluation shows higher image-groundedness and engagingness.

Abstract

This work aims to create a multimodal AI system that chats with humans and shares relevant photos. While earlier works were limited to dialogues about specific objects or scenes within images, recent works have incorporated images into open-domain dialogues. However, their response generators are unimodal, accepting text input but no image input, thus prone to generating responses contradictory to the images shared in the dialogue. Therefore, this work proposes a complete chatbot system using two multimodal deep learning models: an image retriever that understands texts and a response generator that understands images. The image retriever, implemented by ViT and BERT, selects the most relevant image given the dialogue history and a database of images. The response generator, implemented by ViT and GPT-2/DialoGPT, generates an appropriate response given the dialogue history and the most…
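The retrieval step described in the abstract (select the most relevant image given the dialogue history and a database of images) can be illustrated with a minimal sketch. It assumes the dialogue history and each image have already been encoded into fixed-size vectors; the toy vectors below merely stand in for BERT and ViT outputs. Scoring by cosine similarity is an assumption made for illustration, since this page does not state the scoring function, and the names `cosine` and `retrieve_image` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_image(dialogue_embedding, image_embeddings):
    """Return (image_id, score) for the image most similar to the dialogue embedding.

    Assumption: similarity is cosine; the actual system may score differently.
    """
    best_id, best_score = None, float("-inf")
    for image_id, embedding in image_embeddings.items():
        score = cosine(dialogue_embedding, embedding)
        if score > best_score:
            best_id, best_score = image_id, score
    return best_id, best_score


# Toy data: hypothetical 3-dimensional embeddings, not real model outputs.
image_embeddings = {
    "beach.jpg": [1.0, 0.0, 0.0],
    "cake.jpg": [0.0, 1.0, 0.0],
    "dog.jpg": [0.0, 0.0, 1.0],
}
dialogue_embedding = [0.9, 0.1, 0.0]

best_id, best_score = retrieve_image(dialogue_embedding, image_embeddings)
print(best_id)
```

In a full pipeline, the retrieved image would then be passed, together with the dialogue history, to the response generator.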

Code & Models

Repositories

minniie/multimodal_chat (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · WordPiece · Dropout · Weight Decay · Multi-Head Attention · Attention Dropout