Maria: A Visual Experience Powered Conversational Agent
Zujie Liang, Huang Hu, Can Xu, Chongyang Tao, Xiubo Geng, Yining Chen, Fan Liang, Daxin Jiang

TL;DR
Maria is a novel visual experience-powered conversational agent that retrieves images from a large-scale index and generates informed responses grounded in visual knowledge, advancing open-ended image-grounded dialogue.
Contribution
This work introduces Maria, a fully open-ended image-grounded conversational model with a retrieval-based visual knowledge component. Unlike prior models, which rely on paired dialog-image data, Maria retrieves its visual grounding from a large-scale image index.
Findings
Maria outperforms state-of-the-art methods on automatic metrics.
Maria generates responses with visual commonsense of the physical world.
Extensive experiments validate the effectiveness of Maria's approach.
Abstract
Arguably, conversational agents' visual perception of the physical world is a key way for them to exhibit human-like intelligence. Image-grounded conversation has thus been proposed to address this challenge. Existing work focuses on multimodal dialog models that ground the conversation in a given image. In this paper, we go a step further and study image-grounded conversation under a fully open-ended setting, where no paired dialog and image are assumed to be available. Specifically, we present Maria, a neural conversation agent powered by visual world experiences retrieved from a large-scale image index. Maria consists of three flexible components: a text-to-image retriever, a visual concept detector, and a visual-knowledge-grounded response generator. The retriever aims to retrieve an image correlated with the dialog from an image index, while the visual concept…
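The three-component flow named in the abstract (retriever, concept detector, generator) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings, the cosine-similarity ranking, the `IndexedImage` type, and all function names are assumptions made for illustration, and the detector and generator are stand-in stubs rather than neural models.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class IndexedImage:
    """Hypothetical entry in the image index (fields are illustrative)."""
    image_id: str
    embedding: List[float]  # assumed precomputed image embedding
    concepts: List[str]     # assumed labels a concept detector would produce


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_image(dialog_embedding: List[float],
                   index: List[IndexedImage]) -> IndexedImage:
    """Retriever stub: return the indexed image most similar to the dialog."""
    return max(index, key=lambda img: cosine(dialog_embedding, img.embedding))


def detect_concepts(image: IndexedImage) -> List[str]:
    """Detector stub: a real detector would run on pixels, not stored labels."""
    return image.concepts


def generate_response(dialog: str, concepts: List[str]) -> str:
    """Generator stub: a real model would condition on dialog and concepts."""
    return f"[reply to {dialog!r} grounded in: {', '.join(concepts)}]"


def respond(dialog: str, dialog_embedding: List[float],
            index: List[IndexedImage]) -> str:
    """Chain the three stages: retrieve, detect concepts, generate."""
    image = retrieve_image(dialog_embedding, index)
    return generate_response(dialog, detect_concepts(image))


# Toy index with made-up two-dimensional embeddings.
index = [
    IndexedImage("beach", [0.9, 0.2], ["sand", "sea", "umbrella"]),
    IndexedImage("kitchen", [0.1, 0.9], ["stove", "pan"]),
]
reply = respond("I went surfing last weekend.", [1.0, 0.1], index)
```

The point of the sketch is the data flow: the dialog alone selects an image, so no paired dialog-image data is needed at inference time, and the generator sees only the dialog plus whatever the detector extracts from the retrieved image.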
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
