Building Goal-Oriented Dialogue Systems with Situated Visual Context

Sanchit Agarwal; Jan Jezabek; Arijit Biswas; Emre Barut; Shuyang Gao; Tagyoung Chung

arXiv:2111.11576·cs.LG·November 26, 2021

TL;DR

This paper introduces a multimodal dialogue system that combines visual context understanding with a conversational agent, enabling more interactive, goal-oriented virtual assistants, especially in visually rich scenarios such as furniture shopping.

Contribution

It presents a novel multimodal framework that jointly reasons over visual and conversational context, along with a new dataset generation method for training such models.

Findings

01. Achieved 85% model accuracy in visual context reasoning

02. Developed a synthetic multimodal dialog simulator

03. Demonstrated effectiveness in a furniture shopping prototype

Abstract

Most popular goal-oriented dialogue agents are capable of understanding the conversational context. However, with the surge of virtual assistants with screens, the next generation of agents is required to also understand screen context in order to provide a proper interactive experience and better understand users' goals. In this paper, we propose a novel multimodal conversational framework, where the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context. Specifically, we propose a new model that can reason over the visual context within a conversation and populate API arguments with visual entities given the user query. Our model can recognize visual features such as color and shape as well as metadata-based features such as price or star rating associated with a visual entity. In order to train our…
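To make the idea of populating API arguments with visual entities concrete, here is a minimal sketch in Python. It is a rule-based stand-in, not the model described in the paper: the `VisualEntity` class, the `resolve_entities` and `build_api_call` functions, the `add_to_cart` action name, and the sample screen contents are all hypothetical and chosen only for illustration. It assumes the user query has already been parsed into attribute constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisualEntity:
    """Hypothetical on-screen item with visual and metadata features."""
    entity_id: str
    color: str      # visual feature
    shape: str      # visual feature
    price: float    # metadata feature
    rating: float   # metadata feature (star rating)


def resolve_entities(
    screen: List[VisualEntity],
    color: Optional[str] = None,
    shape: Optional[str] = None,
    max_price: Optional[float] = None,
    min_rating: Optional[float] = None,
) -> List[str]:
    """Return IDs of on-screen entities that satisfy every given constraint."""
    matches = []
    for entity in screen:
        if color is not None and entity.color != color:
            continue
        if shape is not None and entity.shape != shape:
            continue
        if max_price is not None and entity.price > max_price:
            continue
        if min_rating is not None and entity.rating < min_rating:
            continue
        matches.append(entity.entity_id)
    return matches


def build_api_call(action: str, screen: List[VisualEntity], **constraints) -> dict:
    """Pair an action name with the visual entities that fill its arguments."""
    return {"action": action, "arguments": resolve_entities(screen, **constraints)}


# Hypothetical screen contents for a furniture shopping page.
screen = [
    VisualEntity("chair_1", "red", "round", 120.0, 4.5),
    VisualEntity("chair_2", "blue", "square", 80.0, 3.9),
    VisualEntity("table_1", "red", "square", 200.0, 4.8),
]

# A query like "add the red one under 150 dollars", already parsed into constraints.
call = build_api_call("add_to_cart", screen, color="red", max_price=150.0)
print(call)
```

In the framework the abstract describes, the action and its arguments are derived jointly by a learned model conditioned on both contexts; the hand-written filters here only illustrate the shape of the input (screen entities plus a query) and the output (an API call with entity arguments).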

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · AI in Service Interactions