Building Goal-Oriented Dialogue Systems with Situated Visual Context
Sanchit Agarwal, Jan Jezabek, Arijit Biswas, Emre Barut, Shuyang Gao, Tagyoung Chung

TL;DR
This paper introduces a multimodal dialogue system that combines visual context understanding with a conversational agent, enabling more interactive, goal-oriented virtual assistants, especially in visually rich scenarios such as furniture shopping.
Contribution
It presents a novel multimodal framework that jointly reasons over visual and conversational context, along with a new dataset generation method for training such models.
Findings
Achieved 85% model accuracy in visual context reasoning
Developed a synthetic multimodal dialog simulator
Demonstrated effectiveness in a furniture shopping prototype
Abstract
Most popular goal-oriented dialogue agents are capable of understanding the conversational context. However, with the surge of virtual assistants with screens, the next generation of agents is required to also understand screen context in order to provide a proper interactive experience and better understand users' goals. In this paper, we propose a novel multimodal conversational framework, where the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context. Specifically, we propose a new model that can reason over the visual context within a conversation and populate API arguments with visual entities given the user query. Our model can recognize visual features such as color and shape as well as metadata-based features such as price or star rating associated with a visual entity. In order to train our…
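To make the idea of populating API arguments with visual entities concrete, here is a minimal rule-based sketch in Python. It is not the authors' model, which is learned and conditions jointly on the dialogue and the screen; this toy version only matches query words against attributes of on-screen items. The `VisualEntity` fields, the `ShowDetails` API name, and the keyword rules are all hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class VisualEntity:
    """One item shown on screen (hypothetical schema)."""
    entity_id: str
    color: str     # visual feature
    shape: str     # visual feature
    price: float   # metadata feature
    rating: float  # metadata feature (star rating)


def resolve_entities(query: str, screen: list[VisualEntity]) -> list[str]:
    """Return ids of on-screen entities that the query appears to refer to."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    candidates = list(screen)

    # Filter by visual features only when the query mentions one.
    colors = {e.color for e in screen} & words
    if colors:
        candidates = [e for e in candidates if e.color in colors]
    shapes = {e.shape for e in screen} & words
    if shapes:
        candidates = [e for e in candidates if e.shape in shapes]

    # Metadata-based references pick a single item from what remains.
    if candidates and "cheapest" in words:
        candidates = [min(candidates, key=lambda e: e.price)]
    elif candidates and {"best", "highest"} & words:
        candidates = [max(candidates, key=lambda e: e.rating)]

    return [e.entity_id for e in candidates]


def next_action(query: str, screen: list[VisualEntity]) -> dict:
    """Build an API call whose arguments are the resolved entity ids.

    The API name "ShowDetails" is an assumption for this sketch.
    """
    return {"api": "ShowDetails",
            "args": {"entity_ids": resolve_entities(query, screen)}}


screen = [
    VisualEntity("t1", "red", "round", 120.0, 4.5),
    VisualEntity("t2", "blue", "square", 80.0, 4.0),
    VisualEntity("t3", "red", "square", 200.0, 4.8),
]
print(next_action("show me the cheapest red table", screen))
```

A learned model would replace the keyword rules with scores over entity features and dialogue history, but the output contract sketched here (an action plus entity-valued arguments) follows what the abstract describes.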
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · AI in Service Interactions
