TL;DR
This paper introduces Situated Interactive MultiModal Conversations (SIMMC), a new direction for training multimodal virtual assistants that handle grounded conversations involving vision, memory of previous interactions, and multimodal actions, supported by new datasets and evaluation tasks.
Contribution
The paper presents two multimodal dialogue datasets (~13K human-human dialogs and ~169K utterances in total) covering furniture and fashion shopping, a unified annotation framework, and benchmark tasks for training and evaluating multimodal conversational agents.
Findings
Existing models, benchmarked on the SIMMC tasks, serve as strong baselines.
Rich multimodal interactions can be effectively modeled and evaluated.
Datasets and tools are publicly available for further research.
Abstract
Next generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, in addition to the user's utterances), and perform multimodal actions (e.g., displaying a route in addition to generating the system's utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture (grounded in a shared virtual environment) and, (b) fashion (grounded in an evolving set of images). We also provide logs of the items appearing in each scene, and contextual NLU and coreference annotations, using a novel and…
