MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants

Alkesh Patel; Joel Ruben Antony Moniz; Roman Nguyen; Nick Tzou; Hadas Kotek; Vincent Renkens

arXiv:2110.06416 · cs.CV · November 2, 2021

TL;DR

This paper introduces MMIU, a new dataset for understanding user intent in multimodal assistants, combining visual and textual data, and evaluates various models for intent classification.

Contribution

The paper presents the first dataset capturing human-annotated user questions and intents in multimodal contexts, along with baseline classification results.

Findings

01. Multimodal transformer achieves competitive intent classification accuracy.

02. Visual features significantly improve intent understanding.

03. The dataset enables benchmarking for multimodal intent classification.

Abstract

In a multimodal assistant, where vision is one of the input modalities, identifying user intent becomes challenging because visual input can influence the outcome. Current digital assistants take spoken input and try to determine user intent from conversational or device context. As a result, a dataset that includes visual input (i.e., images or videos) for the corresponding questions targeted at multimodal assistant use cases is not readily available. Research in visual question answering (VQA) and visual question generation (VQG) is a great step forward. However, these efforts do not capture the questions that a visually-abled person would ask a multimodal assistant. Moreover, the questions often do not seek information from external knowledge. In this paper, we provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning