MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen, Nick Tzou, Hadas Kotek, Vincent Renkens

TL;DR
This paper introduces MMIU, a new dataset for understanding user intent in multimodal assistants, combining visual and textual data, and evaluates various models for intent classification.
Contribution
The paper presents the first dataset capturing human-annotated user questions and intents in multimodal contexts, along with baseline classification results.
Findings
A multimodal transformer achieves competitive intent classification accuracy.
Visual features significantly improve intent understanding.
The dataset enables benchmarking for multimodal intent classification.
Abstract
In a multimodal assistant, where vision is one of the input modalities, identifying user intent becomes a challenging task because visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. As a result, a dataset that includes visual input (i.e., images or videos) for the corresponding questions targeted at multimodal assistant use cases is not readily available. Research in visual question answering (VQA) and visual question generation (VQG) is a great step forward. However, these efforts do not capture the questions that a visually-abled person would ask a multimodal assistant. Moreover, questions often do not seek information from external knowledge. In this paper, we provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
