Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances
Thao Minh Le, Nobuyuki Shimizu, Takashi Miyazaki, Koichi Shinoda

TL;DR
This paper introduces a multi-modal deep learning model and a new dataset, ARVSU, for recognizing the addressee of an utterance in complex social scenes from visual cues (eye gaze) and utterance transcripts.
Contribution
It presents the first end-to-end deep learning model combining vision and transcripts for addressee recognition and introduces a comprehensive dataset for this task.
Findings
Multi-modal model improves addressee prediction accuracy.
New dataset enables research in diverse social scenarios.
Model demonstrates potential for understanding human intentions.
Abstract
With the widespread use of intelligent systems, such as smart speakers, addressee recognition has become a concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and in hospitals. Because previous studies typically focused only on pre-specified tasks with limited conversational situations such as controlling smart homes, we created a mock dataset called Addressee Recognition in Visual Scenes with Utterances (ARVSU) that contains a vast body of image variations in visual scenes with an annotated utterance and a corresponding addressee for each scenario. We also propose a multi-modal deep-learning-based model that takes different human cues, specifically eye gazes and transcripts of an utterance corpus, into account to predict the conversational addressee from a specific speaker's…
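To make the idea of combining gaze cues with utterance transcripts concrete, here is a minimal sketch of late fusion: two feature vectors are concatenated, passed through one linear layer, and turned into class probabilities with a softmax. This is not the authors' architecture. The label set, feature sizes, input values, and random weights are all assumptions chosen for illustration, and the helper names (`make_linear`, `predict_addressee`) are hypothetical.

```python
import math
import random

# Hypothetical label set; the actual ARVSU addressee classes are not given here.
ADDRESSEE_CLASSES = ["system", "another_person", "nobody"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(in_dim, out_dim, seed=0):
    """Random weights standing in for trained parameters (assumption)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def predict_addressee(gaze_features, utterance_features, weights, bias):
    """Concatenate both modality vectors, apply one linear layer, then softmax."""
    fused = list(gaze_features) + list(utterance_features)
    if len(fused) != len(weights[0]):
        raise ValueError("fused feature size does not match the weight matrix")
    scores = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ADDRESSEE_CLASSES[best], probs


# Made-up inputs: a gaze descriptor and a pooled utterance embedding.
gaze = [0.2, -0.4, 0.9]
utterance = [0.1, 0.3, -0.2, 0.5]
W, b = make_linear(in_dim=len(gaze) + len(utterance), out_dim=len(ADDRESSEE_CLASSES))
label, probs = predict_addressee(gaze, utterance, W, b)
print(label, [round(p, 3) for p in probs])
```

With untrained random weights the predicted label carries no meaning; the sketch only shows how the two modalities can feed a single classifier. In a real system, the feature vectors would come from learned encoders (for example, an image network for gaze and a text network for the transcript).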
