Multimodal Fusion Using Deep Learning Applied to Driver's Referencing of Outside-Vehicle Objects

Abdul Rafey Aftab; Michael von der Beeck; Steven Rohrhirsch; Benoit Diotte; Michael Feld

arXiv:2107.12167 · cs.HC · July 27, 2021

TL;DR

This paper presents a deep learning-based multimodal fusion approach for accurately referencing outside-vehicle objects using gaze, head pose, and finger pointing, addressing modality limitations and vehicle pose variations.

Contribution

It introduces a novel multimodal fusion network that combines gaze, head pose, and finger pointing data for improved object referencing in cars, highlighting the importance of multimodal sensing.

Findings

1. Multimodal fusion improves object referencing accuracy.
2. Vehicle pose significantly affects user behavior recognition.
3. Adding multiple modalities overcomes individual modality limitations.

Abstract

There is growing interest in more intelligent, natural user interaction with the car. Hand gestures and speech are already being applied for driver-car interaction. Moreover, multimodal approaches are also showing promise in the automotive industry. In this paper, we utilize deep learning for a multimodal fusion network for referencing objects outside the vehicle. We use features from gaze, head pose, and finger pointing simultaneously to precisely predict the referenced objects in different car poses. We demonstrate the practical limitations of each modality when used for a natural form of referencing, specifically inside the car. As evident from our results, we overcome the modality-specific limitations, to a large extent, by adding other modalities. This work highlights the importance of multimodal sensing, especially when moving towards natural user interaction.…
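
As a rough illustration of what fusing gaze, head pose, and finger pointing features can look like, the sketch below concatenates three per-modality feature vectors and passes them through a small fully connected network. It is not the authors' architecture: the class name `FusionSketch`, the feature dimensions, the hidden size, the single scalar output, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def dense(x, weights, bias):
    """Apply one fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Random, untrained weights (illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionSketch:
    """Hypothetical feature-level fusion: concatenate, then a small MLP."""

    def __init__(self, gaze_dim=3, head_dim=3, finger_dim=3, hidden=8, seed=0):
        rng = random.Random(seed)
        self.dims = (gaze_dim, head_dim, finger_dim)
        self.l1 = init_layer(sum(self.dims), hidden, rng)
        # Assumption: a single scalar output, e.g. a referencing angle.
        self.l2 = init_layer(hidden, 1, rng)

    def predict(self, gaze, head, finger):
        for vec, dim in zip((gaze, head, finger), self.dims):
            if len(vec) != dim:
                raise ValueError("unexpected feature length")
        fused = list(gaze) + list(head) + list(finger)  # fusion by concatenation
        hidden = relu(dense(fused, *self.l1))
        return dense(hidden, *self.l2)[0]


model = FusionSketch()
angle = model.predict(gaze=[0.1, 0.0, 0.9],
                      head=[0.2, 0.0, 0.8],
                      finger=[0.3, 0.1, 0.7])
```

Because the weights are untrained, the returned value carries no meaning; the point is only that all three modalities feed one shared predictor, so a weak signal in one modality can in principle be compensated by the others after training.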
