Towards Multimodal Understanding of Passenger-Vehicle Interactions in Autonomous Vehicles: Intent/Slot Recognition Utilizing Audio-Visual Data
Eda Okur, Shachi H Kumar, Saurav Sahay, Lama Nachman

TL;DR
This paper explores multimodal in-cabin data from autonomous vehicles, combining audio, visual, and language inputs to better recognize passenger intents and slots and to support more natural passenger-vehicle interaction.
Contribution
It introduces a multimodal approach integrating audio, visual, and language data for intent recognition in autonomous vehicle interactions, demonstrating improved accuracy over text-only methods.
Findings
Multimodal models outperform text-only baselines.
Enhanced intent detection accuracy with combined modalities.
Effective integration of visual and acoustic cues improves understanding.
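As a rough illustration of what "combining modalities" can mean in practice, the sketch below concatenates per-modality feature vectors (early fusion) and scores them with a linear softmax classifier. This is not the authors' model: the function names, the toy feature values, the weights, and the intent labels `set_destination` and `stop` are all hypothetical and chosen only to make the example runnable.

```python
import math

def fuse_features(text_vec, audio_vec, video_vec):
    """Early fusion: concatenate the per-modality feature vectors."""
    return list(text_vec) + list(audio_vec) + list(video_vec)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(features, weights, biases, labels):
    """Linear classifier over the fused features; returns (label, probabilities)."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy inputs (hypothetical): 2 text features, 1 audio feature, 2 video features.
fused = fuse_features([1.0, 0.0], [0.5], [0.0, 1.0])

# Hypothetical weights: one row per intent, one column per fused feature.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # "set_destination" relies on a text feature
    [0.0, 0.0, 0.0, 0.0, 2.0],  # "stop" relies on a video feature
]
biases = [0.0, 0.0]
labels = ["set_destination", "stop"]

label, probs = predict_intent(fused, weights, biases, labels)
print(label, probs)
```

In this toy setup the video feature carries the deciding evidence, which a text-only classifier restricted to the first two features would never see.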
Abstract
Understanding passenger intents from spoken interactions and the car's vision (both inside and outside the vehicle) is an important building block towards developing contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly, considering the three available modalities (language/text, audio, video), and trigger the appropriate functionality of the AV system. We had collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game. In our previous explorations, we experimented with various RNN-based models to…
Taxonomy
Topics: Speech Recognition and Synthesis · Autonomous Vehicle Technology and Safety · Video Surveillance and Tracking Methods
