Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Ross Greer; Laura Fleig; Maitrayee Keskar; Erika Maquiling; Giovanni Tapia Lopez; Angel Martinez-Sanchez; Parthib Roy; Jake Rattigan; Mira Sur; Alejandra Vidrio; Thomas Marcotte; and Mohan Trivedi

arXiv:2602.07668 · cs.CV · May 13, 2026

TL;DR

This paper extends the looking-in-looking-out (LILO) framework with audio signals, forming the L-LIO framework, which aims to improve driver safety assessment and environment understanding through multimodal sensor fusion.

Contribution

The paper introduces the L-LIO framework, which integrates the audio modality into existing vision-based systems to improve driver and scene understanding in autonomous vehicles.

Findings

01. Audio provides safety-relevant insights in nuanced scenarios.

02. Audio can disambiguate external guidance and gestures.

03. Pilot results show benefits of multimodal sensor fusion.

Abstract

The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio…
