MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
Ruoyu Wang, Wenqian Wang, Jianjun Gao, Dan Lin, Kim-Hui Yap, Bingbing Li

TL;DR
MultiFuser is a multimodal fusion transformer that integrates car cabin videos from different camera types (such as IR and depth) to improve driver action recognition, particularly in dark cabin environments, by modeling cross-modal interrelations and adaptively combining the modalities.
Contribution
The paper introduces MultiFuser, a transformer built from layers of Bi-decomposed Modules that model spatiotemporal features, together with a modality synthesizer that integrates the multimodal features, for driver action recognition in dark car cabin environments.
Findings
Outperforms existing methods on the Drive&Act dataset
Effectively fuses multimodal data for robust recognition
Demonstrates improved accuracy in dark and gloomy conditions
Abstract
Driver action recognition, aiming to accurately identify drivers' behaviours, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, drivers' environments are often challenging, being gloomy and dark, and with the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. Therefore, in this paper, we propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal features integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for…
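To make the idea of adaptively integrating modalities concrete, here is a minimal sketch of one common fusion pattern: a softmax-weighted sum of per-modality feature vectors. This is not the paper's modality synthesizer, whose details are not given in the abstract above. The function names, the modality names, and the relevance scores are all hypothetical; in a trained model the scores would be produced by learned layers rather than supplied by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, relevance):
    """Fuse per-modality feature vectors with softmax weights.

    features:  dict of modality name -> list of floats (equal lengths)
    relevance: dict of modality name -> float score (hypothetical input;
               a real model would learn these scores)
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(features)
    dims = {len(features[name]) for name in names}
    if len(dims) != 1:
        raise ValueError("all modality features must share one dimension")

    weights = softmax([relevance[name] for name in names])
    fused = [0.0] * dims.pop()
    for weight, name in zip(weights, names):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Illustrative values only: in a dark cabin, an RGB stream might be
# scored lower than IR or depth.
features = {"rgb": [0.1, 0.2], "ir": [0.9, 0.8], "depth": [0.5, 0.5]}
relevance = {"rgb": -1.0, "ir": 1.0, "depth": 0.0}
fused, weights = fuse_modalities(features, relevance)
print(weights)
print(fused)
```

With equal relevance scores the fusion reduces to a plain average of the modalities; unequal scores shift the fused vector toward the higher-scored modality.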
