MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Ruoyu Wang; Wenqian Wang; Jianjun Gao; Dan Lin; Kim-Hui Yap; Bingbing Li

arXiv:2408.01766·cs.CV·August 20, 2024

TL;DR

MultiFuser is a multimodal fusion transformer that integrates car cabin video from multiple sensors, such as IR and depth cameras, to improve driver action recognition, especially in dark conditions, by modeling interactions across modalities.

Contribution

The paper introduces MultiFuser, a transformer-based model that stacks Bi-decomposed Modules to model spatiotemporal features and uses a modality synthesizer to integrate features across modalities.

Findings

01. Outperforms existing methods on Drive&Act dataset
02. Effectively fuses multimodal data for robust recognition
03. Demonstrates improved accuracy in dark and gloomy conditions

Abstract

Driver action recognition, aiming to accurately identify drivers' behaviours, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, drivers' environments are often challenging, being gloomy and dark, and with the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. Therefore, in this paper, we propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal features integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for…
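The abstract says the model "adaptively integrates different modalities", but the text available here is truncated before it describes how. As a rough illustration of the general idea only, the sketch below uses a generic softmax-gated weighted sum over per-modality feature vectors. It is not the paper's modality synthesizer; the function names, the gate vectors, and the modality keys (`ir`, `depth`, `nir`) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, gate_weights):
    """Generic gated fusion (an assumption, not the paper's method).

    features:     dict of modality name -> feature vector (equal lengths)
    gate_weights: dict of modality name -> gate vector (same length);
                  stands in for parameters that would normally be learned
    Returns the fused vector and the per-modality mixing weights.
    """
    names = sorted(features)
    dim = len(features[names[0]])
    # Score each modality as the dot product of its features and its gate.
    scores = [
        sum(f * g for f, g in zip(features[n], gate_weights[n])) for n in names
    ]
    weights = softmax(scores)
    # Fused feature = convex combination of the modality features.
    fused = [
        sum(weights[i] * features[n][k] for i, n in enumerate(names))
        for k in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Hypothetical per-modality features for one video clip.
clip_features = {
    "ir": [0.2, 0.9, 0.1],
    "depth": [0.4, 0.3, 0.8],
    "nir": [0.5, 0.5, 0.5],
}
# Hypothetical gate vectors; a trained model would learn these.
gates = {
    "ir": [1.0, 1.0, 0.0],
    "depth": [0.0, 0.5, 0.5],
    "nir": [0.1, 0.1, 0.1],
}
fused_vector, mixing = fuse_modalities(clip_features, gates)
print(mixing)
```

Because the mixing weights come from a softmax, they are positive and sum to one, so a modality whose features score poorly (for example, a washed-out view in a dark cabin) contributes less to the fused vector without being dropped entirely.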
