TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

Wenzhuo Liu; Yicheng Qiao; Zhen Wang; Qiannan Guo; Zilong Chen; Meihua Zhou; Xinran Li; Letian Wang; Zhiwei Li; Huaping Liu; Wenshuo Wang

arXiv:2506.18084·cs.CV·June 24, 2025

TL;DR

TEM^3-Learning introduces a time-efficient, multimodal multi-task framework for assistive driving that jointly recognizes driver emotions, behaviors, traffic context, and vehicle actions with high accuracy and real-time performance.

Contribution

The paper presents a novel two-stage architecture that combines efficient feature extraction with adaptive multimodal integration for multi-task assistive driving, addressing the single-modality constraints and computational inefficiency of existing methods.

Findings

01. Achieves state-of-the-art accuracy on the AIDE dataset for all four tasks.
02. Maintains a lightweight model with fewer than 6 million parameters.
03. Delivers an inference speed of 142.32 FPS, enabling real-time deployment.
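
To put the reported figures in perspective, the short calculation below converts them into a per-inference latency and a rough weight-storage size. It is a back-of-the-envelope sketch, not a measurement: the 30 Hz camera rate and the 32-bit float weight format are assumptions made for illustration, and 6 million is used as an upper bound because the page only says "fewer than 6 million".

```python
# Figures reported on this page.
fps = 142.32                 # reported inference speed
params_upper = 6_000_000     # "fewer than 6 million parameters" (upper bound)

# Assumptions for illustration only (not stated in the source).
camera_hz = 30.0             # hypothetical camera frame rate
bytes_per_param = 4          # hypothetical fp32 weight storage

latency_ms = 1000.0 / fps                    # time per inference
frame_budget_ms = 1000.0 / camera_hz         # time between camera frames
weights_mb = params_upper * bytes_per_param / 1e6

print(f"latency per inference: {latency_ms:.2f} ms")
print(f"frame budget at {camera_hz:.0f} Hz: {frame_budget_ms:.2f} ms")
print(f"weight storage upper bound: {weights_mb:.1f} MB")
```

Under these assumptions, one inference takes roughly 7 ms, well inside the roughly 33 ms between frames of a 30 Hz camera, and the weights occupy at most about 24 MB.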

Abstract

Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second…
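
The forward-backward temporal scanning idea named in the abstract can be illustrated with a toy example. The sketch below is not the MTS-Mamba implementation: it swaps in a fixed scalar linear recurrence for Mamba's selective state-space scan, treats each frame as a single number rather than a feature map, and fuses the two directions by averaging. The function names, the `decay` parameter, and the fusion rule are all hypothetical choices made for illustration.

```python
from typing import List


def scan(xs: List[float], decay: float) -> List[float]:
    """Causal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def forward_backward_scan(xs: List[float], decay: float = 0.5) -> List[float]:
    """Scan the sequence in both temporal directions and average the results.

    Assumption: averaging is a stand-in fusion rule, not the paper's.
    """
    fwd = scan(xs, decay)                 # each step sees past frames only
    bwd = scan(xs[::-1], decay)[::-1]     # each step sees future frames only
    return [(f + b) / 2.0 for f, b in zip(fwd, bwd)]


# One "event" in the middle of a five-frame clip.
frames = [0.0, 0.0, 1.0, 0.0, 0.0]
forward_only = scan(frames, 0.5)
fused = forward_backward_scan(frames, 0.5)

print("forward only:", forward_only)
print("forward + backward:", fused)
```

In the forward-only output, the frames before the event carry no trace of it, because a causal scan never looks ahead. After fusing both directions, every frame reflects context from both its past and its future, which is the motivation for scanning a clip in both directions.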
