Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

Tianyi Liu; Yiming Li; Wenqian Wang; Jiaojiao Wang; Chen Cai; Yi Wang; Kim-Hui Yap

arXiv:2604.05947·cs.CV·April 8, 2026


TL;DR

This paper introduces a Mixture-of-Modality-Experts (MoME) framework with Holistic Token Learning (HTL), an adaptive multimodal learning approach that improves fine-grained driver action recognition through better expert collaboration and interpretability.

Contribution

The paper proposes a flexible MoME framework with HTL that dynamically adapts modality collaboration and enhances intra- and inter-expert learning for multimodal tasks.

Findings

01. Outperforms single-modal and existing multimodal baselines on driver action recognition.

02. HTL improves subtle multimodal understanding and interpretability.

03. Ablation studies confirm the effectiveness of the proposed strategies.

Abstract

Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed…
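To make the idea of adaptive collaboration among modality-specific experts concrete, here is a minimal sketch of generic softmax-gated fusion over per-modality class logits. It is not the authors' implementation: the function names, the modality labels (`rgb`, `ir`), and the assumption that gate scores arrive as plain per-modality numbers are all hypothetical, and the abstract does not specify how MoME computes its gating or how HTL tokens enter the fusion.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_logits(expert_logits, gate_scores):
    """Fuse per-modality class logits with softmax gate weights.

    expert_logits: dict mapping a modality name to its list of class logits
                   (hypothetical stand-in for each expert's output).
    gate_scores:   dict mapping a modality name to an unnormalized score
                   (hypothetical; a real gate would derive these from the input).
    Returns (weights, fused), where weights maps each modality to its
    normalized gate weight and fused is the weighted sum of logits.
    """
    modalities = sorted(expert_logits)
    weights = softmax([gate_scores[m] for m in modalities])
    num_classes = len(expert_logits[modalities[0]])
    fused = [0.0] * num_classes
    for weight, modality in zip(weights, modalities):
        for cls, logit in enumerate(expert_logits[modality]):
            fused[cls] += weight * logit
    return dict(zip(modalities, weights)), fused


# Hypothetical example: two experts that disagree on a two-class problem.
logits = {"rgb": [2.0, 0.0], "ir": [0.0, 2.0]}
weights, fused = fuse_expert_logits(logits, {"rgb": 1.5, "ir": 0.0})
```

Because the gate scores are supplied per input, the weight given to each expert can change from sample to sample, which is the property that distinguishes this kind of fusion from a fixed fusion module.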
