WDMIR: Wavelet-Driven Multimodal Intent Recognition
Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu

TL;DR
This paper introduces a wavelet-based multimodal intent recognition framework that improves understanding of user intentions by analyzing non-verbal cues in the frequency domain, achieving state-of-the-art accuracy.
Contribution
The novel wavelet-driven fusion module and cross-modal interaction mechanism enhance multimodal intent recognition by effectively integrating verbal and non-verbal information in the frequency domain.
Findings
Achieves 1.13% higher accuracy than previous methods.
Wavelet-driven fusion significantly improves semantic extraction from non-verbal cues.
Ablation studies confirm the effectiveness of the proposed modules.
Abstract
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. Specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and…
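To make the idea of synchronized decomposition and integration of video-audio features more concrete, the following is a minimal sketch and not the authors' implementation. It assumes both modalities are already aligned to the same number of time steps and feature dimensions, uses a one-level Haar wavelet along the time axis, and mixes the sub-bands with fixed weights. The function names (haar_dwt, haar_idwt, wavelet_fuse) and the weights w_low and w_high are hypothetical; the module described in the abstract may well learn its fusion instead.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT along time (axis 0). x has shape (T, D), T even."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation (slow dynamics)
    high = (even - odd) / np.sqrt(2.0)  # detail (fast dynamics)
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuild the (T, D) sequence from its sub-bands."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty((2 * low.shape[0], low.shape[1]), dtype=low.dtype)
    out[0::2] = even
    out[1::2] = odd
    return out

def wavelet_fuse(video, audio, w_low=0.5, w_high=0.5):
    """Decompose both modalities the same way, mix per sub-band, reconstruct.

    Assumption: fixed scalar weights stand in for whatever fusion the
    paper's module actually performs.
    """
    v_low, v_high = haar_dwt(video)
    a_low, a_high = haar_dwt(audio)
    fused_low = w_low * v_low + (1.0 - w_low) * a_low
    fused_high = w_high * v_high + (1.0 - w_high) * a_high
    return haar_idwt(fused_low, fused_high)

# Toy features: 8 time steps, 4 feature dimensions per modality.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4))
audio = rng.standard_normal((8, 4))
fused = wavelet_fuse(video, audio)
```

Giving the low- and high-frequency bands separate weights is what distinguishes this from a plain time-domain average: with equal weights the result reduces to the mean of the two inputs, while unequal weights let one modality dominate slow trends and the other dominate rapid changes.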
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Action Observation and Synchronization
