Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition
Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao

TL;DR
This paper introduces TCL-MAP, a novel multimodal intent recognition framework that uses token-level contrastive learning and modality-aware prompting to better align and fuse features from text, video, and audio modalities, significantly improving performance.
Contribution
The paper proposes a new token-level contrastive learning method with modality-aware prompting for enhanced multimodal intent recognition, effectively aligning features across modalities and guiding learning with intent labels.
Findings
Achieves significant improvements over state-of-the-art methods.
Demonstrates the effectiveness of modality-aware prompts over handcrafted prompts.
Shows the superiority of the proposed TCL framework through extensive experiments.
Abstract
Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend the user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and are limited in how effectively they learn semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from text, video and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the…
Taxonomy
Topics: Emotion and Mood Recognition · Speech and dialogue systems · Multimodal Machine Learning Applications
Methods: Normalized Temperature-scaled Cross Entropy Loss · Contrastive Learning
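
For readers unfamiliar with the normalized temperature-scaled cross-entropy (NT-Xent) loss listed under Methods, the following is a minimal sketch of the generic loss in plain Python. It is not the paper's token-level formulation: the function names `cosine` and `nt_xent`, the two-view batch layout, and the default temperature of 0.5 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Generic NT-Xent loss over paired embeddings (illustrative only).

    view_a[i] and view_b[i] are treated as a positive pair; every other
    embedding in the combined batch acts as a negative for that anchor.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)  # combined batch of 2n embeddings
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of anchor i
        # Temperature-scaled similarities to every embedding except the anchor.
        logits = [
            cosine(z[i], z[k]) / temperature
            for k in range(2 * n)
            if k != i
        ]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit  # equals -log softmax(positive)
    return total / (2 * n)


# Usage: matched pairs should score a lower loss than mismatched pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Lowering the temperature sharpens the softmax, so hard negatives are penalized more strongly; the value used here is a placeholder rather than a setting reported for TCL-MAP.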
