Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Qianrui Zhou; Hua Xu; Hao Li; Hanlei Zhang; Xiaohan Zhang; Yifan Wang; Kai Gao

arXiv:2312.14667·cs.MM·June 7, 2024·1 cites

TL;DR

This paper introduces TCL-MAP, a novel multimodal intent recognition framework that uses token-level contrastive learning and modality-aware prompting to better align and fuse features from text, video, and audio modalities, significantly improving performance.

Contribution

The paper proposes a new token-level contrastive learning method with modality-aware prompting for enhanced multimodal intent recognition, effectively aligning features across modalities and guiding learning with intent labels.

Findings

01 Achieves significant improvements over state-of-the-art methods.

02 Demonstrates the effectiveness of modality-aware prompts over handcrafted prompts.

03 Shows the superiority of the proposed TCL framework through extensive experiments.

Abstract

Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend the user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from text, video and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the…
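To make the phrase "cross-modality attention" concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which text tokens act as queries over video (or audio) features. It is an illustration under stated assumptions, not the MAP module itself: the function names `softmax` and `cross_attention`, the toy 2-dimensional features, and the absence of learned projection matrices are all simplifications introduced here, and the abstract does not specify how MAP actually wires its attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (no learned projections).

    queries: features of one modality (here, hypothetical text tokens)
    keys, values: features of another modality (here, hypothetical video frames)
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


# Toy inputs (assumed for illustration only).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_tokens = cross_attention(text_tokens, video_frames, video_frames)
```

Each fused vector is a convex combination of the video features, weighted toward the frames most similar to the corresponding text token.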

Code & Models

Repositories

thuiar/TCL-MAP
PyTorch · Official

Videos

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition· underline

Taxonomy

Topics: Emotion and Mood Recognition · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Normalized Temperature-scaled Cross Entropy Loss · Contrastive Learning
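For readers unfamiliar with the loss named above, the following is a minimal sketch of the standard NT-Xent (normalized temperature-scaled cross entropy) loss over paired embeddings. It shows the generic formulation only: the function names `cosine` and `nt_xent`, the temperature of 0.5, and the toy vectors are assumptions made for illustration, and how TCL-MAP selects its token-level positive and negative pairs is not described in the text on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z_a, z_b, temperature=0.5):
    """Generic NT-Xent loss for N positive pairs (z_a[i], z_b[i]).

    Every other embedding in the combined batch of 2N serves as a negative.
    """
    n = len(z_a)
    z = z_a + z_b  # 2N embeddings; the partner of i is (i + n) mod 2N
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)
        positive = cosine(z[i], z[j]) / temperature
        # Similarities to all embeddings except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Toy embeddings (assumed): aligned pairs should score a lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(anchors, [[1.0, 0.0], [0.0, 1.0]])
loss_misaligned = nt_xent(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

Lowering the temperature sharpens the softmax, so hard negatives contribute more to the loss.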