SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Shijue Huang, Libo Qin, Bingbing Wang, Geng Tu, Ruifeng Xu

TL;DR
This paper presents SDIF-DA, a framework that improves multi-modal intent detection by progressively aligning features across modalities and augmenting training data using ChatGPT, achieving state-of-the-art results.
Contribution
The paper introduces a novel shallow-to-deep interaction framework combined with ChatGPT-based data augmentation for enhanced multi-modal intent detection.
Findings
Achieves state-of-the-art performance in multi-modal intent detection.
Effectively aligns and fuses features across text, video, and audio modalities.
Data augmentation distills knowledge from large language models.
Abstract
Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse features from different modalities and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach to automatically generate additional training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications
Methods: ALIGN
