SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection

Shijue Huang; Libo Qin; Bingbing Wang; Geng Tu; Ruifeng Xu

arXiv:2401.00424·cs.CL·January 2, 2024·2 cites

TL;DR

This paper presents SDIF-DA, a framework that improves multi-modal intent detection by progressively aligning features across modalities and augmenting training data using ChatGPT, achieving state-of-the-art results.

Contribution

The paper introduces a novel shallow-to-deep interaction framework combined with ChatGPT-based data augmentation for enhanced multi-modal intent detection.

Findings

1. Achieves state-of-the-art performance in multi-modal intent detection.
2. Effectively aligns and fuses features across text, video, and audio modalities.
3. ChatGPT-based data augmentation distills knowledge from a large language model into additional training data.

Abstract

Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) effectively aligning and fusing features from different modalities and (2) the scarcity of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach to automatically generate additional training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving…
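
The progressive align-then-fuse idea in the abstract can be illustrated with a small sketch. This is not the authors' architecture (the official repository is the reference for that). It only assumes that a shallow stage lets text features attend to video and audio separately, and that a deep stage then attends jointly over all of the results. The function names (`attend`, `shallow_to_deep`), the parameter-free attention, the residual additions, and the mean pooling are all illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned weights.

    Keys double as values (an assumption made to keep the sketch short).
    All vectors must share the same dimension.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def add(left, right):
    """Element-wise sum of two equally shaped sequences (residual connection)."""
    return [[x + y for x, y in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]


def shallow_to_deep(text, video, audio):
    """Hypothetical two-stage fusion: pairwise alignment, then joint attention."""
    # Shallow stage (assumed): text attends to each other modality on its own.
    text_video = add(text, attend(text, video))
    text_audio = add(text, attend(text, audio))
    # Deep stage (assumed): self-attention over all aligned sequences together.
    joint = text + text_video + text_audio
    fused = add(joint, attend(joint, joint))
    # Mean-pool the fused sequence into one vector for a downstream classifier.
    dim = len(fused[0])
    return [sum(row[d] for row in fused) / len(fused) for d in range(dim)]


if __name__ == "__main__":
    # Toy features: 2 text tokens, 1 video frame, 2 audio frames, dimension 2.
    text = [[1.0, 0.0], [0.0, 1.0]]
    video = [[0.5, 0.5]]
    audio = [[1.0, 1.0], [0.0, 0.0]]
    print(shallow_to_deep(text, video, audio))
```

A real model would use learned projections and a trained classifier on top of the pooled vector; the sketch only shows the order of operations, with pairwise alignment before joint fusion.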

Code & Models

Repositories

joeying1019/sdif-da (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications

Methods: ALIGN