Cascaded Cross-Modal Transformer for Request and Complaint Detection
Nicolae-Catalin Ristea, Radu Tudor Ionescu

TL;DR
This paper introduces a cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations, reaching unweighted average recalls (UAR) of 65.41% for complaints and 85.87% for requests.
Contribution
The paper presents a novel cascaded cross-modal transformer model that integrates speech and text modalities for improved request and complaint detection.
Findings
Reached an unweighted average recall (UAR) of 65.41% for the complaint class.
Reached a UAR of 85.87% for the request class.
Combining speech and text is an effective approach for analyzing customer interactions.
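Since both results are reported as UAR, a minimal sketch of the metric may help: UAR is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The function name and the toy labels below are invented for illustration and are not taken from the paper or the challenge data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class weighs the same."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels (hypothetical): 4 "yes" samples, 2 "no" samples.
y_true = ["yes", "yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "yes", "no", "no", "yes"]

# Recall for "yes" is 3/4 and for "no" is 1/2, so UAR = (0.75 + 0.5) / 2.
print(unweighted_average_recall(y_true, y_pred))  # 0.625
```

On these toy labels plain accuracy would be 4/6, which is higher than the UAR because the larger class is classified better than the smaller one.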
Abstract
We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our approach leverages a multimodal paradigm by transcribing the speech using automatic speech recognition (ASR) models and translating the transcripts into different languages. Subsequently, we combine language-specific BERT-based models with Wav2Vec2.0 audio features in a novel cascaded cross-attention transformer model. We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
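The cascaded cross-attention idea can be illustrated with a small sketch. The abstract does not spell out the order of the cascade, so the code below assumes one plausible arrangement: tokens from one language's text attend to the other language's text, and the fused text tokens then attend to the audio features. It uses single-head attention without learned projections, and all names and toy vectors are hypothetical; it is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context.

    Assumption: no learned projections, so the context tokens act as both
    keys and values. A real model would learn separate Q/K/V matrices.
    """
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Each output token is a weighted average of the context tokens.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return output


def cascaded_fusion(text_lang_a, text_lang_b, audio):
    """Two-stage cascade (assumed order): text with text, then text with audio."""
    fused_text = cross_attention(text_lang_a, text_lang_b)  # stage 1
    return cross_attention(fused_text, audio)               # stage 2


# Toy 3-dimensional token embeddings standing in for BERT and Wav2Vec2.0 features.
text_lang_a = [[0.1, 0.3, -0.2], [0.5, -0.1, 0.0]]
text_lang_b = [[0.2, 0.2, 0.1], [-0.3, 0.4, 0.6], [0.0, -0.5, 0.2]]
audio = [[0.7, 0.1, -0.4], [0.2, 0.9, 0.3], [-0.1, 0.0, 0.5], [0.4, -0.2, 0.8]]

fused = cascaded_fusion(text_lang_a, text_lang_b, audio)
print(len(fused), len(fused[0]))  # 2 3
```

The output keeps one vector per token of the first text stream, each now a mixture of audio frames weighted by how strongly the fused text token matches them.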
Taxonomy
Topics: Public Relations and Crisis Communication · Sentiment Analysis and Opinion Mining · Speech and dialogue systems
