DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, and Lihua Cai

TL;DR
DialogGraph-LLM introduces a novel graph-informed end-to-end framework combining multimodal foundation models and semi-supervised learning to improve speaker intent recognition in complex audio dialogues with limited labeled data.
Contribution
It presents a new Multi-Relational Dialogue Attention Network architecture integrated with foundation models and a confidence-aware semi-supervised learning strategy for audio dialogue intent recognition.
Findings
Outperforms strong audio and text baselines on proprietary and public datasets.
Demonstrates high accuracy and efficiency in real-world audio dialogue scenarios.
Effective semi-supervised learning reduces the need for extensive labeled data.
Abstract
Recognizing speaker intent in long audio dialogues among multiple speakers has a wide range of applications, but it is a non-trivial AI task due to complex inter-dependencies among speaker utterances and scarce annotated data. To address these challenges, this work proposes an end-to-end framework, DialogGraph-LLM. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is built on the LLM, with a confidence-aware pseudo-label generation mechanism that applies dual-threshold filtering on both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly…
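The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold values, and the rule for combining the two thresholds are all hypothetical. It assumes that a sample is accepted only when its top-class probability clears both a global threshold and a per-class threshold, and that, among accepted samples, higher predictive entropy is treated as "higher information" and ranked first.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_pseudo_labels(prob_rows, global_tau, class_tau, top_k):
    """Hypothetical dual-threshold filter followed by entropy ranking.

    prob_rows  : list of per-sample class-probability lists
    global_tau : single confidence threshold shared by all classes (assumed)
    class_tau  : list of per-class confidence thresholds (assumed)
    top_k      : maximum number of pseudo-labeled samples to return

    Returns a list of (sample_index, pseudo_label, confidence, entropy).
    """
    accepted = []
    for idx, probs in enumerate(prob_rows):
        # Predicted class is the arg-max of the probability vector.
        label = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[label]
        # Assumption: both thresholds must be met for the sample to pass.
        if conf >= global_tau and conf >= class_tau[label]:
            accepted.append((idx, label, conf, entropy(probs)))
    # Assumption: prioritize the most uncertain (highest-entropy) survivors.
    accepted.sort(key=lambda row: row[3], reverse=True)
    return accepted[:top_k]
```

Calling `select_pseudo_labels(probs, 0.7, [0.9, 0.85, 0.9], top_k=2)` on a batch of softmax outputs would return at most two pseudo-labeled samples. Per-class thresholds let a class that the model predicts less reliably demand a stricter confidence than the global floor, which is one plausible reading of "global and class confidences".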
Taxonomy
Topics: Emotion and Mood Recognition · Speech and dialogue systems · Speech Recognition and Synthesis
