Confirmation detection in human-agent interaction using non-lexical speech cues
Mara Brandt, Britta Wrede, Franz Kummert, Lars Schillingmann

TL;DR
This paper presents a system that detects non-lexical confirmations such as 'mhm' in human-agent interaction using acoustic features and a Support Vector Machine; stacked formants give the highest accuracy.
Contribution
It introduces a novel approach using stacked formants for confirmation detection, outperforming other acoustic features in accuracy.
Findings
Stacked formants achieve 84% accuracy in confirmation detection.
Stacked formants outperform MFCC and pitch features.
Effective online classification in diverse user groups.
Abstract
Even if only the acoustic channel is considered, human communication is highly multi-modal. Non-lexical cues provide a variety of information such as emotion or agreement. The ability to process such cues is highly relevant for spoken dialog systems, especially in assistance systems. In this paper we focus on the recognition of non-lexical confirmations such as "mhm", as they enhance the system's ability to accurately interpret human intent in natural communication. The architecture uses a Support Vector Machine to detect confirmations based on acoustic features. In a systematic comparison, several feature sets were evaluated for their performance on a corpus of human-agent interaction in a setting with naive users, including elderly and cognitively impaired people. Our results show that using stacked formants as features yields an accuracy of 84%, outperforming regular formants and MFCC…
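The abstract does not spell out how stacked formants are built. As a minimal sketch, assuming that "stacked" means concatenating the formant values of each frame with those of its neighbouring frames, the feature construction could look like the following. The function name `stack_frames`, the `context` parameter, the edge padding, and the formant values are all hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from typing import List, Sequence


def stack_frames(frames: Sequence[Sequence[float]], context: int) -> List[List[float]]:
    """Concatenate each frame with `context` neighbours on both sides.

    Assumption: edge frames are padded by repeating the first or last frame.
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    stacked = []
    for i in range(n):
        vec: List[float] = []
        for offset in range(-context, context + 1):
            j = min(max(i + offset, 0), n - 1)  # clamp the index at the edges
            vec.extend(frames[j])
        stacked.append(vec)
    return stacked


# Hypothetical per-frame formant estimates (F1, F2) in Hz; the values are made up.
formants = [[300.0, 1200.0], [310.0, 1190.0], [305.0, 1210.0]]
stacked = stack_frames(formants, context=1)
```

Each stacked vector here has three times the dimensionality of a single frame, so it carries some of the temporal shape of the sound rather than one snapshot. Vectors of this kind could then be passed to any SVM implementation as training examples.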
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
