Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice
Vikramjit Mitra, Sue Booker, Erik Marchi, David Scott Farrar, Ute Dorothea Peitz, Bridget Cheng, Ermine Teves, Anuj Mehta, Devang Naik

TL;DR
This paper explores using acoustic and paralinguistic embeddings to detect vocal expression in voice queries, improving intent understanding in digital assistants by capturing how something is said.
Contribution
It introduces a novel approach combining acoustic and emotion embeddings for expression detection, showing significant error rate reductions over traditional lexical methods.
Findings
Acoustic and paralinguistic cues significantly improve expression detection.
Emotion embedding reduces error rate by 30%.
The proposed method yields a 60% relative decrease in equal error rate (EER) compared to a bag-of-words system.
Abstract
Millions of people reach out to digital assistants such as Siri every day, asking for information, making phone calls, seeking assistance, and much more. The expectation is that such assistants should understand the intent of the user's query. Detecting the intent of a query from a short, isolated utterance is a difficult task. Intent cannot always be obtained from speech-recognized transcriptions. A transcription-driven approach can interpret what has been said but fails to acknowledge how it has been said and, as a consequence, may ignore the expression present in the voice. Our work investigates whether a system can reliably detect vocal expression in queries using acoustic and paralinguistic embedding. Results show that the proposed method offers a relative equal error rate (EER) decrease of 60% compared to a bag-of-words-based system, corroborating that expression is significantly…
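As a rough illustration of the metric quoted above, the sketch below computes an equal error rate from detector scores and then a relative decrease between two systems. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_decrease`, the threshold sweep, and the example EER values are all assumptions made for illustration, and the sketch assumes both classes are present in the labels.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest to each other.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        far = np.mean(predicted[~labels])   # negatives accepted as positive
        frr = np.mean(~predicted[labels])   # positives rejected as negative
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


def relative_decrease(baseline_eer, proposed_eer):
    """Relative EER decrease of a proposed system over a baseline."""
    return (baseline_eer - proposed_eer) / baseline_eer


# Hypothetical EER values, chosen only to show the arithmetic;
# they are not the paper's reported numbers.
example = relative_decrease(0.10, 0.04)
```

With the hypothetical values above, a drop from 10% to 4% EER is a relative decrease of about 0.6, which is the form in which the 60% figure is stated.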
