Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Ognjen (Oggi) Rudovic; Pranay Dighe; Yi Su; Vineet Garg; Sameer Dharur; Xiaochuan Niu; Ahmed H. Abdelaziz; Saurabh Adya; Ahmed Tewfik

arXiv:2411.00023·eess.AS·November 6, 2024

Open Access

TL;DR

This paper investigates using large language models to improve device-directed speech detection in follow-up conversations with virtual assistants, significantly reducing false alarms by leveraging context and ASR uncertainty.

Contribution

It introduces a novel approach of applying LLM prompting and adaptation for DDSD, incorporating ASR uncertainty to enhance detection accuracy in follow-up queries.

Findings

1. 20-40% reduction in false alarms at a fixed 10% false-reject rate
2. Improved detection accuracy with context modeling
3. Effective use of ASR uncertainty in prompts
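
The false-alarm figure above is reported at a fixed false-reject operating point (10%, per the abstract). The following is a minimal sketch of how such an operating point could be computed from classifier scores. The function name, the score convention (higher means more likely device-directed), and the thresholding rule are assumptions for illustration, not the paper's evaluation code.

```python
import math


def false_alarm_rate_at_fixed_false_reject(scores, labels, target_frr=0.10):
    """Return (false_alarm_rate, threshold) at a false-reject rate <= target_frr.

    scores: detector scores, higher = more likely device-directed (assumed).
    labels: True for device-directed utterances, False otherwise.
    An utterance is accepted when its score is >= threshold.
    """
    directed = sorted(s for s, y in zip(scores, labels) if y)
    undirected = [s for s, y in zip(scores, labels) if not y]
    if not directed or not undirected:
        raise ValueError("need at least one example of each class")

    # We may reject at most k device-directed utterances.
    k = math.floor(target_frr * len(directed))
    # Thresholding at the k-th smallest directed score rejects at most
    # k directed utterances (those scoring strictly below it).
    threshold = directed[min(k, len(directed) - 1)]

    # False alarms: non-directed utterances that are still accepted.
    false_alarms = sum(1 for s in undirected if s >= threshold)
    return false_alarms / len(undirected), threshold
```

A relative reduction such as the one reported would then be `(baseline_far - new_far) / baseline_far`, with both rates measured at the same false-reject target.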

Abstract

Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on the real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR…
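
As a rough illustration of the prompting idea in the abstract, the sketch below builds a text prompt that pairs the first query with several ASR hypotheses for the follow-up, so that recognition uncertainty is visible to the LLM. Everything here is an assumption for illustration: the `AsrHypothesis` type, the per-hypothesis confidence values, the prompt wording, and the `llm` callable (a stand-in for any text-in, text-out model) do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AsrHypothesis:
    """One ASR hypothesis for an utterance (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]


def build_ddsd_prompt(first_query: str,
                      followup_nbest: Sequence[AsrHypothesis]) -> str:
    """Compose a prompt from the first query and the follow-up's n-best list."""
    lines = [
        "A user is talking to a virtual assistant.",
        f'First query (directed at the assistant): "{first_query}"',
        "ASR hypotheses for the follow-up utterance, with confidence:",
    ]
    # List the most confident hypothesis first.
    ranked = sorted(followup_nbest, key=lambda h: h.confidence, reverse=True)
    for rank, hyp in enumerate(ranked, start=1):
        lines.append(f'{rank}. "{hyp.text}" (confidence {hyp.confidence:.2f})')
    lines.append("Is the follow-up directed at the assistant? Answer yes or no.")
    return "\n".join(lines)


def is_device_directed(first_query: str,
                       followup_nbest: Sequence[AsrHypothesis],
                       llm: Callable[[str], str]) -> bool:
    """Ask the supplied model and read a yes/no answer from its reply."""
    answer = llm(build_ddsd_prompt(first_query, followup_nbest))
    return answer.strip().lower().startswith("yes")
```

For example, calling `is_device_directed("what's the weather today", [AsrHypothesis("and tomorrow", 0.7), AsrHypothesis("and to borrow", 0.2)], llm=my_model)` would send one prompt containing both hypotheses, where `my_model` is whatever callable wraps the chosen LLM. The abstract's alternative, adapting a binary classifier on top of the LLM, is not shown here.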

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis