Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

TL;DR
This paper presents a resource-efficient multimodal system that detects device-directed speech without a trigger phrase. It feeds acoustic representations and speech recognition signals to a large language model and needs only a small amount of training data.
Contribution
It introduces a novel multimodal approach utilizing low-rank adaptation and prefix tuning for resource-efficient device-directed speech detection with large language models.
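As a rough illustration of the low-rank adaptation idea named above, the sketch below wraps a frozen weight matrix with a trainable low-rank update, so the output is W x + (alpha / r) B A x. It is a minimal pure-Python sketch, not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the initialization scale, and the toy dimensions are all assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during adaptation
        # A (rank x d_in) gets a small random init; B (d_out x rank) starts
        # at zero, so the adapted layer initially matches the frozen layer.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [h + self.scale * u for h, u in zip(base, update)]

    def trainable_params(self):
        """Count only the entries of A and B; the frozen weight is excluded."""
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)


# Toy 3x4 frozen weight (assumed values, for illustration only).
frozen = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
layer = LoRALinear(frozen, rank=2)
output = layer.forward([1, 2, 3, 4])
```

Only A and B would be trained, which is r * (d_in + d_out) values instead of d_in * d_out. That saving is what makes this family of methods attractive when a single frozen model has to be shared on a device.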
Findings
The multimodal approach achieves lower equal error rates (EERs) than unimodal baselines.
The system requires only a small amount of training data (80k examples or fewer).
Low-dimensional audio representations outperform high-dimensional ones.
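The findings above are reported as equal error rate (EER), the operating point at which the false accept rate and the false reject rate coincide. The following is a minimal sketch of how an EER could be approximated from detector scores. It assumes higher scores mean "device-directed" and that labels are 1 for directed and 0 for non-directed speech; the function name and the toy scores are illustrative, not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        far = sum(s >= t for s in neg) / len(neg)  # non-directed accepted
        frr = sum(s < t for s in pos) / len(pos)   # directed rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example (assumed data): six utterances with detector scores.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On the toy data, a threshold of 0.7 accepts one of the three non-directed utterances and rejects one of the three directed ones, so both error rates equal 1/3. With finite data the two rates rarely match exactly, which is why the sketch reports the midpoint at the closest threshold.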
Abstract
Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data and resource efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or less examples of multimodal data using a…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
