SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions

Dominik Wagner; Alexander Churchill; Siddharth Sigtia; Erik Marchi

arXiv:2501.19377·cs.SD·February 4, 2025

Open Access

TL;DR

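SELMA is a unified speech-enabled language model that takes audio and text as joint inputs for virtual assistant tasks. It replaces the usual multi-stage input pipeline with a single model, reducing equal error rate (EER) by 64% on Voice Trigger detection and 22% on Device-Directed Speech Detection while keeping ASR word error rates close to the baseline.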

Contribution

The paper introduces a single end-to-end multi-task model for virtual assistant interactions, trained parameter-efficiently with low-rank adaptation modules and combined with a feature pooling strategy.

Findings

1. 64% reduction in equal error rate (EER) on Voice Trigger (VT) detection

2. 22% reduction in EER on Device-Directed Speech Detection (DDSD)

3. Word error rates close to the baseline on Automatic Speech Recognition (ASR)
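
For readers unfamiliar with the metric behind the first two findings, the following is a minimal sketch of how an equal error rate and a relative reduction can be computed. It is not the paper's evaluation code: the function names, the threshold sweep, and the numbers in the usage lines are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER for binary labels (1 = target) and real-valued scores.

    Sweeps every observed score as a threshold and returns the mean of the
    false-accept and false-reject rates where the two are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # negatives wrongly accepted
        frr = np.mean(~accept[labels == 1])  # positives wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def relative_reduction(baseline, new):
    """Fractional improvement of `new` over `baseline` (lower is better)."""
    return (baseline - new) / baseline

# Hypothetical scores and error rates, chosen only to illustrate the arithmetic.
toy_eer = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
toy_gain = relative_reduction(0.10, 0.036)
```

Under this reading, a 64% reduction means the new EER is 36% of the baseline EER, not that the EER dropped by 64 percentage points.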

Abstract

In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR) demonstrate that our approach both simplifies the typical input processing pipeline of virtual assistants significantly and…
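
The two mechanisms named in the abstract, low-rank adaptation and feature pooling, can be illustrated with a small NumPy sketch. This is a generic illustration under stated assumptions, not the authors' implementation: the abstract does not say which pooling operation is used or which layers carry the adapters, so mean pooling over time, the layer shapes, and the names `lora_linear` and `mean_pool` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_linear(x, W, A, B, alpha):
    """Linear layer with a frozen weight W plus a low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where r is the rank.
    Only A and B would be trained; W stays fixed.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def mean_pool(features):
    """Collapse (time, dim) features into one (dim,) summary vector.

    Assumption: mean pooling stands in for the paper's pooling strategy.
    """
    return features.mean(axis=0)

# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r, T = 16, 8, 2, 50
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init: update starts at 0

frames = rng.normal(size=(T, d_in))       # stand-in for audio encoder outputs
hidden = lora_linear(frames, W, A, B, alpha=4.0)
pooled = mean_pool(hidden)                # one vector summarizing the utterance

trainable = A.size + B.size               # r * (d_in + d_out)
frozen = W.size                           # d_in * d_out
```

A pooled vector of this kind suits utterance-level decisions such as VT detection or DDSD, where the label depends on the whole input, whereas ASR still needs the per-frame sequence. The parameter counts show why the adapter is cheap: the low-rank factors hold `r * (d_in + d_out)` values against `d_in * d_out` in the frozen weight.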

Taxonomy

Topics: AI in Service Interactions · Speech and dialogue systems · Natural Language Processing Techniques