Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang

arXiv:2410.02678·cs.CL·October 4, 2024


Open Access

TL;DR

This paper introduces DiVA (Distilled Voice Assistant), an end-to-end speech model trained without instruction data. It uses a text-only LLM's responses to transcripts as self-supervision and achieves competitive performance with more than 100x less training compute than state-of-the-art models such as Qwen 2 Audio.

Contribution

The paper presents a training paradigm for Speech LLMs that requires neither instruction data nor annotated responses, improving training efficiency and generalization.

Findings

01 DiVA better meets user preferences, achieving a 72% win rate against state-of-the-art models such as Qwen 2 Audio.

02 DiVA requires over 100x less training compute than those models.

03 DiVA generalizes to spoken question answering, classification, and translation.

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training…
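The self-supervision idea in the abstract can be illustrated with a toy distillation objective. The sketch below is not the paper's loss, which this summary does not state; it assumes a generic setup in which a text-only LLM (the teacher) reads the transcript, the speech model (the student) hears the audio, and the student is trained to match the teacher's output distribution at each response position. The function names and the toy logits are hypothetical, and KL divergence is used only as one common choice of matching criterion.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student).

    teacher_logits: logits from the text-only LLM given the transcript.
    student_logits: logits from the speech model given the audio.
    No annotated response is needed: the teacher's output is the target.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student need the same non-zero length")
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t), softmax(s))
    return total / len(teacher_logits)


# Hypothetical toy logits: 2 response positions, 3-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_close = [[1.8, 0.6, -0.9], [0.2, 1.4, 0.3]]
student_far = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

A student whose outputs track the teacher's gets a loss near zero, so minimizing this quantity pushes the speech model toward the text-only LLM's behavior using only audio and transcripts.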


Taxonomy

Topics: AI in Service Interactions · Intelligent Tutoring Systems and Adaptive Learning