Instruction-Following Speech Recognition

Cheng-I Jeff Lai; Zhiyun Lu; Liangliang Cao; Ruoming Pang

arXiv:2309.09843·cs.CL·September 19, 2023

Open Access

TL;DR

This paper introduces an instruction-following speech recognition model trained from scratch that can understand and execute diverse free-form instructions, enabling flexible speech tasks without relying on large language models.

Contribution

It presents a novel training approach for speech recognition models to follow natural language instructions, expanding capabilities beyond traditional transcription.

Findings

01. Model trained on Librispeech can interpret and execute instructions without pre-trained modules.

02. Enables tasks like transcript manipulation and summarization.

03. Provides selective transcription for privacy and safety.

Abstract

Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without…
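The data-centric idea in the abstract, pairing each utterance with a free-form text instruction and an instruction-dependent target, might be sketched as follows. This is only an illustration under stated assumptions: the instruction wordings, the `Example` type, and the `build_examples` helper are hypothetical and are not taken from the paper, which may construct its training data differently.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One hypothetical training pair: a text instruction and the expected output."""
    instruction: str
    target: str


def build_examples(transcript: str, rng: random.Random) -> List[Example]:
    """Derive several instruction/target pairs from one reference transcript.

    The instruction templates below are assumptions made for illustration.
    """
    words = transcript.split()
    old = rng.choice(words)   # word to mask in the selective-transcription case
    half = len(words) // 2
    return [
        # Plain transcription: the target is the full transcript.
        Example("Transcribe the audio.", transcript),
        # Ignoring speech: the target is empty.
        Example("Ignore the audio.", ""),
        # Selective transcription: every occurrence of one word is masked.
        Example(
            f"Transcribe the audio, replacing '{old}' with 'redacted'.",
            " ".join("redacted" if w == old else w for w in words),
        ),
        # Transcript manipulation: keep only the first half of the words.
        Example("Transcribe only the first half of the audio.", " ".join(words[:half])),
    ]
```

In such a setup the audio stays the same across all four pairs and only the instruction and target change, so the model would have to read the instruction to decide what to emit.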

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
