Prompting Large Language Models with Speech Recognition Abilities

Yassir Fathullah; Chunyang Wu; Egor Lakomkin; Junteng Jia; Yuan Shangguan; Ke Li; Jinxi Guo; Wenhan Xiong; Jay Mahadeokar; Ozlem Kalinli; Christian Fuegen; Mike Seltzer

arXiv:2307.11795 · eess.AS · July 25, 2023 · 2 cites

Open Access

TL;DR

This paper demonstrates that large language models can be extended with a small audio encoder to perform speech recognition, gaining multilingual capabilities and maintaining performance even when the LLM is frozen or when long audio segments are used.

Contribution

The paper introduces a method that enables LLMs to perform speech recognition by attaching an audio encoder, which allows multilingual ASR with minimal modifications to the original model.

Findings

01. Outperforms monolingual baselines by 18% on MLS
02. Enables multilingual speech recognition with LLaMA-7B
03. Maintains performance when the LLM is frozen or when long audio segments are used

Abstract

Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be…
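The prepending step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the conformer encoder is replaced here by simple frame stacking followed by a random linear projection, and every name (`stack_frames`, `project`, `build_llm_input`), dimension, and stride value is a hypothetical choice made only for illustration. The point is the shape bookkeeping: audio embeddings are mapped to the same width as the text token embeddings and placed in front of them, so the LLM sees one ordinary embedding sequence.

```python
import random


def stack_frames(frames, stride):
    """Concatenate every `stride` consecutive frames into one longer vector.

    Any incomplete group at the end is dropped. A larger stride yields
    fewer (but wider) vectors for the same audio.
    """
    usable = len(frames) - len(frames) % stride
    return [
        [x for frame in frames[i:i + stride] for x in frame]
        for i in range(0, usable, stride)
    ]


def project(vectors, weight):
    """Apply a linear map; `weight` has out_dim rows of in_dim values."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]


def build_llm_input(audio_frames, text_embeddings, weight, stride):
    """Return the audio embeddings followed by the text token embeddings."""
    audio_embeddings = project(stack_frames(audio_frames, stride), weight)
    return audio_embeddings + text_embeddings


# Toy sizes (assumptions): 10 audio frames of width 3, LLM embedding width 5.
rng = random.Random(0)
feature_dim, llm_dim, stride = 3, 5, 4
audio_frames = [[rng.random() for _ in range(feature_dim)] for _ in range(10)]
text_embeddings = [[rng.random() for _ in range(llm_dim)] for _ in range(3)]
# Stand-in for a trained encoder output layer: random weights, not learned.
weight = [[rng.random() for _ in range(feature_dim * stride)] for _ in range(llm_dim)]

llm_input = build_llm_input(audio_frames, text_embeddings, weight, stride)
print(len(llm_input), len(llm_input[0]))  # prints: 5 5
```

With a stride of 4, the 10 frames collapse to 2 audio embeddings, which sit in front of the 3 text embeddings. Raising the stride shortens the audio prefix further, which is the lever that matters when the audio is long.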

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques