Prompting Large Language Models with Speech Recognition Abilities
Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

TL;DR
This paper shows that attaching a small audio encoder to a large language model lets the model perform speech recognition, including multilingual recognition, and that performance holds up when the LLM is frozen or when long audio segments are used.
Contribution
The paper introduces a method that enables LLMs to perform speech recognition by attaching an audio encoder, allowing multilingual ASR with minimal modifications to the original model.
Findings
Outperforms monolingual baselines by 18% on MLS
Enables multilingual speech recognition with LLaMA-7B
Maintains performance when the LLM is frozen or when long audio segments are used
Abstract
Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be…
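The prepending step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the conformer encoder is replaced here by a toy frame-stacking step followed by a linear projection, the weights are random, and every name and size (`D_MODEL`, `STACK`, `encode_audio`, `build_llm_input`, and so on) is a hypothetical chosen for illustration. The only point it shows is that audio embeddings and text token embeddings share one width and are concatenated along the sequence axis before being passed to the LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical toy values, not the paper's configuration.
D_MODEL = 16   # embedding width shared by audio and text embeddings
VOCAB = 32     # toy vocabulary size
N_FEATS = 8    # acoustic feature dimension per frame
STACK = 4      # consecutive frames merged into one audio embedding

# Stand-ins for learned parameters; random here purely to illustrate shapes.
token_embedding = rng.normal(size=(VOCAB, D_MODEL))
audio_projection = rng.normal(size=(N_FEATS * STACK, D_MODEL))


def encode_audio(features: np.ndarray) -> np.ndarray:
    """Toy stand-in for the audio encoder: stack frames, then project to D_MODEL."""
    usable = (features.shape[0] // STACK) * STACK  # drop leftover frames
    stacked = features[:usable].reshape(-1, N_FEATS * STACK)
    return stacked @ audio_projection


def build_llm_input(features: np.ndarray, token_ids: list) -> np.ndarray:
    """Prepend audio embeddings to text token embeddings along the sequence axis."""
    audio_emb = encode_audio(features)                            # (T_audio, D_MODEL)
    text_emb = token_embedding[np.asarray(token_ids, dtype=int)]  # (T_text, D_MODEL)
    return np.concatenate([audio_emb, text_emb], axis=0)


features = rng.normal(size=(22, N_FEATS))  # 22 frames of fake acoustic features
inputs = build_llm_input(features, token_ids=[1, 5, 7])
print(inputs.shape)  # (8, 16): 5 audio embeddings followed by 3 text embeddings
```

In a real system the combined sequence would be fed to the decoder as input embeddings in place of a token-ID lookup, which is why the resulting model can be used in the same manner as its text-only counterpart.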
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
