AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K. Rubenstein; Chulayuth Asawaroengchai; Duc Dung Nguyen; Ankur Bapna; Zalán Borsos; Félix de Chaumont Quitry; Peter Chen; Dalia El Badawy; Wei Han; Eugene Kharitonov; Hannah Muckenhirn; Dirk Padfield; James Qin; Danny Rozenberg; Tara Sainath; Johan Schalkwyk; Matt Sharifi; Michelle Tadmor Ramanovich; Marco Tagliasacchi; Alexandru Tudor; Mihajlo Velimirović; Damien Vincent; Jiahui Yu; Yongqiang Wang; Vicky Zayats; Neil Zeghidour; Yu Zhang; Zhishuai Zhang; Lukas Zilka; Christian Frank

arXiv:2306.12925 · cs.CL · June 23, 2023 · 41 cites

Open Access

TL;DR

AudioPaLM is a unified multimodal model that combines speech and text processing capabilities, enabling advanced speech understanding, speech-to-speech translation, and zero-shot multilingual translation by leveraging large-scale pretraining.

Contribution

It introduces AudioPaLM, which fuses a text large language model (PaLM-2) and a speech language model (AudioLM) into a single architecture for speech and language tasks such as speech recognition and speech-to-speech translation.

Findings

01. Outperforms existing speech translation systems

02. Enables zero-shot speech-to-text translation for unseen language pairs

03. Transfers voice characteristics across languages in speech generation

Abstract

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems…
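The abstract says AudioPaLM is initialized from the weights of a text-only model but does not describe the mechanics here. One plausible way to picture it, offered only as a minimal sketch, is a joint vocabulary in which audio tokens are appended after the text tokens and the embedding table gains freshly initialized rows for them. The function names `extend_embeddings` and `audio_token_id`, the Gaussian initialization, and the toy sizes are all assumptions for illustration, not details taken from the paper.

```python
import random


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0, scale=0.02):
    """Return a new embedding table: pretrained text rows, then new audio rows.

    Assumption: the new audio rows are drawn from a small Gaussian. The
    abstract does not say how they are initialized.
    """
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    # Copy the pretrained text rows so the original table is left untouched.
    table = [list(row) for row in text_embeddings]
    # Append one freshly initialized row per audio token.
    for _ in range(num_audio_tokens):
        table.append([rng.gauss(0.0, scale) for _ in range(dim)])
    return table


def audio_token_id(audio_code, text_vocab_size):
    """Map an audio codebook index to its id in the joint vocabulary."""
    if audio_code < 0:
        raise ValueError("audio_code must be non-negative")
    # Text tokens occupy ids [0, text_vocab_size); audio tokens follow them.
    return text_vocab_size + audio_code


# Toy example: 3 text tokens with 2-dimensional embeddings, 4 audio tokens.
text_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
joint_table = extend_embeddings(text_table, num_audio_tokens=4)
```

In this sketch the text rows carry over unchanged, so whatever the text-only model learned in pretraining is still available when training continues on mixed speech and text sequences; only the appended audio rows start from scratch.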

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling