AudioPaLM: A Large Language Model That Can Speak and Listen
Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk

TL;DR
AudioPaLM is a unified multimodal model that processes and generates both speech and text. It supports speech recognition, speech-to-speech translation, and zero-shot speech-to-text translation, and it benefits from being initialized with the weights of a text-only pretrained language model.
Contribution
It introduces AudioPaLM, which fuses a text large language model (PaLM-2) and a speech language model (AudioLM) into a single architecture for speech and language tasks.
Findings
Outperforms existing speech translation systems
Enables zero-shot speech-to-text translation for unseen language pairs
Transfers voice characteristics across languages in speech generation
Abstract
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems…
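The abstract excerpt above does not describe how a text-only model is adapted to handle speech. One plausible reading, sketched here purely as an illustration, is that the pretrained token embedding table is extended with extra rows for discrete audio tokens, so that text and audio share one vocabulary while the text rows keep their pretrained values. The function names, toy sizes, and Gaussian initialization below are assumptions, not details taken from this page.

```python
import random


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0, scale=0.02):
    """Return a joint text+audio embedding table (hypothetical helper).

    Rows for existing text tokens are copied unchanged, standing in for the
    pretrained initialization. Rows for the new audio tokens are drawn from
    a small Gaussian, because they have no pretrained values (an assumption).
    """
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    audio_rows = [
        [rng.gauss(0.0, scale) for _ in range(dim)]
        for _ in range(num_audio_tokens)
    ]
    return [list(row) for row in text_embeddings] + audio_rows


def audio_token_id(audio_code, text_vocab_size):
    """Map a discrete audio code to its ID in the joint vocabulary."""
    return text_vocab_size + audio_code


# Toy sizes, chosen only for illustration.
text_table = [[0.1 * i, -0.1 * i] for i in range(4)]  # 4 text tokens, dim 2
joint_table = extend_embeddings(text_table, num_audio_tokens=3)
```

With this layout, text token IDs are unchanged and audio code `k` maps to ID `text_vocab_size + k`, so a single decoder can consume mixed sequences of text and audio tokens.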
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
