Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

TL;DR
Pengi is an audio language model that frames all audio tasks as text-generation problems. This lets it handle both open-ended and closed-ended tasks without additional fine-tuning, and it reports state-of-the-art results on several of its 22 downstream evaluation tasks.
Contribution
Introduces Pengi, a novel Audio Language Model that frames all audio tasks as text-generation tasks, which supports open-ended tasks without task-specific modifications.
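To make the "every task is text generation" framing concrete, the following is a minimal sketch of how labeled audio examples from different tasks could be rewritten as (audio, input text) -> output text pairs. The prompt wording, the `TASK_PROMPTS` table, and the `frame_as_text_generation` helper are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical prompt templates: the wording is illustrative only.
TASK_PROMPTS = {
    "captioning": "generate audio caption",
    "question_answering": "question: {question}",
    "classification": "this is a sound of",
}


@dataclass
class TextGenExample:
    audio_path: str   # the audio input
    input_text: str   # the text input that states the task
    target_text: str  # the free-form text the model should generate


def frame_as_text_generation(task, audio_path, target_text, **fields):
    """Rewrite one labeled audio example as a text-generation example."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    input_text = TASK_PROMPTS[task].format(**fields)
    return TextGenExample(audio_path, input_text, target_text)


# A closed-ended label and an open-ended answer share the same output type.
cls_example = frame_as_text_generation("classification", "clip.wav", "dog bark")
qa_example = frame_as_text_generation(
    "question_answering", "clip.wav", "yes", question="is a dog barking?"
)
```

Under this framing a class label is just a short target string, so classification and captioning differ only in the text input and in how long the expected output is.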
Findings
State-of-the-art performance on several of the 22 downstream tasks evaluated
Effective handling of open-ended audio tasks like captioning and Q&A
Unified architecture enables versatile audio understanding
Abstract
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a…
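The prefix construction described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions: the encoders below are random-projection stand-ins written in plain Python, and `EMBED_DIM`, `toy_audio_encoder`, `toy_text_encoder`, and `build_prefix` are hypothetical names, not Pengi's actual components. The sketch stops at the prefix, because the excerpt above is cut off before it names what the prefix prompts.

```python
import random

EMBED_DIM = 8  # assumed toy embedding width


def toy_audio_encoder(samples, frame_size=4, dim=EMBED_DIM):
    """Stand-in audio encoder: one continuous embedding per frame of samples."""
    rng = random.Random(0)  # fixed seed so the projection is reproducible
    proj = [[rng.uniform(-1, 1) for _ in range(frame_size)] for _ in range(dim)]
    # Split the waveform into non-overlapping frames, dropping any remainder.
    frames = [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [[sum(w * x for w, x in zip(row, frame)) for row in proj]
            for frame in frames]


def toy_text_encoder(text, dim=EMBED_DIM):
    """Stand-in text encoder: one deterministic embedding per whitespace token."""
    embeddings = []
    for token in text.split():
        rng = random.Random(token)  # seed by token so equal tokens match
        embeddings.append([rng.uniform(-1, 1) for _ in range(dim)])
    return embeddings


def build_prefix(samples, text):
    """Concatenate the audio and text embedding sequences into one prefix."""
    return toy_audio_encoder(samples) + toy_text_encoder(text)


prefix = build_prefix([0.1] * 12, "generate audio caption")
```

Here 12 samples give 3 audio frames and the prompt gives 3 tokens, so `prefix` is a sequence of 6 vectors with a shared width. Both encoders must emit embeddings of the same dimension for the two sequences to be combined into a single prefix.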
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
