Pengi: An Audio Language Model for Audio Tasks

Soham Deshmukh; Benjamin Elizalde; Rita Singh; Huaming Wang

arXiv:2305.11834·eess.AS·January 22, 2024·20 cites

TL;DR

Pengi is a versatile audio language model that transforms all audio tasks into text-generation problems, enabling open-ended and closed-ended tasks without additional fine-tuning, achieving state-of-the-art results across multiple benchmarks.

Contribution

Introduces Pengi, a novel Audio Language Model that unifies audio tasks as text-generation, allowing open-ended tasks without task-specific modifications.

Findings

1. State-of-the-art performance on several of the 22 downstream tasks evaluated

2. Effective handling of open-ended audio tasks such as captioning and question answering

3. A unified architecture that handles both open-ended and closed-ended tasks without task-specific extensions

Abstract

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a…
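The prefix construction described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the encoders are toy stand-ins, and `D_MODEL`, `audio_encoder`, `text_encoder`, `build_prefix`, and the example prompt are all hypothetical names and values chosen for the sketch. The model that consumes the prefix is left out, since the abstract is truncated at that point.

```python
import numpy as np

D_MODEL = 8  # hypothetical shared embedding width (assumption, not from the paper)


def audio_encoder(waveform: np.ndarray, n_frames: int = 4) -> np.ndarray:
    """Toy stand-in: map a waveform to a sequence of continuous embeddings."""
    frames = np.array_split(waveform, n_frames)
    # Two crude features per frame (mean, std), tiled to D_MODEL dimensions.
    feats = np.array([[f.mean(), f.std()] for f in frames])
    return np.tile(feats, (1, D_MODEL // 2))


def text_encoder(prompt: str) -> np.ndarray:
    """Toy stand-in: one continuous embedding per whitespace-separated token."""
    rows = [np.full(D_MODEL, (sum(map(ord, tok)) % 97) / 97.0)
            for tok in prompt.split()]
    return np.array(rows)


def build_prefix(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Combine both embedding sequences into one prefix, audio first."""
    return np.concatenate([audio_emb, text_emb], axis=0)


waveform = np.sin(np.linspace(0.0, 1.0, 1600))  # placeholder audio signal
prompt = "generate audio caption"               # hypothetical text input
prefix = build_prefix(audio_encoder(waveform), text_encoder(prompt))
print(prefix.shape)
```

Placing the audio embeddings before the text embeddings is a choice made for this sketch; the visible part of the abstract only says that the two sequences are combined.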


Code & Models

Repositories

microsoft/pengi (PyTorch, official)

Videos

Pengi: An Audio Language Model for Audio Tasks · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing