UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

Dongchao Yang; Haohan Guo; Yuanyuan Wang; Rongjie Huang; Xiang Li; Xu Tan; Xixin Wu; Helen Meng

arXiv:2406.10056 · cs.SD · June 17, 2024 · 2 cites

TL;DR

UniAudio 1.5 introduces a novel LLM-driven audio codec enabling large language models to perform multiple audio tasks in a few-shot manner without fine-tuning, by translating audio into a textual token space.

Contribution

The paper presents LLM-Codec, a new method to convert audio into a text-like token space for LLMs, enabling cross-modal few-shot learning for audio tasks.

Findings

01. Effective in multiple audio understanding tasks
02. Achieves good performance with few examples
03. Open-sourced LLM-Codec for research use

Abstract

Large language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but they cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach that enables frozen LLMs to perform multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, to transfer the audio modality into the textual space, i.e., representing audio tokens with words or sub-words in the vocabulary of LLMs, while keeping high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the token space of a well-trained LLM. Thus, the audio representation can be viewed as a new "foreign language", and LLMs can learn the new…
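The idea in the abstract, representing audio as words from an LLM's vocabulary and then prompting a frozen LLM with a few labeled examples, can be illustrated with a toy sketch. This is not the authors' LLM-Codec, which is a learned neural codec; it only assumes that each audio frame has already been encoded as a small feature vector and swaps in a plain nearest-neighbor lookup against a fixed word-embedding table. The vocabulary, embeddings, labels, and function names (`VOCAB`, `nearest_word`, `audio_to_words`, `build_few_shot_prompt`) are all hypothetical.

```python
import math

# Hypothetical toy "LLM vocabulary": each word has a fixed (frozen) embedding.
# A real LLM vocabulary has tens of thousands of entries and far more dimensions.
VOCAB = {
    "river": (0.9, 0.1),
    "stone": (0.1, 0.9),
    "cloud": (-0.8, 0.2),
    "ember": (0.0, -1.0),
}


def nearest_word(frame):
    """Return the vocabulary word whose embedding is closest to the frame."""
    return min(VOCAB, key=lambda word: math.dist(VOCAB[word], frame))


def audio_to_words(frames):
    """Quantize a sequence of frame vectors into a sequence of words."""
    return [nearest_word(frame) for frame in frames]


def build_few_shot_prompt(examples, query_frames):
    """Format labeled examples plus one unlabeled query as a text prompt.

    The prompt would be sent to a frozen LLM, which is asked to continue
    the text after the final "Label:" line. No parameters are updated.
    """
    blocks = []
    for frames, label in examples:
        words = " ".join(audio_to_words(frames))
        blocks.append(f"Audio: {words}\nLabel: {label}")
    query_words = " ".join(audio_to_words(query_frames))
    blocks.append(f"Audio: {query_words}\nLabel:")
    return "\n\n".join(blocks)


# Made-up frame vectors and labels, purely for illustration.
demo_examples = [
    ([(1.0, 0.0), (0.8, 0.2)], "dog bark"),
    ([(0.0, 1.0), (0.2, 0.8)], "rain"),
]
prompt = build_few_shot_prompt(demo_examples, [(0.9, 0.0), (1.0, 0.1)])
print(prompt)
```

Because the quantizer only ever emits entries that already exist in the vocabulary, the resulting prompt is ordinary text from the LLM's point of view, which is what lets the model stay frozen in this sketch.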

Code & Models

Repositories

yangdongchao/llm-codec (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing