WavJourney: Compositional Audio Creation with Large Language Models

Xubo Liu; Zhongkai Zhu; Haohe Liu; Yi Yuan; Meng Cui; Qiushi Huang; Jinhua Liang; Yin Cao; Qiuqiang Kong; Mark D. Plumbley; Wenwu Wang

arXiv:2307.14335 · cs.SD · November 28, 2023 · 6 cites



TL;DR

WavJourney is a novel framework that uses large language models to generate complex, multi-element audio content from textual descriptions, enabling controllable and realistic storytelling audio synthesis.

Contribution

It introduces a new method that connects LLMs with audio models through structured scripts for compositional audio creation from text.

Findings

01. Achieves state-of-the-art results on text-to-audio benchmarks.

02. Synthesizes realistic, multi-element audio aligned with semantic and spatial conditions.

03. Facilitates human-machine co-creation in multi-round dialogues.

Abstract

Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program,…
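The script-then-program idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' format or code: the `Element` fields, the `compile_script` helper, and the rule that background elements span the full foreground timeline are all hypothetical, and durations are hard-coded where a real system would obtain them from the audio models.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One entry of a hypothetical audio script (schema is an assumption)."""
    audio_type: str        # "speech", "music", or "sound_effect"
    layout: str            # "foreground" or "background"
    text: str              # transcript or description for the audio model
    duration: float        # seconds; hard-coded here for illustration
    volume_db: float = 0.0


@dataclass
class Placed:
    """An element with its position on the final timeline."""
    element: Element
    start: float
    end: float


def compile_script(script):
    """Turn a script into a timeline.

    Assumed rule: foreground elements play one after another, and
    background elements are stretched over the whole foreground span.
    """
    timeline = []
    cursor = 0.0
    for el in script:
        if el.layout == "foreground":
            timeline.append(Placed(el, cursor, cursor + el.duration))
            cursor += el.duration
    for el in script:
        if el.layout == "background":
            timeline.append(Placed(el, 0.0, cursor))
    return timeline


script = [
    Element("music", "background", "calm piano", 0.0, -20.0),
    Element("sound_effect", "foreground", "door creaks open", 2.0),
    Element("speech", "foreground", "Hello, is anyone home?", 3.0),
]
timeline = compile_script(script)
for p in timeline:
    print(f"{p.start:4.1f}-{p.end:4.1f}s  {p.element.audio_type}: {p.element.text}")
```

In a full system, each placed element would be handed to a task-specific audio model and the results mixed at the stated volumes; this sketch stops at the timeline so that the structure of the script stays visible.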

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper focuses on compositional audio content creation rather than an end-to-end generation model. In the audio domain, I believe this kind of compositional creation will be more useful: audio has a specific, physically grounded time-frequency relationship and is very sensitive to noise (in human perception), so compositional creation can be one solution. The paper describes one way to do this by handling sound types, sound levels, sound positions, etc.

Weaknesses

I think this paper is more like a position paper than an experimental paper. The authors evaluated the proposed method in two ways. First, to validate the method, they evaluated it on the already established text-to-audio generation task. This evaluation showed that the proposed compositional audio generation method works well on an audio generation task, which I think is enough to show its effectiveness. As a second experiment, the aut

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

The problem is interesting.

Weaknesses

1. This paper handles temporal relationships in audio by generating audio with multiple models and manually connecting the generated clips. However, given the ambiguity of language and the complexity of natural audio, there may be partial overlap among the generated foreground audio lists. This method cannot fit the distribution of natural audio data well. The audio shown in the demo and the mel-spectrogram shown in the article show this shortcoming: there is a clear separation betwe

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1) The generated samples are well done. 2) The framework is training-free. 3) The framework offers high interpretability and flexible ways to create audio content. 4) Novel subjective evaluation metrics are proposed.

Weaknesses

1) While the generated results are impressive, this work focuses more on production than on academic research. It shows the best we can achieve when combining state-of-the-art models, and the contribution is limited from the perspective of technical novelty. 2) Since the SOTA generative models are not perfect everywhere, those models might sometimes fail to generate the expected content. The proposed framework does not seem to take this into consideration and may not always be reliable or robust to

Code & Models

Repositories

audio-agi/wavjourney (PyTorch, official)


Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies