Probing Audio-Generation Capabilities of Text-Based Language Models

Arjun Prasaath Anbazhagan; Parteek Kumar; Ujjwal Kaur; Aslihan Akalin; Kevin Zhu; Sean O'Brien

arXiv:2506.00003 · cs.SD · June 3, 2025

TL;DR

This paper explores whether large language models can generate audio from text prompts by using code as an intermediary, and finds that their capabilities degrade as audio complexity increases.

Contribution

The paper introduces a three-tier approach for prompting LLMs to generate audio at increasing levels of complexity (musical notes, environmental sounds, human speech) and evaluates the output with FAD and CLAP scores.
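For reference, FAD (Fréchet Audio Distance), one of the two metrics named in the abstract, is conventionally defined as the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio. The form below is that standard definition, not something stated on this page, and the embedding model the authors used is not given here. The symbols are assumptions for illustration: \mu_r, \Sigma_r are the mean and covariance of the reference embeddings, and \mu_g, \Sigma_g those of the generated embeddings.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Under this definition a lower score means the generated audio's embedding distribution is closer to the reference distribution.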

Findings

1. LLMs can generate basic audio features.

2. Performance declines as audio complexity increases.

3. LLMs appear to hold a latent understanding of the auditory world.

Abstract

How does the textual representation of audio relate to what Large Language Models (LLMs) learn about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite being trained primarily on textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we use code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD (Fréchet Audio Distance) and CLAP (Contrastive Language-Audio Pretraining) scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world,…
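To make the code-as-intermediary idea concrete, here is a minimal sketch of the kind of program an LLM might be asked to write for the first tier (musical notes). It is not taken from the paper: the function names, the 44.1 kHz sample rate, the 10 ms fade, and the output file name are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second (assumed default, not from the paper)


def note_frequency(midi_note):
    """Equal-temperament frequency in Hz for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def synthesize_note(freq_hz, duration_s, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a list of 16-bit PCM samples for a sine tone with short fades."""
    n_samples = int(duration_s * sample_rate)
    # Linear fade-in/out (about 10 ms) to avoid clicks at the edges.
    fade = min(n_samples // 2, int(0.01 * sample_rate))
    samples = []
    for i in range(n_samples):
        envelope = 1.0
        if fade and i < fade:
            envelope = i / fade
        elif fade and i >= n_samples - fade:
            envelope = (n_samples - 1 - i) / fade
        value = amplitude * envelope * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        samples.append(int(round(value * 32767)))
    return samples


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


if __name__ == "__main__":
    # Hypothetical prompt: "generate the note A4 for one second".
    write_wav("a4.wav", synthesize_note(note_frequency(69), 1.0))
```

A single sine tone like this is fully specified by a frequency and a duration, which is one plausible reason the first tier is the easiest. Environmental sounds and speech have no comparably short closed-form description, which is consistent with the reported drop in performance at higher tiers.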

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies