FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

Yuxuan Jiang; Zehua Chen; Zeqian Ju; Chang Li; Weibei Dou; Jun Zhu

arXiv:2507.08557 · cs.SD · September 19, 2025

TL;DR

FreeAudio is a training-free framework for timing-controlled, long-form text-to-audio (T2A) generation. It uses an LLM to plan non-overlapping time windows and recaption each one, and relies on attention mechanisms to maintain quality and consistency.

Contribution

FreeAudio is presented as the first training-free approach to long-form, timing-controlled T2A generation, combining LLM-based timing planning with attention techniques.

Findings

01 Achieves state-of-the-art quality among training-free methods.

02 Performs comparably to training-based methods in long-form generation.

03 Demonstrates effective timing control on complex audio prompts.

Abstract

Text-to-audio (T2A) generation has achieved promising results with recent advances in generative models. However, because of the limited quality and quantity of temporally aligned audio-text pairs, existing T2A methods struggle to handle complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, but their synthesis quality remains limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural…
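To make the idea of non-overlapping time windows concrete, here is a minimal sketch using the abstract's own example prompt. It is not the authors' method: FreeAudio delegates planning and recaptioning to an LLM, whereas this sketch swaps in a simple deterministic rule (cut the timeline at every event boundary and list the events active in each window). The function name `plan_windows` and the `Event` tuple layout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (description, start in seconds, end in seconds).
Event = Tuple[str, float, float]


def plan_windows(events: List[Event]) -> List[Tuple[float, float, List[str]]]:
    """Cut the timeline at every event boundary and list the events active in each window.

    Assumption: a deterministic stand-in for the LLM planning step, not the paper's procedure.
    """
    # Every start or end time becomes a cut point, so consecutive windows never overlap.
    boundaries = sorted({t for _, start, end in events for t in (start, end)})
    windows = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        # An event is active in a window if it spans the whole window.
        active = [name for name, start, end in events if start <= lo and end >= hi]
        windows.append((lo, hi, active))
    return windows


events = [("owl hooted", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)]
windows = plan_windows(events)
for lo, hi, active in windows:
    print(f"{lo:.1f}s-{hi:.1f}s: {' and '.join(active) or 'silence'}")
```

For the example prompt this yields three windows: crickets alone from 0s to 2.4s, owl and crickets together from 2.4s to 5.2s, and crickets alone from 5.2s to 24s. In the framework described in the abstract, each such window would then receive a refined natural-language caption from the LLM; the joined event names here only stand in for that step.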
