In-Context Prompt Editing For Conditional Audio Generation
Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra

TL;DR
This paper introduces a retrieval-based in-context prompt editing method for text-to-audio generation. It rewrites user prompts using training captions as exemplars, which improves generated audio quality and mitigates the distributional shift between user prompts and training prompts.
Contribution
The paper proposes a retrieval-based prompt editing framework that improves audio quality in conditional generation by using training captions as exemplars for editing user prompts.
Findings
Audio quality improved across user prompts
Prompt editing reduces distributional shift effects
Framework leverages training captions as exemplars
Abstract
Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional audio generation in the wild as user prompts are under-specified. In particular, we observe a consistent audio quality degradation in generated audio samples with user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revisit the user prompts. We show that the framework enhanced the audio quality across the set of collected user prompts, which were edited…
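The retrieve-then-edit idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the bag-of-words cosine retriever, the function names `retrieve_exemplars` and `build_edit_prompt`, the instruction wording, and the toy captions are all hypothetical. The sketch only shows how training captions closest to a user prompt might be selected and placed in an in-context prompt for a language model to rewrite; the language model call itself is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_exemplars(user_prompt, training_captions, k=2):
    """Return the k training captions most similar to the user prompt.

    Assumption: a simple lexical retriever stands in for whatever
    retriever the paper actually uses.
    """
    query = Counter(tokenize(user_prompt))
    scored = [(cosine(query, Counter(tokenize(c))), c) for c in training_captions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def build_edit_prompt(user_prompt, exemplars):
    """Assemble a hypothetical in-context prompt asking an LLM to rewrite."""
    lines = [
        "Rewrite the prompt in the style of the example captions.",
        "Example captions:",
    ]
    lines += [f"- {caption}" for caption in exemplars]
    lines += [f"Prompt: {user_prompt}", "Rewritten prompt:"]
    return "\n".join(lines)


# Toy training captions, invented for this example.
captions = [
    "A dog barks loudly while birds chirp in the background",
    "Rain falls on a metal roof with distant thunder",
    "A man speaks followed by a dog barking",
]

exemplars = retrieve_exemplars("dog barking", captions, k=2)
print(build_edit_prompt("dog barking", exemplars))
```

The printed text would be sent to a language model, and its completion would replace the under-specified user prompt before audio generation. In practice a dense text encoder would likely be a better retriever than word overlap; the lexical version is used here only to keep the sketch dependency-free.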
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Sparse Evolutionary Training
