Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations
Jiahui Wu

TL;DR
This paper evaluates the robustness of text-to-audio models under controlled, meaning-preserving prompt variations. Larger models are more consistent, but all tested models show fragility in semantic-to-acoustic realization.
Contribution
It introduces a novel framework for assessing semantic robustness in text-to-audio systems across multiple representational levels.
Findings
Larger models like MusicGen-large show higher semantic consistency.
All models exhibit divergence in acoustic and temporal features despite high embedding similarity.
Fragility mainly occurs during semantic-to-acoustic realization, not in multi-modal embedding alignment.
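The contrast in these findings, high embedding similarity alongside divergent acoustic features, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the vectors are toy values, not results from the paper, and `cosine_similarity` and `relative_divergence` are hypothetical helpers rather than the metrics the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_divergence(a, b):
    """Mean relative absolute difference between two feature vectors."""
    total = 0.0
    for x, y in zip(a, b):
        # Normalize by the larger magnitude so features on different scales are comparable.
        total += abs(x - y) / max(abs(x), abs(y), 1e-12)
    return total / len(a)


# Toy embeddings for audio generated from a base prompt and a perturbed variant
# (hypothetical values, not taken from the paper).
base_embedding = [0.9, 0.1, 0.4]
variant_embedding = [0.88, 0.12, 0.41]

# Toy acoustic descriptors: [tempo in BPM, a normalized spectral feature]
# (hypothetical values, not taken from the paper).
base_acoustic = [120.0, 0.30]
variant_acoustic = [96.0, 0.45]

embedding_sim = cosine_similarity(base_embedding, variant_embedding)
acoustic_div = relative_divergence(base_acoustic, variant_acoustic)

print(f"embedding similarity: {embedding_sim:.4f}")
print(f"acoustic divergence:  {acoustic_div:.4f}")
```

In this toy case the two outputs are nearly identical in embedding space while their acoustic descriptors differ substantially, which is the pattern the findings describe.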
Abstract
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We select MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models and evaluate them under three perturbation types: Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary…
