AudioGenX: Explainability on Text-to-Audio Generative Models
Hyunju Kang, Geonhee Han, Yoonjae Jeong, Hogun Park

TL;DR
AudioGenX is a novel explainability method for text-to-audio models that highlights input token importance, improving transparency and trustworthiness of audio generation from text descriptions.
Contribution
We introduce AudioGenX, an explainability technique that uses factual and counterfactual objectives to provide faithful, token-level explanations for text-to-audio models.
Findings
AudioGenX produces more faithful explanations than existing methods.
The method enhances understanding of text-to-audio relationships.
Experiments using new evaluation metrics validate the effectiveness of AudioGenX.
Abstract
Text-to-audio generation (TAG) models have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for TAG models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked…
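To make the factual and counterfactual objectives concrete, here is a minimal sketch of how a token-importance mask could be scored. It is not the AudioGenX implementation: `toy_model`, `explanation_loss`, the squared-error distance, and the `sparsity` weight are all hypothetical stand-ins chosen for illustration, and a real TAG model would replace the toy weighted sum.

```python
import numpy as np

def toy_model(token_embeddings, weights):
    # Hypothetical stand-in for a TAG model: a weighted sum of the
    # text-token embeddings passed through tanh. Not the real model.
    return np.tanh(weights @ token_embeddings)

def explanation_loss(mask, token_embeddings, weights, sparsity=0.1):
    # mask holds one importance score in [0, 1] per text token.
    original = toy_model(token_embeddings, weights)
    kept = toy_model(mask[:, None] * token_embeddings, weights)
    removed = toy_model((1.0 - mask)[:, None] * token_embeddings, weights)
    # Factual term: keeping only the important tokens should
    # reproduce the original output.
    factual = np.sum((kept - original) ** 2)
    # Counterfactual term: removing the important tokens should
    # change the output, so a larger change lowers the loss.
    counterfactual = -np.sum((removed - original) ** 2)
    # Assumed sparsity penalty that favours small masks.
    return factual + counterfactual + sparsity * np.sum(mask)

# Three text tokens with 2-dimensional embeddings; by construction
# only the first token influences the toy model's output.
tokens = np.array([[1.0, 0.5], [0.3, -0.2], [0.1, 0.4]])
weights = np.array([2.0, 0.0, 0.0])

good_mask = np.array([1.0, 0.0, 0.0])  # marks the influential token
bad_mask = np.array([0.0, 1.0, 1.0])   # marks the irrelevant tokens

good_loss = explanation_loss(good_mask, tokens, weights)
bad_loss = explanation_loss(bad_mask, tokens, weights)
print(good_loss < bad_loss)
```

In this toy setting the mask that selects the influential token scores lower than the mask that selects the irrelevant ones. An Explainer in the sense of the abstract would be trained to produce such masks rather than having them supplied by hand.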
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
