FxSearcher: gradient-free text-driven audio transformation
Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

TL;DR
FxSearcher is a gradient-free framework that uses Bayesian Optimization and CLAP to efficiently discover audio effect configurations for text-driven audio transformation, achieving results aligned with human preferences.
Contribution
The paper introduces a gradient-free approach that combines Bayesian Optimization with CLAP to discover audio effect configurations from text prompts, and proposes an AI-based framework for evaluating the results.
Findings
High alignment with human preferences in audio transformation quality
Effective discovery of audio effects configurations without gradients
Superior performance over baseline methods
Abstract
Achieving diverse and high-quality audio transformations from text prompts remains challenging, as existing methods are fundamentally constrained by their reliance on a limited set of differentiable audio effects. This paper proposes FxSearcher, a novel gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt. Our method employs Bayesian Optimization and a CLAP-based score function to perform this search efficiently. Furthermore, a guiding prompt is introduced to prevent undesirable artifacts and enhance human preference. To objectively evaluate our method, we propose an AI-based evaluation framework. The results demonstrate that the highest scores achieved by our method on these metrics align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/
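To make the idea of a gradient-free search concrete, the following is a minimal sketch of Bayesian Optimization over two effect parameters, written with the Python standard library only. It is not the authors' implementation. The score function `toy_score` is a hypothetical stand-in for the CLAP-based text-audio score (a real system would render audio through the effect chain and score it against the prompt), and the parameter names `gain` and `drive`, the RBF kernel, the expected-improvement acquisition, and all hyperparameters are assumptions chosen for illustration.

```python
import math
import random

def toy_score(params):
    """Hypothetical stand-in for a CLAP similarity score; peaks at (0.6, 0.3)."""
    gain, drive = params
    return -((gain - 0.6) ** 2 + (drive - 0.3) ** 2)

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two parameter vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular L with L @ L.T == A, for a positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x

def fit(X, y_centered, noise=1e-4):
    """Fit a zero-mean Gaussian process to centered observations."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    return L, solve_upper_t(L, solve_lower(L, y_centered))

def predict(X, L, alpha, x_new, y_mean):
    """Posterior mean and standard deviation at x_new."""
    k = [rbf(a, x_new) for a in X]
    mean = y_mean + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 1e-12)
    return mean, math.sqrt(var)

def expected_improvement(mean, std, best):
    """Expected improvement over the best score seen so far (maximization)."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def search(score_fn, n_init=5, n_iter=15, n_cand=100, seed=0):
    """Gradient-free search: only score_fn evaluations are used, no gradients."""
    rng = random.Random(seed)
    sample = lambda: (rng.random(), rng.random())  # parameters scaled to [0, 1]
    X = [sample() for _ in range(n_init)]
    y = [score_fn(x) for x in X]
    for _ in range(n_iter):
        y_mean = sum(y) / len(y)
        L, alpha = fit(X, [v - y_mean for v in y])
        best = max(y)
        candidates = [sample() for _ in range(n_cand)]
        nxt = max(candidates, key=lambda c: expected_improvement(
            *predict(X, L, alpha, c, y_mean), best))
        X.append(nxt)
        y.append(score_fn(nxt))
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i], list(zip(X, y))

if __name__ == "__main__":
    params, score, _ = search(toy_score)
    print("best parameters:", params, "score:", score)
```

Because the loop only ever calls `score_fn`, the effect chain behind it does not need to be differentiable, which is the property the abstract emphasizes. Swapping `toy_score` for a function that applies real effects and returns a text-audio similarity would leave the search loop unchanged.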
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
