Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
Haowen Li, Tianxiang Li, Yi Yang, Boyu Cao, Qi Liu

TL;DR
Polyphonia introduces a zero-shot timbre transfer method for polyphonic music that uses acoustic-informed attention calibration to improve stem-specific editing accuracy.
Contribution
The paper presents a framework that combines semantic cross-attention with acoustic priors for precise stem-specific timbre transfer in dense musical mixtures.
Findings
Achieves 15.5% higher target alignment than baselines.
Maintains competitive music fidelity.
Effectively localizes targets with reduced boundary leakage.
Abstract
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse…
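The abstract is cut off before it describes how the acoustic prior and the attention map are combined, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes a semantic attention map and an acoustic prior defined on the same time-frequency grid, gates one with the other by elementwise multiplication, and uses the result to blend an edited signal into the untouched source. The function names `calibrate_attention` and `blend`, the multiplicative rule, and the peak normalization are all hypothetical choices made for this example.

```python
import numpy as np


def calibrate_attention(attn, prior):
    """Gate a semantic attention map with an acoustic prior.

    Hypothetical rule for illustration only: elementwise product
    followed by peak normalization. Both inputs are arrays of the
    same shape with values in [0, 1].
    """
    attn = np.asarray(attn, dtype=float)
    prior = np.asarray(prior, dtype=float)
    if attn.shape != prior.shape:
        raise ValueError("attn and prior must have the same shape")
    # Bins the prior considers unlikely to belong to the target stem
    # are suppressed even if the semantic attention is high there.
    gated = attn * np.clip(prior, 0.0, 1.0)
    peak = gated.max()
    return gated / peak if peak > 0 else gated


def blend(source, edited, mask):
    """Keep the source where mask is 0 and the edit where mask is 1."""
    source = np.asarray(source, dtype=float)
    edited = np.asarray(edited, dtype=float)
    return mask * edited + (1.0 - mask) * source


# Toy 2x2 grid: attention is diffuse, the prior is confident only in column 0.
attn = np.array([[0.8, 0.6], [0.4, 0.2]])
prior = np.array([[1.0, 0.0], [1.0, 0.0]])
mask = calibrate_attention(attn, prior)
output = blend(np.zeros((2, 2)), np.ones((2, 2)), mask)
```

In this toy case the second column of `mask` is zero, so the background there is copied from the source unchanged. That is the behavior the abstract describes as reduced boundary leakage, reproduced here only under the stated assumptions.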
