S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation
Huakang Chen, Wenkai Cheng, Guobin Ma, Chunbo Hao, Yuxuan Xia, Mengqi Wei, Zhixian Zhao, Pengcheng Zhu, Hanbing Zhang, Lei Xie

TL;DR
S2Accompanist is a diffusion model that enhances music accompaniment generation by incorporating semantic awareness and structural guidance, achieving state-of-the-art results with limited data and computational resources.
Contribution
The paper introduces a semantic-aware, structure-guided diffusion model for music accompaniment generation, together with an automated data pipeline and a specialized fine-tuning strategy.
Findings
Achieved state-of-the-art performance on the ATTM Grand Challenge benchmark.
Secured first place in the Efficiency Track with only 402M parameters.
Delivered results competitive with those of larger models.
Abstract
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control because they rely on coarse, track-level annotations. To address these limitations under constrained data and computing resources, we propose S2Accompanist, a Semantic-Aware and Structure-Guided Diffusion Model developed for the ICME2026 ATTM Grand Challenge. Specifically, we design an automated data pipeline comprising structural segmentation, Large Audio-Language Model-driven segment-level captioning, and dual-metric quality grading to overcome the absence of localized metadata in raw datasets. Furthermore, we propose a semantic-aware Variational Autoencoder fine-tuning strategy that explicitly distills foundational LeadSheet…
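The three pipeline stages named in the abstract (structural segmentation, segment-level captioning, dual-metric quality grading) can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `Segment` fields, the `grade_segment` thresholds, the rule of grading by the weaker of the two scores, and the stub captioner and scorer. The abstract does not say which two metrics are used or how they are combined, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One structural segment of a track (all fields are hypothetical)."""
    start: float        # segment start, in seconds
    end: float          # segment end, in seconds
    label: str          # structural label such as "verse" or "chorus"
    caption: str = ""   # segment-level text description
    grade: str = ""     # quality grade assigned by the pipeline


def grade_segment(score_a: float, score_b: float,
                  hi: float = 0.8, lo: float = 0.5) -> str:
    """Map two quality scores in [0, 1] to a grade.

    Assumption: a segment is only as good as its weaker metric,
    so the grade is decided by the minimum of the two scores.
    """
    worst = min(score_a, score_b)
    if worst >= hi:
        return "A"
    if worst >= lo:
        return "B"
    return "C"


def run_pipeline(
    boundaries: List[Tuple[float, float, str]],
    captioner: Callable[[Segment], str],
    scorer: Callable[[Segment], Tuple[float, float]],
) -> List[Segment]:
    """Segment -> caption -> grade, using caller-supplied models."""
    segments = []
    for start, end, label in boundaries:
        seg = Segment(start, end, label)
        seg.caption = captioner(seg)        # stands in for an audio-language model
        score_a, score_b = scorer(seg)      # stands in for the two quality metrics
        seg.grade = grade_segment(score_a, score_b)
        segments.append(seg)
    return segments


if __name__ == "__main__":
    # Stub models so the sketch runs without any audio or ML dependencies.
    demo = run_pipeline(
        boundaries=[(0.0, 15.0, "intro"), (15.0, 45.0, "verse")],
        captioner=lambda s: f"{s.label} section, {s.end - s.start:.0f}s",
        scorer=lambda s: (0.9, 0.6) if s.label == "verse" else (0.9, 0.85),
    )
    for seg in demo:
        print(seg.label, seg.grade, seg.caption)
```

In a sketch like this, the graded segments would then be filtered or weighted by grade before training; which policy the paper actually uses is not stated in the abstract.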
