SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation
Hongrui Wang, Fan Zhang, Zhiyuan Yu, Ziya Zhou, Xi Chen, Can Yang, Yang Wang

TL;DR
SyncTrack is a novel multi-track music generation model that emphasizes rhythmic stability and synchronization, using shared and specific modules with attention mechanisms and instrument priors, validated by new rhythmic metrics.
Contribution
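Introduces SyncTrack, a new architecture for multi-track music generation that improves rhythmic stability and synchronization through track-shared modules with cross-track attention, track-specific modules with learnable instrument priors, and three new rhythm-focused evaluation metrics.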
Findings
Significantly improves rhythmic consistency in generated music.
Enhances multi-track harmony with synchronized rhythms.
Introduces three novel metrics for rhythmic evaluation.
Abstract
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other…
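The idea of a track-shared module that synchronizes rhythmic information across tracks can be illustrated with a toy example. The sketch below is not the SyncTrack architecture: it applies a single generic scaled dot-product attention across tracks at each time step, with identity query/key/value projections, whereas the abstract describes two cross-track attention mechanisms whose details are not given here. The function names `cross_track_attention` and `synchronize` and the `[track][time][feature]` layout are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_track_attention(frame):
    """Mix information across tracks at one time step.

    frame: list of per-track feature vectors (one vector per track).
    Assumption: queries, keys, and values are the raw features
    (identity projections); a real model would learn these projections.
    """
    dim = len(frame[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in frame:
        # Similarity of this track to every track, including itself.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frame]
        weights = softmax(scores)
        # Weighted average of all tracks' features.
        mixed.append(
            [sum(w * value[i] for w, value in zip(weights, frame)) for i in range(dim)]
        )
    return mixed


def synchronize(tracks):
    """Apply cross-track attention independently at every time step.

    tracks: nested list indexed as [track][time][feature].
    Returns a nested list with the same shape.
    """
    n_tracks, n_steps = len(tracks), len(tracks[0])
    out = [[None] * n_steps for _ in range(n_tracks)]
    for t in range(n_steps):
        frame = [tracks[k][t] for k in range(n_tracks)]
        for k, vec in enumerate(cross_track_attention(frame)):
            out[k][t] = vec
    return out


# Toy input: 2 tracks, 2 time steps, 2 features per step.
toy_tracks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
toy_out = synchronize(toy_tracks)
```

Because every track attends to all tracks at the same time step, each output vector is a convex combination of the per-track features at that step, which is one simple way a shared temporal pattern could propagate between tracks.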
Peer Reviews
Decision: ICLR 2026 Poster
1. The modules designed by the authors are highly aligned with the current needs of music generation technologies, and their effectiveness is demonstrated from both temporal and spatial perspectives. 2. The authors propose three novel metrics to evaluate the quality of multi-track music generation, addressing gaps in existing evaluation methods. 3. The experiments conducted by the authors are thorough and provide strong evidence supporting the effectiveness of the proposed approach.
1. The authors do not provide a clear explanation of the relationship between tracks, timbre, and rhythm in the music generation task, which may cause some difficulty in understanding the proposed approach. 2. The ablation studies are limited and the results are relatively average, making them less convincing in demonstrating the contributions of specific components. 3. The paper does not include a user study, leaving the subjective evaluation of the generated music quality unaddressed.
Very clear problem–solution alignment. The paper does not vaguely say “quality is low”; it pinpoints “rhythmic stability and cross-track synchronization are not modeled or evaluated,” and the proposed modules map 1:1 to that diagnosis. Metrics with reuse potential. IRS/CBS/CBD are defined in a way that any multi-track model with separated stems can use; this makes the paper more than “a new model,” it is also “a more appropriate test.” Realistic multi-track setup. Using four typical producti
Beat-detection dependency. All three rhythm metrics assume that beat/onset tracking works reasonably well. For highly expressive, weakly pulsed, or rubato multi-track music, tracking can fail, which would make IRS/CBS/CBD less reliable. The paper tests robustness to hyperparameters, but not to style changes; adding such a test would make the metric story stronger. Dataset narrowness. Most results are on the Slakh2100 four-track configuration, which is clean and well-aligned. It is unclear how w
1. The paper introduces clear and reproducible rhythm-centric evaluation metrics (IRS/CBS/CBD) and provides robustness analyses, making these metrics potentially useful for broader community adoption. 2. The results demonstrate good alignment between objective and subjective evaluations: FAD improvements are consistent with listening preferences, and component-level ablations further validate the architectural design.
1. The current setup (10.24 s at 16 kHz) constrains both long-range musical structure and high-frequency detail. Including evaluations on longer segments (≥30–60 s) or full-song contexts would strengthen the claims. 2. Key details of the subjective evaluation, such as the number of participants, votes per sample, confidence intervals or statistical significance, and loudness normalization procedures, should be clearly reported in the main paper.
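To make the reviewer's point about the 10.24 s, 16 kHz setup concrete, the arithmetic can be worked through in a few lines. The clip length and sample rate come from the review; the 120 BPM tempo and 4/4 meter are assumptions chosen only for illustration, and the helper names are hypothetical.

```python
def clip_stats(duration_s, sample_rate_hz):
    """Return (number of samples, Nyquist frequency in Hz) for a clip."""
    n_samples = round(duration_s * sample_rate_hz)
    nyquist_hz = sample_rate_hz / 2  # highest representable frequency
    return n_samples, nyquist_hz


def beats_in_clip(duration_s, tempo_bpm):
    """Number of beats that fit in a clip at a constant tempo."""
    return duration_s * tempo_bpm / 60.0


n_samples, nyquist_hz = clip_stats(10.24, 16_000)
print(n_samples)   # 163840
print(nyquist_hz)  # 8000.0

# Assumed tempo and meter, not taken from the paper.
beats = beats_in_clip(10.24, tempo_bpm=120)
bars = beats / 4  # assuming 4 beats per bar
```

At the assumed tempo the clip holds roughly 20 beats, or about five bars, and nothing above 8 kHz can be represented, which is consistent with the concern about long-range structure and high-frequency detail.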
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Artificial Intelligence in Games
