BandCondiNet: Parallel Transformers-based Conditional Popular Music Generation with Multi-View Features
Jing Luo, Xinyu Yang, Dorien Herremans

TL;DR
BandCondiNet is a parallel Transformer-based model for conditional multitrack music generation that improves fidelity, structure, and inter-track harmony using multi-view features and specialized modules, outperforming existing models.
Contribution
The paper introduces BandCondiNet, a novel parallel Transformer architecture with multi-view features and modules for enhanced multitrack music generation, addressing fidelity and harmony challenges.
Findings
Outperforms existing models on multiple metrics covering both fidelity and speed.
Achieves superior subjective quality on longer datasets.
Effectively models musical structure and inter-track harmony.
Abstract
Conditional music generation offers significant advantages in terms of user convenience and control, presenting great potential in AI-generated content research. However, building conditional generative systems for multitrack popular songs presents three primary challenges: insufficient fidelity of input conditions, poor structural modeling, and inadequate inter-track harmony learning in generative models. To address these issues, we propose BandCondiNet, a conditional model based on parallel Transformers, designed to process the multiple music sequences and generate high-quality multitrack samples. Specifically, we propose multi-view features across time and instruments as high-fidelity conditions. Moreover, we propose two specialized modules for BandCondiNet: Structure Enhanced Attention (SEA) to strengthen the musical structure, and Cross-Track Transformer (CTT) to enhance…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Human Motion and Animation
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
