Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation
Botao Yu, Peiling Lu, Rui Wang, Wei Hu, Xu Tan, Wei Ye, Shikun Zhang, Tao Qin, Tie-Yan Liu

TL;DR
Museformer introduces a novel Transformer architecture with combined fine- and coarse-grained attention mechanisms, enabling efficient modeling of long music sequences and capturing musical structures more effectively.
Contribution
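The paper proposes Museformer, a Transformer variant that combines fine-grained attention (to every token of the bars most relevant to musical structure) with coarse-grained attention (to per-bar summaries of the remaining bars), improving long-sequence music generation and structural modeling.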
Findings
Models music sequences more than 3 times longer than full-attention models can handle
Generates high-quality music with better structural coherence
Outperforms existing models in objective and subjective evaluations
Abstract
Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because the full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and the existing models have shortcomings in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the…
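The attention pattern described in the abstract can be sketched as a pair of boolean masks. This is a minimal illustration, not the authors' implementation: the function name `build_masks`, the `bar_ids` input, and the mask layout are hypothetical, and the sketch assumes that a token also attends causally to tokens in its own bar and that each bar has exactly one summary slot.

```python
from typing import List, Sequence, Tuple


def build_masks(
    bar_ids: Sequence[int],
    fine_offsets: Sequence[int] = (1, 2, 4, 8),
) -> Tuple[List[List[bool]], List[List[bool]]]:
    """Build illustrative fine- and coarse-grained attention masks.

    bar_ids[i] is the (non-decreasing, 0-based) bar index of token i.
    fine[i][j]   is True if token i attends directly to token j.
    coarse[i][b] is True if token i attends to the summary of bar b.
    """
    n_tokens = len(bar_ids)
    n_bars = max(bar_ids) + 1 if bar_ids else 0
    offsets = set(fine_offsets)

    fine = [[False] * n_tokens for _ in range(n_tokens)]
    coarse = [[False] * n_bars for _ in range(n_tokens)]

    for i, bar_i in enumerate(bar_ids):
        # Fine-grained: causal attention to tokens in the same bar
        # (an assumption of this sketch) and in structure-related bars.
        for j in range(i + 1):
            distance = bar_i - bar_ids[j]
            if distance == 0 or distance in offsets:
                fine[i][j] = True
        # Coarse-grained: every other earlier bar is seen only via its summary.
        for b in range(bar_i):
            if (bar_i - b) not in offsets:
                coarse[i][b] = True

    return fine, coarse


# Four bars of two tokens each; only the previous 1st and 2nd bars are "fine".
fine, coarse = build_masks([0, 0, 1, 1, 2, 2, 3, 3], fine_offsets=(1, 2))
```

In this toy input, the first token of bar 3 attends token-by-token to bars 1 and 2 and to itself, and reaches bar 0 only through its summary. That is how the per-token attention cost stays below that of full attention.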
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Neuroscience and Music Perception
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding
