Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training
Yoshinori Nomura

TL;DR
Pre-training Transformers on music before language tasks accelerates language learning and improves performance, with benefits depending on model capacity and data volume.
Contribution
This work demonstrates that structured human creative outputs like music can serve as effective pre-training data for small language models, enhancing learning efficiency.
Findings
Music pre-training yields a 17.5% perplexity reduction over random initialization.
The pipeline converges faster and to a lower loss across multiple seeds.
Optimal data volume for pre-training depends on model capacity.
Abstract
We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music → poetry → prose -- yields a perplexity improvement over random initialization (5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau, with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ( advantage of larger…
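The abstract reports results as relative perplexity gaps. As a reading aid, the sketch below shows how such a gap is commonly derived from mean cross-entropy losses. It assumes losses are measured in nats per token. The function names and the two loss values are hypothetical illustrations, not figures or code from the paper.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll_nats)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


# Hypothetical validation losses, chosen only to illustrate the arithmetic.
baseline_loss = 4.00   # assumed: model trained from random initialization
pipeline_loss = 3.90   # assumed: model pre-trained on earlier stages first

baseline_ppl = perplexity(baseline_loss)
pipeline_ppl = perplexity(pipeline_loss)
reduction = relative_reduction(baseline_ppl, pipeline_ppl)

print(f"baseline perplexity: {baseline_ppl:.2f}")
print(f"pipeline perplexity: {pipeline_ppl:.2f}")
print(f"relative perplexity reduction: {reduction:.1%}")
```

Because perplexity is exponential in the loss, the relative reduction depends only on the loss difference: it equals 1 - exp(-(baseline_loss - pipeline_loss)). A fixed loss gap therefore gives the same percentage improvement whatever the absolute loss level.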
