Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Yoshinori Nomura

arXiv:2604.21265·cs.CL·April 24, 2026

TL;DR

Pre-training Transformers on music before language tasks accelerates language learning and improves performance, with benefits depending on model capacity and data volume.

Contribution

This work demonstrates that structured human creative outputs like music can serve as effective pre-training data for small language models, enhancing learning efficiency.

Findings

01

A music → poetry → prose pre-training pipeline yields a 17.5% perplexity improvement over random initialization (p < 0.001, 5 seeds).

02

The pipeline converges faster and to a lower loss in every run; at d = 64, a 5.5% gap persists at plateau (p = 0.017, 5 seeds).

03

Optimal data volume for pre-training depends on model capacity.

Abstract

We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d = 64$, multi-seed validation (5 seeds) shows a persistent $5.5\%$ gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger…
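To put the headline number on the scale that training curves usually use, a relative perplexity reduction can be converted into a cross-entropy gap. The sketch below assumes that perplexity is the exponential of the mean per-token cross-entropy in nats and that the 17.5% figure is a relative reduction in perplexity; the listing does not confirm either convention, and the function names are illustrative rather than taken from the paper.

```python
import math


def ppl_reduction_to_loss_gap(reduction: float) -> float:
    """Cross-entropy gap (nats per token) implied by a relative perplexity reduction.

    Assumes perplexity = exp(mean cross-entropy in nats), so
    ppl_new / ppl_old = 1 - reduction  =>  loss_old - loss_new = -ln(1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return -math.log(1.0 - reduction)


def loss_gap_to_ppl_reduction(gap: float) -> float:
    """Inverse mapping: relative perplexity reduction implied by a loss gap in nats."""
    return 1.0 - math.exp(-gap)


if __name__ == "__main__":
    gap = ppl_reduction_to_loss_gap(0.175)
    print(f"17.5% lower perplexity ~ {gap:.3f} nats per token lower loss")
```

Under these assumptions, a 17.5% perplexity improvement corresponds to a loss roughly 0.19 nats per token below the random-initialization baseline.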
