Training Language Models via Neural Cellular Automata
Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal

TL;DR
This paper uses neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training large language models (training on synthetic data first, then on natural language), improving downstream performance and convergence speed while reducing reliance on natural language data.
Contribution
It demonstrates that pre-pre-training on synthetic NCA data can outperform pre-pre-training on natural language in certain settings, offering a controllable and scalable alternative.
Findings
Pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6%.
NCA pre-pre-training accelerates convergence by up to 1.6x.
The 164M-token NCA setup even outperforms pre-pre-training on 1.6B tokens of natural language data.
Abstract
Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural…
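To make the idea of NCA-generated token data concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the NCA architecture, grid size, state count, or tokenization, so everything here is an assumption for illustration: a randomly initialized two-layer update rule over a 3x3 neighborhood, a small toroidal grid of discrete cell states, a deterministic argmax update, and a row-major flattening of each frame into a token sequence. The names `make_nca`, `step`, and `rollout_tokens` are hypothetical.

```python
import numpy as np


def make_nca(num_states, hidden, rng):
    """Sample random weights for a tiny update rule (hypothetical architecture)."""
    w1 = rng.normal(0.0, 1.0, size=(9 * num_states, hidden))
    w2 = rng.normal(0.0, 1.0, size=(hidden, num_states))
    return w1, w2


def step(grid, params, num_states):
    """Apply one NCA update: each cell reads its 3x3 neighborhood (with wraparound)."""
    w1, w2 = params
    onehot = np.eye(num_states)[grid]  # shape (H, W, S)
    # Stack the one-hot states of the 9 neighbors along the feature axis.
    neighbors = [
        np.roll(onehot, shift=(dy, dx), axis=(0, 1))
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    ]
    feats = np.concatenate(neighbors, axis=-1)  # shape (H, W, 9 * S)
    logits = np.tanh(feats @ w1) @ w2           # shape (H, W, S)
    # Assumption: deterministic argmax update; a sampled update is equally plausible.
    return logits.argmax(axis=-1)


def rollout_tokens(seed, num_states=8, size=8, steps=16, hidden=16):
    """Roll out one randomly sampled rule and flatten all frames into a 1-D token array."""
    rng = np.random.default_rng(seed)
    params = make_nca(num_states, hidden, rng)
    grid = rng.integers(0, num_states, size=(size, size))
    frames = [grid]
    for _ in range(steps):
        grid = step(grid, params, num_states)
        frames.append(grid)
    # Assumption: tokens are cell states read frame by frame in row-major order.
    return np.stack(frames).reshape(-1)


tokens = rollout_tokens(seed=0)
```

Under these assumptions, each seed defines a different rule and initial grid, so a corpus can be produced by iterating over seeds; the vocabulary size equals `num_states`, and each rollout yields `(steps + 1) * size * size` tokens.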
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
