Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen

TL;DR
Pre-pretraining language models on formal languages that mimic the dependency structures of natural language improves their linguistic generalization and training efficiency. Hierarchical formal languages transfer best, and mechanistic evidence supports this transfer.
Contribution
This study demonstrates that pre-pretraining on hierarchical formal languages improves natural language understanding and transfer efficiency in transformer models, highlighting the importance of dependency structures.
Findings
Formal languages capturing hierarchical dependencies improve natural language loss.
Pre-pretraining lets the model reach the same natural language loss with a 33% smaller training token budget.
Attention heads acquired during formal language pre-pretraining remain crucial for performance on syntactic tasks.
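To make "hierarchical dependencies" concrete, here is a minimal sketch of sampling strings from a nested-bracket (Dyck-style) language, a standard textbook example of a formal language whose dependencies nest the way clauses do in natural language. This is not the authors' data pipeline: the bracket inventory `PAIRS`, the function names, and the 0.5 open/close probability are all assumptions chosen for illustration.

```python
import random

# Hypothetical bracket inventory; the number of bracket types is an assumption.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def sample_dyck(n_pairs, rng):
    """Sample a well-nested bracket string containing n_pairs matched pairs."""
    out, stack = [], []
    opens_left = n_pairs
    while opens_left > 0 or stack:
        # Open a bracket if any remain and either nothing is currently open
        # or a coin flip says so; otherwise close the most recent open bracket.
        if opens_left > 0 and (not stack or rng.random() < 0.5):
            tok = rng.choice(list(PAIRS))
            stack.append(tok)
            out.append(tok)
            opens_left -= 1
        else:
            out.append(PAIRS[stack.pop()])
    return "".join(out)


def is_well_nested(s):
    """Check that every closing bracket matches the most recent open one."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack
```

Each closing bracket depends on an opening bracket that may be arbitrarily far back, and those dependencies never cross. A corpus for pre-pretraining could, under these assumptions, be built by calling `sample_dyck` repeatedly with a seeded `random.Random` and tokenizing each bracket as one symbol.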
Abstract
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language…
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
