What is the Best Sequence Length for BabyLM?
Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery

TL;DR
This paper investigates how sequence length affects the pretraining of BabyLM language models, finding that the optimal length varies by task and architecture, with longer sequences aiding morphological analogical reasoning.
Contribution
It provides an empirical analysis of sequence length effects on BabyLM pretraining, highlighting task-dependent optimal lengths and guiding future model training choices.
Findings
Longer sequences often improve performance, but the optimal length depends on both task and architecture.
Shorter sequences suffice for grammatical tasks.
Longer contexts benefit morphological analogical reasoning tasks.
Abstract
Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
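To make the quantity under study concrete, here is a minimal sketch of how a tokenized corpus might be split into fixed-length training sequences, and how the number of sequences per batch shrinks as the sequence length grows when the tokens per batch are held constant. It is illustrative only: the function names, the drop-the-remainder packing strategy, and the 16,384-token batch are assumptions, not the authors' preprocessing or their definition of a fixed compute budget.

```python
def pack_into_sequences(token_ids, seq_len):
    """Split a flat token stream into non-overlapping chunks of seq_len tokens.

    Assumption: the trailing partial chunk is dropped rather than padded.
    """
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def sequences_per_batch(tokens_per_batch, seq_len):
    """Number of sequences that fit in a batch with a fixed token count."""
    return tokens_per_batch // seq_len


if __name__ == "__main__":
    stream = list(range(10_000))  # stand-in for a tokenized corpus
    for seq_len in (64, 256, 1024):  # hypothetical lengths to compare
        chunks = pack_into_sequences(stream, seq_len)
        print(seq_len, len(chunks), sequences_per_batch(16_384, seq_len))
```

Under this assumption, a shorter sequence length gives the model more, shorter examples per step, while a longer one gives fewer examples that each carry more context.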
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Language Development and Disorders
