What is the Best Sequence Length for BabyLM?

Suchir Salhan; Richard Diehl Martinez; Zébulon Goriely; Paula Buttery

arXiv:2510.19493·cs.CL·October 23, 2025

TL;DR

This paper investigates how sequence length affects BabyLM pretraining. Longer sequences often help, but the optimal length depends on both task and architecture: shorter sequences suffice for grammatical generalization, while longer contexts benefit morphological analogical reasoning.

Contribution

It provides an empirical analysis of sequence length effects on BabyLM pretraining, comparing 125M-parameter Mamba and OPT models trained on 100M words under fixed compute budgets, and shows that the optimal length depends on both task and architecture.

Findings

01

Longer sequences often improve performance, but the optimal length depends on task and architecture.

02

Shorter sequences suffice for grammatical generalization tasks.

03

Longer contexts benefit morphological analogical reasoning tasks.

Abstract

Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
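
To make the sequence-length trade-off concrete, the following is a minimal sketch of how a flat token stream might be cut into fixed-length training sequences, and how the number of sequences per batch changes when the tokens per batch are held constant. The helper names (`chunk_tokens`, `steps_for_budget`) and all numeric values are hypothetical and chosen for illustration; they are not taken from the paper. Holding tokens per batch fixed is only one simple way to approximate a fixed compute budget and may differ from the authors' setup.

```python
def chunk_tokens(token_ids, seq_len):
    """Split a flat token stream into non-overlapping windows of seq_len tokens.

    Any trailing tokens that do not fill a whole window are dropped.
    """
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def steps_for_budget(token_budget, seq_len, tokens_per_batch):
    """Return (sequences per batch, optimizer steps) for a fixed token budget.

    Assumption: every batch holds the same number of tokens regardless of
    seq_len, so shorter sequences mean more sequences per batch.
    """
    seqs_per_batch = tokens_per_batch // seq_len
    steps = token_budget // (seqs_per_batch * seq_len)
    return seqs_per_batch, steps


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    token_budget = 1_000_000
    tokens_per_batch = 16_384
    for seq_len in (64, 256, 1024):
        seqs, steps = steps_for_budget(token_budget, seq_len, tokens_per_batch)
        print(f"seq_len={seq_len}: {seqs} sequences/batch, {steps} steps")
```

Under this assumption the number of optimizer steps stays the same across sequence lengths while the number of sequences per batch shrinks as the length grows. The sketch counts tokens only and ignores any difference in per-token cost between architectures at longer lengths.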

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Language Development and Disorders