Pre-training LLMs using human-like development data corpus

Khushi Bhardwaj; Raj Sanjay Shah; Sashank Varma

arXiv:2311.04666 · cs.CL · January 11, 2024 · 1 citation

TL;DR

This paper explores pre-training large language models using a human-like, limited data corpus comparable to what children are exposed to, evaluating their ability to learn contextual representations.

Contribution

It introduces a methodology for pre-training LLMs on small, human-like datasets and provides baseline results and analysis of training robustness and replicability.

Findings

01 LLMs can learn meaningful representations from limited, child-like data.

02 Performance varies with different architectures and training epochs.

03 Training robustness is affected by hyperparameter choices.

Abstract

Pre-trained Large Language Models (LLMs) have shown success in a diverse set of language inference and understanding tasks. The pre-training stage of LLMs uses a large corpus of raw textual data. The BabyLM shared task compares LLM pre-training to human language acquisition, where the number of tokens seen by 13-year-old children is orders of magnitude smaller than the number of tokens seen by LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines with different architectures, evaluate changes in performance across epochs, and report pre-training metrics for the strict-small and strict tracks of the task. We also try to loosely replicate the RoBERTa baseline given by the task organizers to observe the training robustness to hyperparameter…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training · Attention Is All You Need · Linear Layer · Dense Connections · Adam · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Dropout