TL;DR
This paper explores diverse metadata types beyond URLs to enhance Large Language Model pretraining efficiency, demonstrating that fine-grained, quality-related metadata and auxiliary tasks can accelerate training.
Contribution
It introduces new metadata integration methods, including metadata appending and learnable meta-tokens, and analyzes how metadata influences learning during LLM pretraining.
Findings
Fine-grained quality indicators accelerate pretraining.
Metadata appending with auxiliary tasks improves training speed.
Learnable meta-tokens induce quality-aware latent structures.
Abstract
Incorporating metadata into Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent…
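To make the difference between prepending and appending concrete, here is a minimal sketch of how a training example and its loss mask might be assembled in each mode. This is not the paper's implementation. The function names `build_example` and `masked_mean_loss` are hypothetical, as is the choice to mask the loss on prepended metadata tokens (so they act only as conditioning context) while keeping the loss on appended metadata tokens (so that predicting them serves as the auxiliary task).

```python
from typing import List, Tuple


def build_example(doc: List[int], meta: List[int], mode: str) -> Tuple[List[int], List[int]]:
    """Combine document and metadata token ids; return (tokens, loss_mask).

    Assumption (not confirmed by the abstract): in "prepend" mode the
    metadata tokens are context only, so their loss is masked out (0).
    In "append" mode the metadata tokens are prediction targets, so the
    loss is kept on them (1), which is what makes them an auxiliary task.
    """
    if mode == "prepend":
        tokens = meta + doc
        loss_mask = [0] * len(meta) + [1] * len(doc)
    elif mode == "append":
        tokens = doc + meta
        loss_mask = [1] * len(tokens)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return tokens, loss_mask


def masked_mean_loss(per_token_loss: List[float], loss_mask: List[int]) -> float:
    """Average the per-token losses over unmasked positions only."""
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0
    total = sum(loss * m for loss, m in zip(per_token_loss, loss_mask))
    return total / kept


# Illustrative token ids: a 3-token document and a 2-token metadata tag.
doc_ids = [10, 11, 12]
meta_ids = [1, 2]
prepended = build_example(doc_ids, meta_ids, "prepend")
appended = build_example(doc_ids, meta_ids, "append")
```

In this sketch the only difference between the two modes is token order and which positions contribute to the loss, so the same training loop could serve both.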
