Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan; Diba Hashemi; Sai Praneeth Karimireddy; Martin Jaggi

arXiv:2511.21613 · cs.CL · April 21, 2026

TL;DR

This paper explores diverse metadata types beyond URLs to enhance Large Language Model pretraining efficiency, demonstrating that fine-grained, quality-related metadata and auxiliary tasks can accelerate training.

Contribution

It introduces new metadata integration methods, including appending and learnable meta-tokens, and provides analysis of how metadata influences learning in LLM pretraining.

Findings

01  Fine-grained quality indicators accelerate pretraining.

02  Metadata appending with auxiliary tasks improves training speed.

03  Learnable meta-tokens induce quality-aware latent structures.

Abstract

Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent…
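
The three placements named in the abstract (prepending, appending as an auxiliary prediction task, and learnable meta-tokens with masked loss) can be pictured as different ways of laying out one training sequence and its loss mask. The sketch below is illustrative only and is not taken from the paper: the function names, the string tokens, the `<meta_i>` placeholders, and the choice to exclude prepended metadata from the loss are all assumptions made for the example.

```python
from typing import List, Tuple


def build_example(doc: List[str], meta: List[str], mode: str,
                  n_meta_tokens: int = 2) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask); a mask value of 1 means the position is trained on.

    Hypothetical helper: names and masking choices are assumptions, not the paper's code.
    """
    if mode == "prepend":
        # Metadata is placed before the document as conditioning context.
        # Assumption: metadata positions are excluded from the loss.
        return meta + doc, [0] * len(meta) + [1] * len(doc)
    if mode == "append":
        # Metadata follows the document and stays in the loss, so predicting
        # it acts as an auxiliary task.
        return doc + meta, [1] * (len(doc) + len(meta))
    if mode == "meta_tokens":
        # Placeholder slots stand in for learnable meta-tokens; their
        # positions are masked out of the loss.
        slots = [f"<meta_{i}>" for i in range(n_meta_tokens)]
        return slots + doc, [0] * n_meta_tokens + [1] * len(doc)
    raise ValueError(f"unknown mode: {mode}")


def masked_mean(losses: List[float], mask: List[int]) -> float:
    """Average per-token losses over the unmasked positions only."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


doc = ["the", "cat", "sat"]
meta = ["quality=high"]
for mode in ("prepend", "append", "meta_tokens"):
    tokens, mask = build_example(doc, meta, mode)
    print(mode, tokens, mask)
```

In a real training loop the mask would be applied to per-token cross-entropy before averaging, which is what `masked_mean` imitates on plain floats.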

Videos

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining · SlidesLive