EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Léane Jourdan; Julien Aubert-Béduchaud; Yannis Chupin; Marah Baccari; Florian Boudin

arXiv:2603.28515 · cs.CL · March 31, 2026

TL;DR

EarlySciRev is a new dataset capturing early-stage scientific revisions from LaTeX source files, enabling research on revision behavior and LLM-assisted scientific writing.

Contribution

The paper introduces a large, validated dataset of authentic early-drafting revisions extracted from arXiv LaTeX sources, filling a gap left by existing resources, which typically expose only final or near-final versions of papers.

Findings

01 Extracts 578k genuine revision pairs from 1.28M candidates

02 Provides a human-annotated benchmark for revision detection

03 Supports research on scientific writing dynamics and revision modeling

Abstract

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a…
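
The extraction idea in the abstract (align commented-out LaTeX text with nearby final text to get candidate revision pairs) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `split_blocks` and `candidate_pairs`, the "adjacent block" notion of nearby, and the similarity thresholds `min_ratio` and `max_ratio` are all assumptions made for illustration, and the LLM-based filtering step is omitted entirely.

```python
import re
from difflib import SequenceMatcher

# A line counts as commented-out when its first non-space character is '%'.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into ('comment' | 'text', content) blocks."""
    blocks, kind, buffer = [], None, []
    for line in latex_source.splitlines():
        match = COMMENT_LINE.match(line)
        content = match.group(1).strip() if match else line.strip()
        if not content:
            line_kind = None  # blank (or empty comment) line closes the block
        elif match:
            line_kind = "comment"
        else:
            line_kind = "text"
        if line_kind != kind and buffer:
            blocks.append((kind, " ".join(buffer)))
            buffer = []
        kind = line_kind
        if line_kind is not None:
            buffer.append(content)
    if buffer:
        blocks.append((kind, " ".join(buffer)))
    return blocks


def candidate_pairs(latex_source, min_ratio=0.5, max_ratio=0.95):
    """Pair each comment block with its most similar adjacent text block.

    The thresholds are illustrative assumptions: too dissimilar suggests an
    unrelated note, too similar suggests a near-verbatim copy.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for index, (kind, content) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        for j in (index - 1, index + 1):
            if 0 <= j < len(blocks) and blocks[j][0] == "text":
                ratio = SequenceMatcher(
                    None, content, blocks[j][1], autojunk=False
                ).ratio()
                if best is None or ratio > best[0]:
                    best = (ratio, blocks[j][1])
        if best and min_ratio <= best[0] <= max_ratio:
            pairs.append((content, best[1]))
    return pairs


sample = r"""
\section{Intro}
% We propose a new dataset of revisions.
We introduce a new dataset of early-stage revisions.

% TODO: add figure
Scientific writing is iterative.
"""

pairs = candidate_pairs(sample)
for draft, final in pairs:
    print("draft:", draft)
    print("final:", final)
```

On this toy input, the commented-out sentence is paired with the reworded sentence that follows it, while the `TODO` note is dropped because it resembles neither neighbour. Character-level similarity is only a cheap stand-in here; it cannot tell a genuine revision from a coincidentally similar comment, which is the role the abstract assigns to LLM-based filtering.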
