ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Khanh Chi Le, Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang

TL;DR
ScholaWrite is a dataset capturing the end-to-end scholarly writing process, with keystroke recordings and fine-grained annotations of writing intentions, intended to support AI writing assistants aligned with scientists' cognitive workflows.
Contribution
The paper introduces an end-to-end scholarly writing dataset with fine-grained intention annotations and keystroke recordings collected on Overleaf, enabling closer study and support of the scholarly writing process.
Findings
Collected nearly 62K text changes over four months from five computer science preprints
Identified gaps between human writing and LLM capabilities
Provided insights into micro-dynamics of scholarly writing
Abstract
Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes LaTeX-based edits from five computer science preprints, capturing nearly 62K text…
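To make the idea of a "text change" concrete, here is a minimal sketch of how edits between two consecutive document snapshots could be represented and derived. Everything in it is an assumption for illustration: the `TextChange` record, its field names, and the `diff_snapshots` helper are hypothetical and are not ScholaWrite's actual schema or the extension's recording mechanism; Python's standard `difflib` stands in for whatever diffing the authors use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class TextChange:
    """One edit between two consecutive snapshots (hypothetical schema)."""
    op: str                 # "insert", "delete", or "replace" (difflib opcode tags)
    position: int           # character offset in the earlier snapshot
    removed: str            # text removed from the earlier snapshot
    added: str              # text added in the later snapshot
    intention: Optional[str] = None  # annotation label, left empty here


def diff_snapshots(before: str, after: str) -> List[TextChange]:
    """Derive character-level changes between two snapshots with difflib."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        changes.append(TextChange(tag, i1, before[i1:i2], after[j1:j2]))
    return changes


if __name__ == "__main__":
    for change in diff_snapshots("We present a dataset.",
                                 "We present a new dataset."):
        print(change)
```

In a record like this, an annotator (or a classifier) would later fill in `intention` for each change; the label set itself is defined by the paper and is not reproduced here.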
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
