arXivEdits: Understanding the Human Revision Process in Scientific Writing
Chao Jiang, Wei Xu, Samuel Stevens

TL;DR
This paper introduces arXivEdits, a comprehensive dataset and computational framework for analyzing the human revision process in scientific writing, covering full papers and detailed edit intentions.
Contribution
It provides the first complete corpus with sentence alignments and edit annotations across full scientific papers, along with automatic methods for fine-grained revision analysis.
Findings
A neural CRF model achieves 93.8 F1 on sentence alignment.
The proposed span alignment method outperforms diff algorithms.
The intent classifier attains 78.9 F1 on edit intention detection.
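To make the second finding concrete, here is a minimal sketch of the kind of diff-style baseline that a span alignment method would be compared against. It is not the authors' code: it assumes whitespace tokenization and uses Python's standard `difflib`, and the function name `token_diff_edits` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def token_diff_edits(old_sentence, new_sentence):
    """Extract span-level edits between two sentence versions with a token diff.

    Assumption: tokens are whitespace-separated. This is a generic diff
    baseline, not the span alignment method proposed in the paper.
    """
    old_tokens = old_sentence.split()
    new_tokens = new_sentence.split()
    # autojunk=False disables difflib's heuristic that ignores frequent tokens.
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, not an edit
        # tag is one of "replace", "delete", "insert"
        edits.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return edits


# Hypothetical sentence pair from two versions of a paper.
old = "We propose a new method for text revision ."
new = "We introduce a novel method for revision ."
edits = token_diff_edits(old, new)
for tag, before, after in edits:
    print(f"{tag:8s} {before!r} -> {after!r}")
```

A diff of this kind only matches identical tokens in order, so it cannot link paraphrased or reordered spans, which is one plausible reason a learned span aligner could do better.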
Abstract
Scientific publications are the primary means to communicate research discoveries, where the writing quality is of crucial importance. However, prior work studying the human editing process in this domain mainly focused on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce arXivEdits, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple versions of revision, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers. To scale up the analysis, we also develop automatic methods to extract revision at document-, sentence-, and…
Code & Models
- 🤗 chaojiang06/arxiv-sentence-alignment (model · 4 dl)
- 🤗 chaojiang06/arXivEdits-intention-classifier-T5-large-coarse (model · 5 dl)
- 🤗 chaojiang06/arXivEdits-intention-classifier-T5-large-fine-grained (model · 6 dl)
- 🤗 chaojiang06/arXivEdits-intention-classifier-T5-base-coarse (model · 5 dl)
- 🤗 chaojiang06/arXivEdits-intention-classifier-T5-base-fine-grained (model · 6 dl)
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
Methods: Conditional Random Field
