X-raying the arXiv: A Large-Scale Analysis of arXiv Submissions' Source Files
Giovanni Apruzzese, Aurore Fass

TL;DR
This study analyzes the source files of roughly 600,000 arXiv submissions from 2015 to 2025, finds that a substantial share of the uploaded data is redundant or potentially sensitive, and proposes tools to improve data hygiene.
Contribution
The paper provides the first large-scale longitudinal analysis of arXiv source files, quantifies the redundant data they contain, and introduces automated detection tools for better data management.
Findings
On average, 27% of the data in each submission is not needed to produce the PDF
More than 580 GB of redundant data identified across the dataset
Presence of offensive and sensitive information
Abstract
arXiv is the largest open-access repository for scientific literature. When submitting a paper, authors upload the manuscript's source files, from which the final PDF is compiled. These source files are also publicly downloadable, potentially exposing data unrelated to the published paper (such as figures, documents, or comments) that may unintentionally reveal confidential information or simply waste storage space. We thus ask ourselves: "What can be found within the source files of arXiv submissions?" We present a longitudinal analysis of ~600,000 submissions that appeared on arXiv between 2015 and 2025. For each submission, we examine the uploaded source files to quantify and characterize data not required for producing the respective PDF. On average, 27% of the data in each submission are unnecessary, totaling >580 GB of redundant content across our dataset. Qualitative inspection…
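To make the "share of unnecessary data" metric concrete, here is a minimal Python sketch of how such a share could be computed for a single submission. It is not the authors' tooling, which this excerpt does not describe. It assumes that the set of files actually read during compilation is available from a TeX recorder log (the `.fls` file written by `pdflatex -recorder`, whose `INPUT` lines name every file the engine opened). The function names `used_files_from_fls` and `unnecessary_share`, and the example file names and sizes, are hypothetical.

```python
import os


def used_files_from_fls(fls_text: str) -> set[str]:
    """Return the relative paths named on INPUT lines of a .fls recorder log.

    Absolute paths (for example classes from the TeX distribution) are
    skipped, because they are not part of the uploaded submission.
    """
    used = set()
    for line in fls_text.splitlines():
        if line.startswith("INPUT "):
            path = line[len("INPUT "):].strip()
            if not os.path.isabs(path):
                used.add(os.path.normpath(path))
    return used


def unnecessary_share(sizes: dict[str, int], used: set[str]) -> float:
    """Fraction of uploaded bytes in files that compilation never read.

    `sizes` maps each uploaded file (relative path) to its size in bytes.
    """
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    unused = sum(
        size for name, size in sizes.items()
        if os.path.normpath(name) not in used
    )
    return unused / total


# Hypothetical submission: two files are read by TeX, one leftover is not.
example_sizes = {"main.tex": 700, "fig/plot.pdf": 30, "old_draft.docx": 270}
example_fls = (
    "PWD /tmp/submission\n"
    "INPUT ./main.tex\n"
    "INPUT ./fig/plot.pdf\n"
    "INPUT /usr/share/texmf/article.cls\n"
    "OUTPUT main.pdf\n"
)
example_share = unnecessary_share(example_sizes, used_files_from_fls(example_fls))
print(f"{example_share:.0%} of the uploaded bytes were never read")
```

This sketch only counts whole files that compilation never read. Content inside used files that does not reach the PDF, such as commented-out text in a `.tex` file, would need a separate check.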
Taxonomy
Topics: Academic Publishing and Open Access · Research Data Management Practices · Academic integrity and plagiarism
