X-raying the arXiv: A Large-Scale Analysis of arXiv Submissions' Source Files

Giovanni Apruzzese; Aurore Fass

arXiv:2601.11385 · cs.NI · January 19, 2026

Open Access

TL;DR

This study analyzes the source files of roughly 600,000 arXiv submissions from 2015 to 2025, finds that a substantial share of the uploaded data is redundant and that some of it is potentially sensitive, and proposes tools to improve data hygiene.

Contribution

It provides the first large-scale longitudinal analysis of arXiv source files, quantifies redundant data, and introduces automated detection tools for better data management.

Findings

01. On average, 27% of the data in each submission is not needed to produce the PDF

02. More than 580 GB of redundant data identified across the dataset

03. Presence of offensive and sensitive information

Abstract

arXiv is the largest open-access repository for scientific literature. When submitting a paper, authors upload the manuscript's source files, from which the final PDF is compiled. These source files are also publicly downloadable, potentially exposing data unrelated to the published paper (such as figures, documents, or comments) that may unintentionally reveal confidential information or simply waste storage space. We thus ask ourselves: "What can be found within the source files of arXiv submissions?" We present a longitudinal analysis of ~600,000 submissions that appeared on arXiv between 2015 and 2025. For each submission, we examine the uploaded source files to quantify and characterize data not required for producing the respective PDF. On average, 27% of the data in each submission are unnecessary, totaling >580 GB of redundant content across our dataset. Qualitative inspection…
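To make the idea of "data not required for producing the PDF" concrete, here is a minimal Python sketch of one way such a measurement could be approximated for a single unpacked LaTeX source tree. It is not the authors' tooling: the function names (`referenced_names`, `unused_report`), the default `main.tex` entry point, and the regular expression are all assumptions made for illustration. The heuristic follows `\input`, `\include`, `\includegraphics`, `\bibliography`, and `\addbibresource` references starting from the main file, ignores commented-out lines, and treats every file it never reaches as unused.

```python
import re
from pathlib import Path

# Assumed heuristic: LaTeX commands whose braced argument names a file.
REF_PATTERN = re.compile(
    r"\\(?:includegraphics|input|include|bibliography|addbibresource)"
    r"\*?(?:\[[^\]]*\])?\{([^}]+)\}"
)


def referenced_names(tex):
    """Return file names referenced in LaTeX text, ignoring comments."""
    # Drop everything from an unescaped % to the end of the line.
    uncommented = re.sub(r"(?<!\\)%.*", "", tex)
    names = set()
    for arg in REF_PATTERN.findall(uncommented):
        # \bibliography{a,b} may list several files in one argument.
        names.update(part.strip() for part in arg.split(","))
    return names


def unused_report(root, main_tex="main.tex"):
    """Return (unused files, unused share of bytes) for a source tree.

    Hypothetical helper: walks references from `main_tex` and treats
    every file that is never reached as unnecessary.
    """
    root = Path(root)
    sizes = {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }
    # Index files by full relative path and by path without extension,
    # because LaTeX lets authors omit extensions (e.g. \input{intro}).
    by_key = {}
    for rel in sizes:
        for key in (rel, Path(rel).with_suffix("").as_posix()):
            by_key.setdefault(key, set()).add(rel)

    used, queue = set(), [main_tex]
    while queue:
        rel = queue.pop()
        if rel in used or rel not in sizes:
            continue
        used.add(rel)
        if rel.endswith(".tex"):
            text = (root / rel).read_text(encoding="utf-8", errors="ignore")
            for ref in referenced_names(text):
                queue.extend(by_key.get(ref, ()))

    unused = sorted(set(sizes) - used)
    total = sum(sizes.values())
    share = sum(sizes[r] for r in unused) / total if total else 0.0
    return unused, share
```

Calling `unused_report("path/to/submission")` returns the sorted list of unreached files and their share of the total bytes. The sketch is deliberately conservative in one direction (if two files share a name apart from their extension, both count as used) and naive in others: it does not handle `\graphicspath`, class or style files, or files loaded through custom macros, so a real analysis would need considerably more care.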

Taxonomy

Topics: Academic Publishing and Open Access · Research Data Management Practices · Academic Integrity and Plagiarism