Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints
Jan Pennekamp, Johannes Lohmöller, David Schütte, Joscha Loos, Martin Henze

TL;DR
This paper systematically measures unintentional information disclosure in arXiv source files, finds hidden data to be widespread, evaluates existing cleaning tools, and introduces a new tool to mitigate the leaks.
Contribution
The paper provides the first comprehensive analysis of sensitive information leaks in arXiv source files and proposes an improved tool for removing them.
Findings
Nearly all arXiv submissions contain some form of hidden information.
Existing tools often fail to reliably remove all sensitive data.
The proposed ALC-NG tool effectively mitigates information leaks.
Abstract
Preprints are essential for the timely and open dissemination of research. arXiv, the most widely used preprint service, takes the idea of open science one step further by not only publishing the actual preprints but also LaTeX sources and other files used to create them. As known from other contexts, such as GitHub repositories, and anecdotally exemplified for arXiv, making source code publicly available risks disclosing otherwise "hidden" information. Consequently, the public availability of paper sources raises the question of how much sensitive content is (unintentionally) disclosed through them. In this paper, we systematically answer this question for all 2.7M arXiv submissions with available source files across three dimensions of source file-induced information disclosure: (1) inclusion of unnecessary files, (2) metadata embedded in files, and (3) irrelevant content in files…
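To make two of the three dimensions concrete, the following is a minimal sketch of how a submission's sources could be checked for unnecessary files (dimension 1) and for leftover LaTeX comments as one form of irrelevant content (dimension 3). It is not the authors' pipeline or the ALC-NG tool: the function names, the `AUXILIARY_SUFFIXES` list, and the regular expression are all assumptions made for illustration. Embedded metadata (dimension 2) is left out because it depends on per-format parsers.

```python
import re
from pathlib import PurePosixPath

# Assumption: an illustrative, incomplete set of file types that are
# not needed to compile a paper. The paper's own criteria may differ.
AUXILIARY_SUFFIXES = {".aux", ".log", ".bak", ".out", ".toc"}

# A '%' that is not preceded by a backslash starts a LaTeX comment.
# Simplification: this ignores verbatim environments and the rare
# '\\%' sequence (a line break followed directly by a comment).
COMMENT_RE = re.compile(r"(?<!\\)%(.*)$")


def find_unneeded_files(file_names):
    """Return the file names whose suffix marks them as auxiliary."""
    return [
        name
        for name in file_names
        if PurePosixPath(name).suffix.lower() in AUXILIARY_SUFFIXES
    ]


def find_comments(tex_source):
    """Return (line number, comment text) pairs for non-empty comments."""
    hits = []
    for lineno, line in enumerate(tex_source.splitlines(), start=1):
        match = COMMENT_RE.search(line)
        if match and match.group(1).strip():
            hits.append((lineno, match.group(1).strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "Results improve by 5\\% overall. % TODO: reviewer 2 hates this\n"
        "\\section{Intro}\n"
    )
    print(find_unneeded_files(["main.tex", "main.aux", "notes.BAK"]))
    print(find_comments(sample))
```

A check like this only flags candidates. Whether a flagged comment or file is actually sensitive still requires human judgment or further analysis.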
