The Silent Spill: Measuring Sensitive Data Leaks Across Public URL Repositories

Tarek Ramadan; AbdelRahman Abdou; Mohammad Mannan; Amr Youssef

arXiv:2602.21826 · cs.CR · February 26, 2026

Open Access

TL;DR

This paper introduces an automated system that measures sensitive data leaks in publicly accessible URLs. Applied to 6,094,475 URLs collected from public scanning platforms, paste sites, and web archives, it flags 12,331 potential exposures.

Contribution

The paper presents an automated system that combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to detect potential sensitive information leaks at large scale across diverse URL sources.

Findings

01

Identified 12,331 potential sensitive data exposures among 6,094,475 collected URLs.

02

Found unintentional exposures spanning authentication, financial, personal, and document-related domains.

03

Showcased the effectiveness of combined detection techniques.

Abstract

A large number of URLs are made public by platforms for security analysis, archiving, and paste sharing, such as VirusTotal, URLScan.io, Hybrid Analysis, the Wayback Machine, and RedHunt. These services may unintentionally expose links containing sensitive information, as reported in some news articles and blog posts. However, no large-scale measurement has quantified the extent of such exposures. We present an automated system that detects and analyzes potential sensitive information leaked through publicly accessible URLs. The system combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to identify potential leaks. We apply it to 6,094,475 URLs collected from public scanning platforms, paste sites, and web archives, identifying 12,331 potential exposures across authentication, financial, personal, and document-related domains.…
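To make the first stage of the pipeline concrete, the following is a minimal sketch of what a lexical URL filter might look like. It is not the authors' implementation: the keyword list `SENSITIVE_KEYS` and the function names `lexical_flags` and `filter_urls` are hypothetical, chosen only to illustrate the idea of shortlisting URLs by inspecting their path and query parameter names before the more expensive rendering, OCR, and classification stages.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical keyword list for illustration only; the paper's actual
# filter terms are not stated in the abstract.
SENSITIVE_KEYS = (
    "token", "password", "passwd", "secret", "api_key",
    "apikey", "session", "auth", "invoice", "reset",
)


def lexical_flags(url):
    """Return the sorted keywords found in the URL path or query parameter names."""
    parts = urlsplit(url)
    path = parts.path.lower()
    # Only parameter names are inspected here, not their values.
    param_names = [
        name.lower()
        for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    ]
    hits = set()
    for keyword in SENSITIVE_KEYS:
        if keyword in path or any(keyword in name for name in param_names):
            hits.add(keyword)
    return sorted(hits)


def filter_urls(urls):
    """Keep only URLs with at least one lexical hit, paired with their hits."""
    shortlisted = []
    for url in urls:
        flags = lexical_flags(url)
        if flags:
            shortlisted.append((url, flags))
    return shortlisted


if __name__ == "__main__":
    sample = [
        "https://example.com/reset?token=abc123",
        "https://example.com/about",
        "https://example.com/login?API_KEY=x&page=2",
    ]
    for url, flags in filter_urls(sample):
        print(url, flags)
```

Substring matching of this kind is deliberately crude and will produce false positives (for example, `auth` also matches a path such as `/authors`), which is presumably one reason a lexical filter would be followed by content-level checks rather than used alone.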

Taxonomy

Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Web Data Mining and Analysis