The Silent Spill: Measuring Sensitive Data Leaks Across Public URL Repositories
Tarek Ramadan, AbdelRahman Abdou, Mohammad Mannan, Amr Youssef

TL;DR
This paper introduces an automated system to measure and analyze the extent of sensitive data leaks across millions of publicly accessible URLs from various platforms, revealing significant exposure risks.
Contribution
It presents an automated approach that combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to detect sensitive information leaks at large scale across diverse URL sources.
Findings
Identified 12,331 potential sensitive data exposures among 6,094,475 collected URLs.
Demonstrated the prevalence of accidental sensitive information exposure.
Showcased the effectiveness of combined detection techniques.
Abstract
A large number of URLs are made public by various platforms for security analysis, archiving, and paste sharing -- such as VirusTotal, URLScan.io, Hybrid Analysis, the Wayback Machine, and RedHunt. These services may unintentionally expose links containing sensitive information, as reported in some news articles and blog posts. However, no large-scale measurement has quantified the extent of such exposures. We present an automated system that detects and analyzes potential sensitive information leaked through publicly accessible URLs. The system combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to identify potential leaks. We apply it to 6,094,475 URLs collected from public scanning platforms, paste sites, and web archives, identifying 12,331 potential exposures across authentication, financial, personal, and document-related domains.…
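To make the first stage of the pipeline concrete, here is a minimal sketch of what lexical URL filtering could look like. It is an illustration only, not the authors' implementation: the keyword lists, the category names as dictionary keys, and the function name `lexical_categories` are all assumptions chosen for this example, and the later stages (dynamic rendering, OCR, content classification) are not shown.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical keyword lists, grouped by the four domains named in the abstract.
# The actual terms used by the paper's filter are not given here.
SENSITIVE_KEYWORDS = {
    "authentication": ("token", "password", "reset", "session", "apikey"),
    "financial": ("invoice", "payment", "receipt"),
    "personal": ("passport", "ssn", "resume"),
    "document": ("contract", "statement", "report"),
}


def lexical_categories(url: str) -> set[str]:
    """Return the categories whose keywords appear in the URL path or query parameter names."""
    parts = urlsplit(url)
    # Only parameter names are inspected; values are often opaque random strings.
    param_names = " ".join(
        name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    )
    haystack = f"{parts.path} {param_names}".lower()
    return {
        category
        for category, words in SENSITIVE_KEYWORDS.items()
        if any(word in haystack for word in words)
    }
```

A URL such as `https://example.com/account/reset?token=abc123` would be flagged under the hypothetical `authentication` category, while `https://example.com/blog/hello` would match nothing and be dropped before the more expensive rendering and OCR steps. Plain substring matching is deliberately crude and will produce false positives, which is one reason a pipeline like the one described would follow it with content-level checks.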
Taxonomy
Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Web Data Mining and Analysis
