Accessibility Barriers in Multi-Terabyte Public Datasets: The Gap Between Promise and Practice
Marc Bara

TL;DR
This paper investigates the practical barriers to accessing large public datasets and finds that, although marketed as open, they often demand significant resources and expertise, which in practice limits access to well-funded institutions.
Contribution
It highlights the gap between the perceived openness of large datasets and the actual barriers to access, emphasizing infrastructure and cost challenges.
Findings
Datasets marketed as "publicly accessible" typically require a minimum investment of $1,000 or more for meaningful analysis.
Complex processing pipelines demand $10,000 to $100,000 or more in infrastructure costs.
Practical access is largely limited to well-resourced institutions because of technical and financial barriers.
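To make the cost thresholds above concrete, the following is a minimal sketch of how a prospective user might total a budget and compare it with the two figures reported here ($1,000 and $10,000). The cost breakdown into transfer, storage, and compute, the `CostInputs` fields, and the function names are illustrative assumptions, not the paper's methodology. No rates are built in, so the caller must supply real quotes.

```python
from dataclasses import dataclass

# Thresholds taken from the findings listed above.
MEANINGFUL_ANALYSIS_FLOOR_USD = 1_000
COMPLEX_PIPELINE_FLOOR_USD = 10_000


@dataclass
class CostInputs:
    """Hypothetical cost model; every rate is supplied by the caller."""
    dataset_tb: float                # size of the data actually processed
    transfer_usd_per_tb: float       # egress or download cost per TB
    storage_usd_per_tb_month: float  # working-copy storage cost
    months_stored: float
    compute_usd_per_hour: float      # aggregate cluster cost per hour
    compute_hours: float


def estimate_total_usd(c: CostInputs) -> float:
    """Sum transfer, storage, and compute costs (assumed breakdown)."""
    transfer = c.dataset_tb * c.transfer_usd_per_tb
    storage = c.dataset_tb * c.storage_usd_per_tb_month * c.months_stored
    compute = c.compute_usd_per_hour * c.compute_hours
    return transfer + storage + compute


def budget_tier(total_usd: float) -> str:
    """Place an estimate relative to the thresholds reported in the findings."""
    if total_usd >= COMPLEX_PIPELINE_FLOOR_USD:
        return "complex-pipeline range"
    if total_usd >= MEANINGFUL_ANALYSIS_FLOOR_USD:
        return "meaningful-analysis range"
    return "below reported minimum"


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are not real prices.
    inputs = CostInputs(
        dataset_tb=10,
        transfer_usd_per_tb=50,
        storage_usd_per_tb_month=20,
        months_stored=2,
        compute_usd_per_hour=1.5,
        compute_hours=400,
    )
    total = estimate_total_usd(inputs)
    print(f"{total:.2f} USD -> {budget_tier(total)}")
```

Keeping the rates as required inputs rather than defaults is deliberate: prices vary by provider and change over time, so any baked-in figure would be misleading.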
Abstract
The promise of "free and open" multi-terabyte datasets often collides with harsh realities. While these datasets may be technically accessible, practical barriers -- from processing complexity to hidden costs -- create a system that primarily serves well-funded institutions. This study examines accessibility challenges across web crawls, satellite imagery, scientific data, and collaborative projects, revealing a consistent two-tier system where theoretical openness masks practical exclusivity. Our analysis demonstrates that datasets marketed as "publicly accessible" typically require minimum investments of $1,000+ for meaningful analysis, with complex processing pipelines demanding $10,000-100,000+ in infrastructure costs. The infrastructure requirements -- distributed computing knowledge, domain expertise, and substantial budgets -- effectively gatekeep these datasets despite their…
Taxonomy
Topics: Digital Accessibility for Disabilities · Technology Use by Older Adults · Context-Aware Activity Recognition Systems
