Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

Luann Jung; Brendan Whitaker; Kyle Chard; Aaron Elmore

arXiv:1810.05784·cs.IR·October 16, 2018

TL;DR

This paper introduces an automated clustering method to quantify the organization of large, heterogeneous data repositories, providing a novel 'cleanliness' score to assess chaos and improve data management.

Contribution

It presents a parallel clustering pipeline that processes diverse file types and introduces a new 'cleanliness' metric validated on synthetic and real datasets.

Findings

01

The 'cleanliness' score tracks how well organized a repository is, as demonstrated on both synthetic and real datasets.

02

The score is more consistent than the other candidate cleanliness measures the authors considered.

03

The pipeline handles heterogeneous file types, such as text and tabular data.

Abstract

As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel "cleanliness" score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.
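
To make the idea of scoring a repository from a clustering concrete, here is a minimal sketch. The abstract does not define the paper's cleanliness score, so this uses a simple stand-in: directory purity, the fraction of files whose cluster label matches the majority label in their directory. The function name `directory_purity`, the example paths, and the cluster labels are all hypothetical, and the labels are assumed to come from some earlier content/metadata clustering step that is not shown.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def directory_purity(paths, labels):
    """Fraction of files whose cluster label is the majority label of their directory.

    Hypothetical stand-in metric for illustration; NOT the paper's cleanliness score.
    """
    if len(paths) != len(labels):
        raise ValueError("paths and labels must have the same length")
    if not paths:
        raise ValueError("need at least one file")

    # Count how many files of each cluster label live in each directory.
    by_dir = defaultdict(Counter)
    for path, label in zip(paths, labels):
        by_dir[str(PurePosixPath(path).parent)][label] += 1

    # For every directory, keep only the size of its largest cluster.
    majority_total = sum(c.most_common(1)[0][1] for c in by_dir.values())
    return majority_total / len(paths)


# Assumed example: similar files grouped together vs. scattered across directories.
tidy_paths = ["genomes/a.csv", "genomes/b.csv", "notes/x.txt", "notes/y.txt"]
tidy_labels = [0, 0, 1, 1]
messy_paths = ["dump/a.csv", "dump/x.txt", "misc/b.csv", "misc/y.txt"]
messy_labels = [0, 1, 0, 1]

tidy_score = directory_purity(tidy_paths, tidy_labels)
messy_score = directory_purity(messy_paths, messy_labels)
```

Under this stand-in, a layout where each directory holds files from a single cluster scores higher than one where clusters are mixed across directories. Purity ignores cases where one cluster is split over many directories, which is one reason a real measure would likely need more than this.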

Taxonomy

Topics: Data Quality and Management · Scientific Computing and Data Management · Big Data and Business Intelligence