A Random Sample Partition Data Model for Big Data Analysis
Salman Salloum, Yulin He, Joshua Zhexue Huang, Xiaoliang Zhang, Tamer Z. Emara, Chenghao Wei, Heping He

TL;DR
This paper introduces a random sample partition (RSP) data model that divides big data into statistically similar blocks, enabling scalable analysis through efficient block-level sampling instead of costly record-level sampling.
Contribution
The paper proposes the RSP data model, which supports scalable big data analysis by pre-generating data blocks whose distributions are similar to that of the entire dataset.
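As a rough single-machine illustration of the idea (a sketch under stated assumptions, not necessarily the paper's partitioning procedure), the records can be shuffled once and then split into non-overlapping blocks, so that each block is a random sample of the whole. The function name `make_rsp_blocks`, the block count, and the synthetic Gaussian data are hypothetical and chosen only for the example.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records once, then split them into non-overlapping blocks.

    Hypothetical helper: a plain shuffle-and-split on one machine.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Take every num_blocks-th record so block sizes differ by at most one.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


# Synthetic data standing in for a big data set (assumption for illustration).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(10_000)]

blocks = make_rsp_blocks(data, num_blocks=20)

# Compare each block's mean with the mean of the full data set.
full_mean = statistics.fmean(data)
block_means = [statistics.fmean(block) for block in blocks]
print(f"full mean: {full_mean:.2f}")
print(f"block means range: {min(block_means):.2f} to {max(block_means):.2f}")
```

Because the blocks are produced once and stored, later analyses can reuse them without reshuffling the data.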
Findings
RSP data blocks can accurately estimate dataset statistics.
Block-level sampling reduces computational costs.
Analysis on RSP blocks yields results comparable to full dataset analysis.
Abstract
Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has a probability distribution similar to the whole big data set. Under this data model, efficient block-level sampling is used to randomly select RSP data blocks, replacing expensive record-level sampling to select sample data from a big distributed data set on a computing cluster. We show how RSP data blocks can be employed to estimate statistics of a big data set and build models which are equivalent to those built from the whole big data set. In this approach, analysis of a big data set becomes analysis of a few RSP data blocks which have been…
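The block-level sampling described in the abstract can be sketched as follows: instead of drawing individual records, a few whole blocks are selected at random and a statistic is estimated from them alone. This is a minimal sketch on synthetic data; the names `sample_blocks` and `estimate_mean`, the block sizes, and the number of selected blocks are assumptions for illustration and do not come from the paper.

```python
import random
import statistics


def sample_blocks(blocks, k, seed=0):
    """Select k whole blocks at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(blocks, k)


def estimate_mean(selected_blocks):
    """Estimate the data set mean as the average of the block means.

    With equal-sized blocks this equals the mean of the pooled records.
    """
    return statistics.fmean(statistics.fmean(b) for b in selected_blocks)


# Build equal-sized random blocks from synthetic data (assumption: the
# blocks were generated earlier by shuffling the records and splitting).
data_rng = random.Random(7)
records = [data_rng.gauss(50.0, 10.0) for _ in range(10_000)]
data_rng.shuffle(records)
all_blocks = [records[i:i + 500] for i in range(0, len(records), 500)]

# Analyse 5 of the 20 blocks instead of the full data set.
selected = sample_blocks(all_blocks, k=5, seed=1)
estimate = estimate_mean(selected)
true_mean = statistics.fmean(records)
print(f"estimate from 5 blocks: {estimate:.2f}, full-data mean: {true_mean:.2f}")
```

Selecting blocks touches only a list of block references, which is the source of the cost saving over record-level sampling: no scan over individual records is needed to draw the sample.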
