A Survey on Sampling and Profiling over Big Data (Technical Report)
Zhicheng Liu, Aoqian Zhang

TL;DR
This survey reviews sampling and profiling techniques in big data, highlighting how sampling reduces data volume and accelerates processing, with experimental evidence showing sampled data can yield comparable or better results than full datasets.
Contribution
It provides a comprehensive overview of sampling and profiling methods in big data, emphasizing their importance and potential for future applications.
Findings
Sampling methods effectively reduce data size and processing time.
Results computed on sampled data can be close to, or better than, results computed on the full data.
Sampling is crucial for scalable big data analysis.
Abstract
Due to the development of internet technology and computer science, data is growing at an exponential rate. Big data brings new opportunities and challenges. On the one hand, we can analyze and mine big data to discover hidden information and extract more potential value. On the other hand, the 5V characteristics of big data, especially Volume (the sheer amount of data), bring challenges to storage and processing. Some traditional data mining algorithms, machine learning algorithms, and data profiling tasks struggle to handle such a large amount of data, which is highly demanding of hardware resources and time-consuming to process. Sampling methods can effectively reduce the amount of data and help speed up data processing. Hence, sampling technology has been widely studied and used in the big data context, e.g., methods for determining sample size, combining…
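As an illustration of how sampling reduces data volume, the sketch below uses reservoir sampling (Algorithm R), a standard technique for drawing a fixed-size uniform sample from a stream in one pass. It is not taken from the survey itself; the function name `reservoir_sample`, its parameters, and the synthetic data are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw a uniform sample of up to k items from a stream in a single pass.

    Hypothetical helper for illustration (Algorithm R); memory use is O(k)
    regardless of how long the stream is.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by picking a
            # random slot in [0, i] and replacing only if it falls inside
            # the reservoir.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Synthetic data standing in for a large dataset (an assumption for this demo).
N = 1_000_000
sample = reservoir_sample(range(N), k=1_000, seed=0)

full_mean = (N - 1) / 2                 # exact mean of 0..N-1
sample_mean = sum(sample) / len(sample)  # estimate from 0.1% of the data
```

Here the mean is estimated from 1,000 of the 1,000,000 values. How close such an estimate comes to the full-data value depends on the sample size and the variance of the data, which is why sample-size determination matters.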
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Data Mining Algorithms and Applications
