Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig
Kashish Ara Shakil, Mansaf Alam (Member, IAENG), Shuchi Sethi

TL;DR
This paper analyzes large-scale cloud workload data using Hive and Pig, revealing insights into job clustering, arrival patterns, and resource usage distributions in a heterogeneous and dynamic cloud environment.
Contribution
It introduces a novel analytical method combining Hive and Pig for large-scale cloud workload analysis, providing new insights into workload distributions and clustering.
Findings
Job arrival times follow a Weibull distribution
Resource usage follows a Zipf-like distribution
Process runtimes exhibit a heavy-tailed distribution
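As an illustration of how a Weibull finding like the first one might be checked, here is a minimal sketch in Python. It is not the authors' method (the paper works in Hive and Pig). It assumes the arrival data has already been exported as a list of positive numbers, and it uses synthetic samples in place of the Google trace. The function name `fit_weibull` and the parameter values are hypothetical. The fit relies on the linearized Weibull CDF, ln(-ln(1 - F(x))) = k ln x - k ln λ, so the slope of a least-squares line gives the shape k and the intercept gives the scale λ.

```python
import math
import random


def fit_weibull(samples):
    """Estimate Weibull (shape, scale) by least squares on the linearized CDF."""
    xs = sorted(s for s in samples if s > 0)
    n = len(xs)
    # Median-rank plotting positions keep the empirical CDF strictly in (0, 1).
    pts = [
        (math.log(x), math.log(-math.log(1 - (i - 0.3) / (n + 0.4))))
        for i, x in enumerate(xs, start=1)
    ]
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    shape = sxy / sxx                           # slope of the fitted line = k
    scale = math.exp(mean_u - mean_v / shape)   # intercept = -k * ln(scale)
    return shape, scale


# Synthetic stand-in for the trace: scale 2.0, shape 0.8 (assumed values).
rng = random.Random(0)
samples = [rng.weibullvariate(2.0, 0.8) for _ in range(5000)]
shape_hat, scale_hat = fit_weibull(samples)
print(f"estimated shape={shape_hat:.3f}, scale={scale_hat:.3f}")
```

A shape estimate below 1 indicates bursty arrivals with many short gaps and occasional long ones. On real data, a goodness-of-fit test would still be needed before claiming the distribution holds.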
Abstract
Cloud computing deals with heterogeneity and dynamicity at all levels, so resources in such an environment must be managed and allocated properly. Resource planning and scheduling require a proper understanding of arrival patterns and scheduling of resources. Studying workloads aids in understanding their associated environment. Google released the latest version of its cluster trace, trace version 2.1, in November 2014. The trace consists of cell information covering about 29 days and spanning 700k jobs. This paper presents a statistical analysis of this cluster trace. Since the trace is very large, Hive, a Hadoop distributed file system (HDFS) based platform for querying and analyzing Big data, has been used. Hive was accessed through its Beeswax interface. The data was imported into HDFS through HCatalog. Apart from Hive, Pig which…
Taxonomy
Topics: Cloud Computing and Resource Management · Data Stream Mining Techniques · IoT and Edge/Fog Computing
