Evaluating Hadoop Clusters with TPCx-HS
Todor Ivanov, Sead Izberovic

TL;DR
This paper evaluates how network configurations affect Hadoop cluster performance using the TPCx-HS benchmark, revealing significant performance gains with dedicated networks despite similar costs.
Contribution
It provides an experimental comparison of shared versus dedicated network setups for Hadoop clusters using the TPCx-HS benchmark.
Findings
Dedicated networks yield better Hadoop performance than shared networks.
Performance gains justify the minimal additional cost.
TPCx-HS is network intensive enough to expose the effect of the network setup on Hadoop clusters.
Abstract
The growing complexity and variety of Big Data platforms make it both difficult and time consuming for system users to properly set up and operate these systems. Another challenge is to compare the platforms in order to choose the most appropriate one for a particular application. All these factors motivate the need for a standardized Big Data benchmark that can help users in the process of platform evaluation. TPCx-HS [1][2] was recently released as the first standardized Big Data benchmark designed to stress test a Hadoop cluster. The goal of this study is to evaluate and compare how the network setup influences the performance of a Hadoop cluster. In particular, experiments were performed using shared and dedicated 1Gbit networks utilized by the same Cloudera Hadoop Distribution (CDH) cluster setup. The TPCx-HS benchmark, which is very network intensive, was used to…
Taxonomy
Topics: Cloud Computing and Resource Management · Data Management and Algorithms · Advanced Database Systems and Queries
