ALOJA: A Framework for Benchmarking and Predictive Analytics in Big Data Deployments
Josep Ll. Berral, Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Daron Green

TL;DR
ALOJA is a comprehensive framework that uses machine learning to analyze Hadoop performance data, enabling cost-effective deployment, predictive modeling, and optimization of Big Data systems.
Contribution
It introduces an open repository of Hadoop benchmarks and a machine learning extension for automated performance modeling and analysis.
Findings
Created a repository with over 40,000 Hadoop job executions
Developed predictive models for execution time forecasting
Enabled anomaly detection and benchmark prioritization
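
To make the execution-time forecasting and anomaly detection findings concrete, here is a minimal sketch in Python. It is not ALOJA's actual model: it swaps in a plain k-nearest-neighbour average over past runs, and the feature names (mappers, I/O buffer size, SSD flag), the sample values, and the `tolerance` threshold are all hypothetical, chosen only for illustration.

```python
from math import sqrt

# Hypothetical records: (mappers, io_buffer_kb, uses_ssd) -> execution time in seconds.
# These values are made up for illustration; they are NOT taken from the ALOJA repository.
HISTORY = [
    ((4, 64, 0), 900.0),
    ((4, 128, 0), 860.0),
    ((8, 64, 0), 700.0),
    ((8, 128, 1), 520.0),
    ((12, 128, 1), 480.0),
    ((12, 256, 1), 470.0),
]


def _feature_ranges(history):
    """Return the value range of each feature so distances are comparable."""
    columns = zip(*(features for features, _ in history))
    # Fall back to 1.0 for constant features to avoid division by zero.
    return [(max(col) - min(col)) or 1.0 for col in columns]


def _distance(a, b, ranges):
    """Euclidean distance after scaling every feature by its observed range."""
    return sqrt(sum(((x - y) / r) ** 2 for x, y, r in zip(a, b, ranges)))


def predict_time(config, history=HISTORY, k=2):
    """Predict execution time as the mean of the k most similar past runs."""
    ranges = _feature_ranges(history)
    nearest = sorted(history, key=lambda rec: _distance(config, rec[0], ranges))[:k]
    return sum(seconds for _, seconds in nearest) / len(nearest)


def is_anomalous(config, observed, history=HISTORY, k=2, tolerance=0.3):
    """Flag a run whose observed time deviates from the prediction by more than
    `tolerance` (relative error). The threshold is an assumption of this sketch."""
    expected = predict_time(config, history, k)
    return abs(observed - expected) / expected > tolerance
```

With this toy history, `predict_time((8, 128, 1))` averages the two closest runs, and `is_anomalous((8, 128, 1), 1000.0)` flags a run that took about twice as long as similar configurations. A simple neighbour average keeps the idea visible; a real study would compare several learners and validate them on held-out executions.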
Abstract
This article presents the ALOJA project and its analytics tools, which leverage machine learning to interpret Big Data benchmark performance data and tuning. ALOJA is part of a long-term collaboration between BSC and Microsoft to automate the characterization of cost-effectiveness in Big Data deployments, currently focusing on Hadoop. Hadoop presents a complex run-time environment, where costs and performance depend on a large number of configuration choices. The ALOJA project has created an open, vendor-neutral repository featuring over 40,000 Hadoop job executions and their performance details. The repository is accompanied by a test-bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameters and Cloud services. Despite early success within ALOJA, a comprehensive study requires automation of modeling procedures to allow an analysis of…
