# An Experimental Evaluation of Large Scale GBDT Systems

**Authors:** Fangcheng Fu, Jiawei Jiang, Yingxia Shao, Bin Cui

arXiv: 1907.01882 · 2019-08-06

## TL;DR

This paper systematically evaluates data management strategies for distributed GBDT systems, introduces Vero, a new system that combines vertical partitioning with row-store, and provides guidelines for choosing a data management policy for a given workload.

## Contribution

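The paper introduces a quadrant categorization of data management policies along two axes, data partitioning and data storage. It proposes Vero, which adopts the previously unexplored combination of vertical partitioning and row-store, and it implements the different quadrants in a single code base so that they can be compared empirically under the same conditions.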
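The two axes of the categorization can be illustrated on a toy dataset. The sketch below is not Vero's implementation: it assumes a small dense matrix held in memory, simulates workers as list entries, and uses hypothetical helper names (`horizontal_partition`, `vertical_partition`, `to_row_store`, `to_column_store`, `build_quadrants`). Horizontal partitioning and column-store are taken here as the usual counterparts of the vertical partitioning and row-store named in the abstract.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[int]]  # data[i][j] = value of feature j for instance i


def horizontal_partition(data: Matrix, num_workers: int) -> List[Matrix]:
    """Each worker holds a block of instances (rows) with all features."""
    n = len(data)
    bounds = [n * w // num_workers for w in range(num_workers + 1)]
    return [data[bounds[w]:bounds[w + 1]] for w in range(num_workers)]


def vertical_partition(data: Matrix, num_workers: int) -> List[Matrix]:
    """Each worker holds all instances but only a block of features (columns)."""
    d = len(data[0])
    bounds = [d * w // num_workers for w in range(num_workers + 1)]
    return [
        [row[bounds[w]:bounds[w + 1]] for row in data]
        for w in range(num_workers)
    ]


def to_row_store(shard: Matrix) -> Matrix:
    """Row-store: the outer list iterates over instances."""
    return [list(row) for row in shard]


def to_column_store(shard: Matrix) -> Matrix:
    """Column-store: the outer list iterates over features."""
    return [list(col) for col in zip(*shard)]


def build_quadrants(
    data: Matrix, num_workers: int
) -> Dict[Tuple[str, str], List[Matrix]]:
    """Return the per-worker shards for each (partitioning, storage) pair."""
    partitions: Dict[str, List[Matrix]] = {
        "horizontal": horizontal_partition(data, num_workers),
        "vertical": vertical_partition(data, num_workers),
    }
    layouts: Dict[str, Callable[[Matrix], Matrix]] = {
        "row": to_row_store,
        "column": to_column_store,
    }
    return {
        (p_name, s_name): [layout(shard) for shard in shards]
        for p_name, shards in partitions.items()
        for s_name, layout in layouts.items()
    }


# Toy data: 4 instances x 4 features, value = 10 * row + column.
data = [[10 * r + c for c in range(4)] for r in range(4)]
quadrants = build_quadrants(data, num_workers=2)

# The combination Vero adopts: vertical partitioning with row-store.
for worker_id, shard in enumerate(quadrants[("vertical", "row")]):
    print(f"worker {worker_id}: {shard}")
```

In the vertical, row-store quadrant each simulated worker sees every instance but only its own slice of features, stored instance by instance. The other three entries of `quadrants` give the corresponding shards for the remaining combinations.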

## Key findings

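- How the training dataset is partitioned and stored affects the performance of distributed GBDT, yet existing systems manage data in different ways without having studied this impact.
- Vero's combination of vertical partitioning and row-store suits many large-scale workloads.
- The theoretical and experimental results yield a guideline for choosing a data management policy for a given workload.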

## Abstract

Gradient boosting decision tree (GBDT) is a widely used machine learning algorithm in both data analytics competitions and real-world industrial applications. Further, driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. However, we find it surprising that the existing systems manage the training dataset in different ways, but none of them has studied the impact of data management. To that end, this paper aims to study the pros and cons of different data management methods regarding the performance of distributed GBDT. We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. Then we conduct an in-depth systematic analysis and summarize the advantageous scenarios of the quadrants. Based on the analysis, we further propose a novel distributed GBDT system named Vero, which adopts the unexplored composition of vertical partitioning and row-store and suits many large-scale cases. To validate our analysis empirically, we implement different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline on choosing a proper data management policy for a given workload.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.01882/full.md

## Figures

45 figures with captions in the complete paper: https://tomesphere.com/paper/1907.01882/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1907.01882/full.md

---
Source: https://tomesphere.com/paper/1907.01882