A Simple and Fast Baseline for Tuning Large XGBoost Models

Sanyam Kapoor; Valerio Perrone

arXiv:2111.06924 · cs.LG · November 16, 2021 · 1 citation

TL;DR

This paper introduces a simple, fast baseline method using uniform subsampling for hyperparameter tuning of large XGBoost models, significantly reducing training time while maintaining performance on large tabular datasets.

Contribution

It proposes uniform data subsampling as the fidelity dimension in multi-fidelity hyperparameter optimization, which makes tuning large-scale XGBoost models more efficient.

Findings

01

Effective on large tabular datasets ranging from 15 to 70 GB in size

02

Reduces tuning time significantly

03

Maintains competitive predictive performance

Abstract

XGBoost, a scalable tree boosting algorithm, has proven effective for many prediction tasks of practical interest, especially on tabular datasets. Hyperparameter tuning can further improve predictive performance, but, unlike with neural networks, full-batch training of many models on large datasets can be time-consuming. Building on the observations that (i) there is a strong linear relation between dataset size and training time, (ii) XGBoost models satisfy the ranking hypothesis, and (iii) lower-fidelity models can discover promising hyperparameter configurations, we show that uniform subsampling makes for a simple yet fast baseline to speed up the tuning of large XGBoost models using multi-fidelity hyperparameter optimization with data subsets as the fidelity dimension. We demonstrate the effectiveness of this baseline on large-scale tabular datasets ranging from 15 to 70 GB in size.
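
The idea in the abstract can be sketched as a successive-halving loop in which the fidelity is the fraction of rows drawn by uniform subsampling. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fraction schedule, the `keep` ratio, and the `toy_loss` stand-in are all hypothetical, and in practice `evaluate` would train an XGBoost model on the selected rows and return a validation loss.

```python
import random


def uniform_subsample(n_rows, fraction, rng):
    """Return sorted row indices of a uniform random subset (no replacement)."""
    k = max(1, int(n_rows * fraction))
    return sorted(rng.sample(range(n_rows), k))


def successive_halving(configs, evaluate, n_rows,
                       fractions=(0.1, 0.3, 1.0), keep=0.5, seed=0):
    """Rank configurations on growing data subsets and return the best one.

    `evaluate(config, row_indices)` is a user-supplied callable (an assumption
    of this sketch) that returns a validation loss, lower being better.
    """
    rng = random.Random(seed)
    survivors = list(configs)
    for rung, fraction in enumerate(fractions):
        rows = uniform_subsample(n_rows, fraction, rng)
        ranked = sorted(survivors, key=lambda c: evaluate(c, rows))
        if rung < len(fractions) - 1:
            # Promote only the top share to the next, larger subset.
            survivors = ranked[: max(1, int(len(ranked) * keep))]
        else:
            survivors = ranked
    return survivors[0]


def toy_loss(config, rows):
    """Hypothetical stand-in for 'train on these rows, return validation loss'.

    The loss shrinks as more rows are used, and the ordering of configurations
    is the same at every fidelity, mimicking the ranking hypothesis.
    """
    return (config["eta"] - 0.3) ** 2 + 1.0 / len(rows)


configs = [{"eta": e} for e in (0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)]
best = successive_halving(configs, toy_loss, n_rows=1000)
```

Because cheap low-fidelity runs discard most configurations early, only a few candidates are ever trained on the full dataset. This saves time only to the extent that training cost grows with dataset size and that rankings on subsets agree with rankings on the full data, which are the first two observations in the abstract.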

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Anomaly Detection Techniques and Applications