TL;DR
This paper introduces a compression scheme for boosted decision trees that significantly reduces model size, enabling efficient deployment on resource-constrained IoT devices without sacrificing performance.
Contribution
It presents a novel training method and memory layout for compact boosted decision trees, achieving 4-16x compression while maintaining accuracy.
Findings
Models achieve the same performance at 4-16x smaller size.
Enables IoT devices to operate independently with minimal energy.
Facilitates remote monitoring and real-time analytics in power-limited environments.
Abstract
Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4-16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This…
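The idea of rewarding feature and threshold reuse during training can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Split`, `penalized_gain`, and `choose_split`, the additive (linear) penalty form, and the penalty values are all assumptions made for illustration. The sketch subtracts a fixed cost from a candidate split's gain whenever the split would introduce a feature or a threshold that the ensemble has not used before.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Split(NamedTuple):
    feature: int      # index of the feature to split on
    threshold: float  # split threshold
    gain: float       # unpenalized loss reduction of the split


def penalized_gain(
    split: Split,
    used_features: Set[int],
    used_thresholds: Set[Tuple[int, float]],
    feature_penalty: float,
    threshold_penalty: float,
) -> float:
    """Gain minus a fixed cost for each new feature/threshold (assumed form)."""
    score = split.gain
    if split.feature not in used_features:
        score -= feature_penalty
    if (split.feature, split.threshold) not in used_thresholds:
        score -= threshold_penalty
    return score


def choose_split(
    candidates: Iterable[Split],
    used_features: Set[int],
    used_thresholds: Set[Tuple[int, float]],
    feature_penalty: float = 0.5,    # hypothetical value
    threshold_penalty: float = 0.2,  # hypothetical value
) -> Split:
    """Pick the best penalized split and record what the ensemble now uses."""
    best = max(
        candidates,
        key=lambda s: penalized_gain(
            s, used_features, used_thresholds, feature_penalty, threshold_penalty
        ),
    )
    used_features.add(best.feature)
    used_thresholds.add((best.feature, best.threshold))
    return best


# The ensemble already uses feature 0 with threshold 1.5.
candidates = [Split(0, 1.5, 1.0), Split(1, 0.3, 1.2)]
reused = choose_split(candidates, {0}, {(0, 1.5)})
unpenalized = choose_split(candidates, {0}, {(0, 1.5)}, 0.0, 0.0)
```

With the penalties active, the already-used split on feature 0 wins even though its raw gain is lower; with both penalties set to zero, the split with the highest raw gain is chosen, as in ordinary gradient boosting.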
Peer Reviews
Decision: ICLR 2026 Poster
The work tackles a concrete real-world problem: DTs on microcontrollers, where RAM and flash budgets are very limited. The article is well-written, structured, and easy to follow. The proposed method is described in sufficient detail. Specifically, the introduction of ensemble-level penalties using features and thresholds is a simple yet effective idea to induce parameter reuse, which complements post-training pruning/quantization. Furthermore, the design choice to store the threshold bit-widt
The primary motivation for this work is the deployment of GBDT on resource-constrained devices, where memory, latency, and energy consumption are all critical constraints. The authors report memory savings but do not present any experimental results on inference speed or energy consumption. Without this analysis, the practical utility of ToaD for real-time edge applications remains unproven. The authors state that the $RF$ is “the ratio between the global number of value
1. The work enables complex models like GBDTs to run on severely memory-constrained microcontrollers. 2. The proposed pointer-less array-based memory layout is highly suitable for microcontroller deployment, minimizing memory footprint and avoiding inefficient pointer chasing. 3. The method is orthogonal to many existing compression techniques such as pruning and quantization, and is easy to integrate into existing work. 4. The experimental evaluation is comprehensive and sound.
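To make the notion of a pointer-less, array-based layout concrete, here is a generic sketch. It is not the paper's actual layout: the bit widths, the breadth-first ordering of a complete binary tree, and the names `pack_node`, `unpack_node`, and `predict_tree` are assumptions chosen for illustration. Each internal node is packed into one small integer holding a feature index and an index into a shared threshold table, and children are located by arithmetic on the node index rather than by stored pointers.

```python
from typing import Sequence, Tuple

FEATURE_BITS = 4    # assumed width of the feature index field
THRESHOLD_BITS = 4  # assumed width of the threshold index field


def pack_node(feature_idx: int, threshold_idx: int) -> int:
    """Pack a feature index and a threshold-table index into one integer."""
    assert 0 <= feature_idx < (1 << FEATURE_BITS)
    assert 0 <= threshold_idx < (1 << THRESHOLD_BITS)
    return (feature_idx << THRESHOLD_BITS) | threshold_idx


def unpack_node(packed: int) -> Tuple[int, int]:
    """Recover (feature index, threshold-table index) from a packed node."""
    return packed >> THRESHOLD_BITS, packed & ((1 << THRESHOLD_BITS) - 1)


def predict_tree(
    nodes: Sequence[int],         # packed internal nodes, breadth-first order
    leaves: Sequence[float],      # leaf values, left to right
    thresholds: Sequence[float],  # threshold table shared across all trees
    x: Sequence[float],           # input feature vector
) -> float:
    """Traverse a complete binary tree without pointers.

    The children of node i sit at 2*i + 1 and 2*i + 2; any index past the
    internal nodes refers to a leaf.
    """
    i = 0
    n = len(nodes)
    while i < n:
        feature_idx, threshold_idx = unpack_node(nodes[i])
        go_left = x[feature_idx] <= thresholds[threshold_idx]
        i = 2 * i + 1 if go_left else 2 * i + 2
    return leaves[i - n]


# A depth-2 tree that reuses the same threshold entry in both lower nodes.
thresholds = [0.5, 2.0]
nodes = [pack_node(0, 0), pack_node(1, 1), pack_node(1, 1)]
leaves = [-1.0, 1.0, 2.0, 3.0]
left_most = predict_tree(nodes, leaves, thresholds, [0.2, 1.0])
right_most = predict_tree(nodes, leaves, thresholds, [0.9, 5.0])
```

Because thresholds live in one shared table, a threshold reused across nodes or trees is stored only once, and each node needs just enough bits to index the table. The per-node decoding (shift and mask) is the kind of extra work the latency concern in the reviews refers to.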
1. Potential inference latency overhead. The decoding process involving bit-level manipulations and lookups in global arrays may be inherently more computationally expensive than the direct pointer-based traversal used in standard implementations. The paper would be significantly strengthened by an end-to-end latency evaluation. 2. The linear penalty is not motivated theoretically (e.g., from a Bayesian perspective) or empirically against other potential forms. Would a logarithmic penalty, which might
Penalizing the use of new features/thresholds encourages reuse across trees. The idea of feature reuse and threshold reuse is interesting and simple yet useful for sustainable ML. The work introduces a new loss function and is very useful for efficient ML on resource-constrained devices. Good sensitivity analysis. I found the writing to be decent, and the paper is well structured. The method achieves performance gains with minimal memory consumption compared to SOTA methods.
The part on memory layout based on encoding the information in a bit-wise manner is not novel. No actual implementation on MCUs, which makes this a purely algorithmic work. A bit more analysis, including power consumption and energy efficiency, is required. Domains requiring distinct rules (e.g., heterogeneous datasets) do not allow threshold reuse without performance loss. The idea is good, but the utility of it is limited. For example, it does not make sense to train ML models on tiny MCU
