Data-Aware and Scalable Sensitivity Analysis for Decision Tree Ensembles

Namrita Varshney; Ashutosh Gupta; Arhaan Ahmad; Tanay V. Tayal; and S. Akshay

arXiv:2602.07453·cs.LG·February 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a data-aware, scalable sensitivity analysis framework for decision tree ensembles. It makes sensitivity verification more interpretable by restricting the reported examples to realistic ones that lie close to the dataset, found with MILP and SMT encodings.

Contribution

The paper presents novel MILP- and SMT-based methods for efficient, data-aware sensitivity verification, including support for multiclass ensembles and large models, backed by extensive experimental validation.

Findings

01. Scalability to ensembles with up to 800 trees of depth 8.

02. Significant speed-ups over existing methods.

03. Enhanced interpretability through realistic sensitivity examples.

Abstract

Decision tree ensembles are widely used in critical domains, making robustness and sensitivity analysis essential to their trustworthiness. We study the feature sensitivity problem, which asks whether an ensemble is sensitive to a specified subset of features -- such as protected attributes -- whose manipulation can alter model predictions. Existing approaches often yield examples of sensitivity that lie far from the training distribution, limiting their interpretability and practical value. We propose a data-aware sensitivity framework that constrains the sensitive examples to remain close to the dataset, thereby producing realistic and interpretable evidence of model weaknesses. To this end, we develop novel techniques for data-aware search using a combination of mixed-integer linear programming (MILP) and satisfiability modulo theories (SMT) encodings. Our contributions are fourfold.…
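The feature sensitivity question in the abstract can be illustrated with a toy brute-force search. The sketch below is not the paper's MILP/SMT method. It assumes that "sensitive to a subset of features" means there is a pair of inputs that agree on every feature outside the subset yet receive different predictions, and it models "close to the dataset" as an L1 distance bound to some dataset row. The ensemble, the feature names, the candidate grid, and the helper `find_sensitive_pair` are all hypothetical.

```python
from itertools import product

# Hypothetical toy ensemble: three depth-1 trees over (age, income, group).
# The ensemble predicts 1 when the summed leaf scores are positive.
def ensemble_predict(x):
    age, income, group = x
    score = 0.0
    score += 1.0 if income > 50 else -1.0
    score += 0.5 if age > 30 else -0.5
    score += 1.0 if group == 1 else -1.0
    return 1 if score > 0 else 0

def find_sensitive_pair(predict, grid, sensitive_idx, data=None, radius=None):
    """Return (x, x2) that differ only on sensitive_idx and are predicted
    differently, or None. If data and radius are given, both points must lie
    within L1 distance `radius` of some dataset row (assumed closeness test)."""
    def near_data(x):
        if data is None:
            return True
        return any(sum(abs(a - b) for a, b in zip(x, row)) <= radius
                   for row in data)

    sens_values = [grid[i] for i in sensitive_idx]
    for x in product(*grid):                  # enumerate candidate inputs
        if not near_data(x):
            continue
        for vals in product(*sens_values):    # vary only sensitive features
            x2 = list(x)
            for i, v in zip(sensitive_idx, vals):
                x2[i] = v
            x2 = tuple(x2)
            if x2 != x and near_data(x2) and predict(x) != predict(x2):
                return x, x2
    return None

grid = [(25, 40), (30, 80), (0, 1)]           # candidate values per feature
data = [(40, 80, 1), (25, 30, 0)]             # tiny illustrative dataset

pair = find_sensitive_pair(ensemble_predict, grid, sensitive_idx=[2])
close_pair = find_sensitive_pair(ensemble_predict, grid, [2],
                                 data=data, radius=1)
print(pair)        # ((25, 80, 0), (25, 80, 1))
print(close_pair)  # None
```

In this toy setup the unconstrained search finds a witness at (25, 80, ·), a point far from both dataset rows, while the data-constrained search finds none. This mirrors the abstract's concern that unconstrained sensitivity examples may lie far from the data. Exhaustive enumeration is only feasible for tiny grids, which is why encodings for solvers are needed at realistic scale.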

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The paper shows sensitivity verification is NP-hard even for depth-1 trees.
2. The authors implemented their method SVIM and outperformed prior baselines such as SENSPB and KANT.
3. They extended the method beyond binary classification to multi-class problems.
4. The paper is clear in definitions and theoretically well-justified.

Weaknesses

1. Whether the “data-aware” counterexamples lead to practical improvements in model fairness or robustness is not well explored, and remains an interesting practical direction.
2. The independence assumption is mitigated by restricting the space, yet it still introduces a theoretical gap between the assumed and real data distributions.

Minor formatting: “Figures” vs “Fig” (Line 456).

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The empirical results for SVIM are quite compelling. SVIM universally improves upon the runtime of prior methods, while producing counterexamples that are designed to be realistic.
- The proposed MILP formulation is interesting and involves several novel components.
- The problem studied here is interesting and well motivated.
- In general, the paper is well written and figures are clear.

Weaknesses

- While the paper is generally easy to follow, it is quite notation-heavy and suffers from some minor inconsistencies. In addition to the several specific comments listed below, I recommend adding a notation table to the appendix to help readers keep up.
- In Eq. 4/5, should $p_{kf}$ be $p_{fk}$?
- In Eq. Gap-Bin and Off-bin, $v_i$ is undefined.
- In the data-aware objective function defined under Utility Function, it seems like some things may be off. Should solution (2) be considered

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper's organization is clear and logical, and the paper is well situated in existing literature. It is clear from the authors' presentation what is novel, how it is novel, and what research problem each component of the paper is trying to address.
2. The new constraints that speed up optimization are well-reasoned and proven to be correct.
3. The ablation study is convincing that the combination of all methods leads to a greater improvement than any subset.
4. I am not an expert in t

Weaknesses

1. The benchmarking results are too aggregated and are lacking statistical significance testing or error bars. The results should have been obtained over repeat trials for each configuration, with a corresponding mean and standard deviation. Moreover, it seems that most problem instances evaluated in Figure 3 are on only 4 datasets -- the ones with a lot of combinations of # trees and depths. I think that this work requires 1) statistical significance testing on the results over repeat trials fo

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks