# Learning Interpretable Models Using Uncertainty Oracles

**Authors:** Abhishek Ghose, Balaraman Ravindran

arXiv: 1906.06852 · 2024-08-26

## TL;DR

This paper introduces a method for learning small, interpretable models that limits the accuracy lost when a model is shrunk. It encodes the training distribution as a Dirichlet Process and uses the uncertainty scores of a separate probabilistic model (the uncertainty oracle) to project the data down to one dimension. The method applies unchanged across model families.

## Contribution

The authors propose a technique that combines a Dirichlet Process encoding of the training distribution with uncertainty scores from an oracle model to learn more accurate small models. It applies without change to different model families and to different notions of model size.
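
As a rough illustration of why a Dirichlet Process allows a flexible number of modes, the standard truncated stick-breaking construction can be sketched as follows. This is a generic textbook construction, not code from the paper; the function name `stick_breaking_weights` and the values of `alpha` and `truncation` are assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Truncated stick-breaking draw of Dirichlet Process mixture weights.

    Hypothetical helper for illustration only. A smaller alpha tends to
    concentrate mass on fewer components (fewer effective modes).
    """
    rng = random.Random(seed)
    remaining = 1.0  # length of the stick not yet assigned
    weights = []
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # the last component takes whatever is left
    return weights


weights = stick_breaking_weights(alpha=1.0, truncation=10)
```

The weights sum to one, and how many of them carry non-negligible mass is driven by `alpha` and the random draws rather than fixed in advance, which is the property the summary refers to as a learnable number of modes.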

## Key findings

- Accuracy improves over baselines, by roughly 100% in some cases.
- The technique applies without change to Decision Trees, Linear Probability models, and Gradient Boosted Models.
- Only one hyperparameter needs to be set in practice.

## Abstract

A desirable property of interpretable models is small size, so that they are easily understandable by humans. This leads to the following challenges: (a) small sizes typically imply diminished accuracy, and (b) bespoke levers provided by model families to restrict size, e.g., L1 regularization, might be insufficient to reach the desired size-accuracy trade-off. We address these challenges here. Earlier work has shown that learning the training distribution creates accurate small models. Our contribution is a new technique that exploits this idea. The training distribution is encoded as a Dirichlet Process to allow for a flexible number of modes that is learnable from the data. Its parameters are learned using Bayesian Optimization, a design choice that makes the technique applicable to non-differentiable loss functions. To avoid the challenges with high dimensionality, the data is first projected down to one dimension using uncertainty scores of a separate probabilistic model, which we refer to as the uncertainty oracle. We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe $\sim 100\%$ improvement over baselines), and (b) it may be applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models. Additionally, we show that it (1) is more accurate than its predecessor, (2) requires only one hyperparameter to be set in practice, (3) accommodates a multi-variate notion of model size, e.g., both maximum depth of a tree and number of trees in Gradient Boosted Models, and (4) works across different feature spaces between the uncertainty oracle and the interpretable model, e.g., a GRU might act as an oracle for a decision tree that ingests n-grams.
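
The projection step described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the uncertainty score is assumed to be a margin between the oracle's top two class probabilities, a single Beta density stands in for the Dirichlet Process encoding, and its parameters `a` and `b` are fixed by hand rather than learned with Bayesian Optimization. All function names and the `oracle_probs` values are hypothetical.

```python
import math
import random


def margin_uncertainty(probs):
    """Map a class-probability vector to a 1-D score in [0, 1].

    Assumption: uncertainty = 1 - (top probability - runner-up probability).
    """
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def beta_pdf(u, a, b):
    """Density of Beta(a, b) at u, with u clipped away from 0 and 1."""
    u = min(max(u, 1e-6), 1.0 - 1e-6)
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * u ** (a - 1) * (1.0 - u) ** (b - 1)


def resample_by_uncertainty(oracle_probs, a, b, k, seed=0):
    """Draw k training indices, weighted by a Beta density over the 1-D scores."""
    scores = [margin_uncertainty(p) for p in oracle_probs]
    weights = [beta_pdf(u, a, b) for u in scores]
    rng = random.Random(seed)
    indices = rng.choices(range(len(scores)), weights=weights, k=k)
    return scores, indices


# Hypothetical oracle outputs for four training points of a binary task.
oracle_probs = [[0.99, 0.01], [0.80, 0.20], [0.55, 0.45], [0.50, 0.50]]
scores, idx = resample_by_uncertainty(oracle_probs, a=5.0, b=1.0, k=1000)
```

With `a=5.0, b=1.0` the density favors high-uncertainty points, so the resampled set is dominated by the last two examples. A small interpretable model would then be trained on the resampled indices. Because only the oracle's probabilities enter the score, the oracle and the small model need not share a feature space.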

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.06852/full.md

## Figures

27 figures with captions in the complete paper: https://tomesphere.com/paper/1906.06852/full.md

## References

134 references — full list in the complete paper: https://tomesphere.com/paper/1906.06852/full.md

---
Source: https://tomesphere.com/paper/1906.06852