Observational Multiplicity

Erin George; Deanna Needell; Berk Ustun

arXiv:2507.23136·cs.LG·August 1, 2025

TL;DR

This paper studies how probabilistic classification tasks can admit multiple models that perform almost equally well yet assign conflicting predictions to individuals, and proposes a regret-based measure to quantify the resulting arbitrariness and support safer deployment.

Contribution

It introduces a novel regret measure for probabilistic classifiers, providing a way to quantify and analyze arbitrariness in model predictions due to observational multiplicity.

Findings

01 Regret varies across different groups in datasets.

02 Estimating regret can improve safety through abstention strategies.

03 The method applies broadly to practical classification tasks.

Abstract

Many prediction tasks can admit multiple models that perform almost equally well. This phenomenon can undermine interpretability and safety when competing models assign conflicting predictions to individuals. In this work, we study how arbitrariness can arise in probabilistic classification tasks as a result of an effect that we call observational multiplicity. We discuss how this effect arises in a broad class of practical applications where we learn a classifier to predict probabilities $p_{i} \in [0, 1]$ but are given a dataset of observations $y_{i} \in \{0, 1\}$. We propose to evaluate the arbitrariness of individual probability predictions through the lens of regret. We introduce a measure of regret for probabilistic classification tasks, which captures how the predictions of a model could change if the training labels were different. We present a…
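The abstract does not state the regret measure itself, so the following is only a rough illustration of the underlying idea, not the authors' definition or method. It assumes that alternative "plausible" labels can be drawn as $y'_{i} \sim \mathrm{Bernoulli}(\hat{p}_{i})$ from a fitted model, refits on each draw, and reports how far each individual's predicted probability moves on average. The function names (`fit_logistic`, `label_resampling_spread`), the logistic model, and the synthetic data are all hypothetical choices made for this sketch.

```python
import numpy as np

def predict_proba(X, w):
    """Predicted probabilities of a logistic model with weights w."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def fit_logistic(X, y, steps=500, lr=0.5):
    """Unregularized logistic regression fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (predict_proba(X, w) - y) / len(y)
        w -= lr * grad
    return w

def label_resampling_spread(X, y, n_draws=20, seed=0):
    """Average per-point change in predictions when labels are redrawn.

    Assumption (not from the paper): alternative labels are sampled as
    y' ~ Bernoulli(p_hat), where p_hat comes from the model fit on y.
    """
    rng = np.random.default_rng(seed)
    p_hat = predict_proba(X, fit_logistic(X, y))
    deviations = np.zeros((n_draws, len(y)))
    for k in range(n_draws):
        y_alt = rng.binomial(1, p_hat)           # one plausible relabeling
        p_alt = predict_proba(X, fit_logistic(X, y_alt))
        deviations[k] = np.abs(p_alt - p_hat)    # per-point prediction shift
    return p_hat, deviations.mean(axis=0)

# Synthetic example: intercept plus two Gaussian features.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
p_true = predict_proba(X, np.array([0.2, 1.0, -1.0]))
y = rng.binomial(1, p_true)

p_hat, spread = label_resampling_spread(X, y)
# Points whose predictions shift the most under relabeling.
most_sensitive = np.argsort(spread)[-5:]
```

Points with a large `spread` are those whose predicted probability depends most on which labels happened to be observed; a deployment could, for instance, flag such points for abstention, in the spirit of the findings listed above.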
