Reconciling Model Multiplicity for Downstream Decision Making
Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu

TL;DR
This paper addresses the challenge of model multiplicity in decision-making, proposing a calibration framework that aligns predictive models with downstream tasks, improving decision accuracy and model agreement.
Contribution
The paper introduces a novel calibration algorithm that reconciles predictive models for better downstream decision-making, even without direct access to true probability distributions.
Findings
Improved downstream decision-making losses with calibrated models
Models achieve near-universal agreement on best-response actions
Algorithm effective with empirical data sets
Abstract
We consider the problem of model multiplicity in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on the best-response action for a downstream loss function. We show that even when the two predictive models approximately agree on their individual predictions almost everywhere, it is still possible for their induced best-response actions to differ on a substantial portion of the population. We address this issue by proposing a framework that calibrates the predictive models with regard to both the downstream decision-making problem and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true…
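The abstract's central observation, that near-agreement on predictions does not imply agreement on best-response actions, can be illustrated with a toy binary-outcome example. This is a minimal sketch and not the paper's algorithm: the loss matrix, the population size, and the prediction values are all hypothetical, chosen so that the two models differ by only 0.02 on half of the population yet straddle the decision threshold there.

```python
import numpy as np

def best_response(pred, loss_matrix):
    """Return, per individual, the action with the lowest expected loss.

    Assumes a binary outcome y ~ Bernoulli(pred) and a hypothetical layout
    where loss_matrix[a, y] is the loss of action a under outcome y.
    """
    pred = np.asarray(pred, dtype=float)
    # Expected loss of action a: (1 - p) * loss[a, 0] + p * loss[a, 1]
    expected = (np.outer(1.0 - pred, loss_matrix[:, 0])
                + np.outer(pred, loss_matrix[:, 1]))
    return expected.argmin(axis=1)

# Hypothetical two-action loss: action 1 is correct when y = 1, action 0 when y = 0.
loss_matrix = np.array([[0.0, 1.0],
                        [1.0, 0.0]])

n = 1000
idx = np.arange(n)
# Two hypothetical models: identical on half the population, and only
# 0.02 apart on the other half, but on opposite sides of p = 0.5 there.
f1 = np.where(idx < n // 2, 0.49, 0.90)
f2 = np.where(idx < n // 2, 0.51, 0.90)

max_gap = np.abs(f1 - f2).max()
disagreement = np.mean(best_response(f1, loss_matrix)
                       != best_response(f2, loss_matrix))

print(f"max prediction gap: {max_gap:.3f}")
print(f"fraction with different best responses: {disagreement:.2f}")
```

In this toy setup the predictions never differ by more than about 0.02, yet the induced actions differ on half of the population, which is the kind of gap the reconciliation and calibration steps described in the abstract are meant to close.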
Peer Reviews
Decision: ICLR 2025 Poster
- This paper studies an important and practical problem of model multiplicity. The paper is well organized overall and clearly presented.
- The illustrative example in Figure 1 is particularly helpful: it shows directly that it is insufficient to only update two predictive models so that they have improved squared loss and nearly agree on their individual predictions almost everywhere.
- Theoretical guarantee shows that the new algorithm ReDCal provides an improved ac
- The experimental results on the HAM10000 dataset show substantially larger error bars and much less smooth convergence. It would be helpful to provide more details on these differences between the two sets of results.
- The experiments compare against only one baseline, the one proposed in (Roth et al 2023). How does the proposed algorithm compare to other related work on model multiplicity?
The paper highlights an important problem: improvements to prediction models can hurt downstream decision-making, since downstream decision-makers may have loss functions that do not necessarily align with prediction accuracy. The paper combines existing work in multi-calibration with work in model multiplicity to solve this problem. The algorithm proposed by the paper seems novel and provides what seem to be sensible theoretical guarantees that trade off between improvements to prediction
The paper has weaknesses in its presentation as well as results that seem somewhat suspicious/hard to interpret precisely. The following items could be addressed and improved for presentation: - In the paper's introduction, the authors mention calibration several times, but for someone not immediately familiar with the literature it's hard to understand what it is formally. It becomes a little better defined at Lemma 2.6, but having extra background or explanation in the introduction would be h
This paper is overall a good and novel contribution to the predictive multiplicity literature. Specifically: + The contribution of this paper appears new in the model multiplicity literature: it gives a rigorous method for reconciliation of any two models with provable guarantees in terms of the resulting model loss (for which only one algorithm exists in the literature), while at the same time ensuring in a rigorous way that downstream decision making is not affected negatively (which is new);
While overall I believe this to be a good-quality paper, there is the following (relatively non-major) consideration that I would call a weakness: - The paper currently appears written with a primarily theoretical audience in mind, but I think it could still do a better/more thorough job coming up with/describing experiments. It currently gives two semi-synthetic ones. In the first one, linear decision losses are generated in a Gaussian manner --- so that the two vision models are essentially b
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Sparse Evolutionary Training
