Online Decision Deferral under Budget Constraints
Mirabel Reid, Tom Sühr, Claire Vernade, Samira Samadi

TL;DR
This paper introduces an adaptive online decision deferral framework using a contextual bandit model that manages budget constraints and partial feedback, improving decision-making efficiency in dynamic, real-world scenarios.
Contribution
It presents a novel contextual bandit approach for online decision deferral with budget constraints, including theoretical guarantees and practical extensions.
Findings
The proposed algorithm comes with theoretical regret guarantees.
Extensions demonstrate high effectiveness on real-world datasets.
Framework adapts to changing task distributions and feedback types.
Abstract
Machine Learning (ML) models are increasingly used to support or substitute decision making. In applications where skilled experts are a limited resource, it is crucial to reduce their burden and automate decisions when the performance of an ML model is at least of equal quality. However, models are often pre-trained and fixed, while tasks arrive sequentially and their distribution may shift. In that case, the respective performance of the decision makers may change, and the deferral algorithm must remain adaptive. We propose a contextual bandit model of this online decision making problem. Our framework includes budget constraints and different types of partial feedback models. Beyond the theoretical guarantees of our algorithm, we propose efficient extensions that achieve remarkable performance on real-world datasets.
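To make the setting in the abstract concrete, here is a minimal sketch of online deferral as a two-action contextual bandit with a budget on expert queries. It is not the authors' algorithm: the ridge-regression UCB learner, the additive price update, and every name (`RidgeUCB`, `run_deferral`, `alpha`, `eta`) are assumptions chosen for illustration. It also assumes one kind of partial feedback, where only the chosen decision maker's correctness is observed.

```python
import math
import random


class RidgeUCB:
    """Ridge regression with an upper-confidence bonus (illustrative only)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # Inverse of the regularised Gram matrix, initialised to the identity.
        self.a_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, vec):
        return [sum(r * v for r, v in zip(row, vec)) for row in self.a_inv]

    def ucb(self, x):
        theta = self._a_inv_times(self.b)  # ridge estimate of the reward weights
        v = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(vi * xi for vi, xi in zip(v, x)), 0.0))
        return mean + self.alpha * width

    def update(self, x, reward):
        v = self._a_inv_times(x)
        denom = 1.0 + sum(vi * xi for vi, xi in zip(v, x))
        # Sherman-Morrison rank-one update of the inverse Gram matrix.
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= v[i] * v[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


def run_deferral(contexts, model_correct, expert_correct, budget,
                 alpha=1.0, eta=0.05):
    """Action 0 keeps the model's decision; action 1 defers to the expert.

    Each deferral consumes one unit of budget. Only the reward of the chosen
    decision maker is revealed (assumed bandit feedback).
    """
    horizon = len(contexts)
    dim = len(contexts[0])
    learners = [RidgeUCB(dim, alpha), RidgeUCB(dim, alpha)]
    price = 0.0          # heuristic dual variable: cost charged per expert query
    remaining = budget
    actions, total_reward = [], 0
    for t, x in enumerate(contexts):
        score_model = learners[0].ucb(x)
        score_expert = learners[1].ucb(x) - price
        action = 1 if (remaining > 0 and score_expert > score_model) else 0
        reward = expert_correct[t] if action == 1 else model_correct[t]
        learners[action].update(x, reward)
        remaining -= action
        # Raise the price when spending faster than budget / horizon, else lower it.
        price = max(0.0, price + eta * (action - budget / horizon))
        actions.append(action)
        total_reward += reward
    return actions, total_reward


if __name__ == "__main__":
    rng = random.Random(0)
    ctx = [[1.0, rng.random()] for _ in range(200)]
    # Synthetic assumption: the model is reliable only on "easy" contexts.
    model_ok = [1 if c[1] < 0.6 else 0 for c in ctx]
    expert_ok = [1] * len(ctx)
    acts, total = run_deferral(ctx, model_ok, expert_ok, budget=40)
    print("deferrals:", sum(acts), "correct decisions:", total)
```

The hard check `remaining > 0` guarantees the budget is never exceeded, while the price term only spreads expert queries over the horizon. A budget-constrained bandit analysis such as the Bandits with Knapsacks line of work cited in the reviews treats this trade-off more carefully than this heuristic does.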
Peer Reviews
Decision · Submitted to ICLR 2025
The topic of the manuscript originates from a practical problem.
The original contribution of the paper is hard to justify as it is largely based on the work of Agrawal and Devanur (2016).
1. The problem formulation and the algorithm design seem to be novel and of practical interest.
2. Theoretical regret guarantees are provided for the proposed algorithms.
1. Considering the optimal static policy seems limiting, as it can be far from the true dynamic optimal policy. Would it be possible, and how difficult would it be, to extend the current regret analysis to handle dynamic regret, which uses the dynamic optimal policy as the benchmark?
2. The regret analysis seems to be a straightforward extension of existing work on UCB-based algorithms for bandit problems, as the authors mention in Section 4.
3. There is no regret guarantee for the neural line
None
The paper has limited novelty and contribution.
1. The paper simply uses an existing framework (Bandits with Knapsacks [Badanidiyuru et al., 2018], [Agrawal and Devanur, 2016]) to choose between a model's prediction and deferring to a skilled expert.
2. Limited contribution:
- The regret guarantee provided is a straightforward combination of the linear contextual bandit with knapsack guarantee [Agrawal and Devanur, 2016] with the generalized linear bandit analysis from [Li et al. (2017)],
Taxonomy
Topics: Auction Theory and Applications · Blockchain Technology Applications and Security · Information and Cyber Security
