Aligning the Evaluation of Probabilistic Predictions with Downstream Value
Novin Shahroudi, Viacheslav Komisarenko, Meelis Kull

TL;DR
This paper introduces a data-driven, neural-network-based method that aligns the evaluation of probabilistic predictions with downstream task performance, addressing the mismatch between standard predictive metrics and real-world utility.
Contribution
The paper combines weighted scoring rules with neural networks to learn proxy evaluation functions that better reflect downstream impact, building on the theory of proper scoring rules.
Findings
The method effectively aligns evaluation metrics with downstream utility in synthetic experiments.
It demonstrates scalability and adaptability across different regression tasks.
The approach reduces the need for multiple task-specific metrics and explicit cost structures.
Abstract
Every prediction is ultimately used in a downstream task. Consequently, evaluating prediction quality is more meaningful when considered in the context of its downstream use. Metrics based solely on predictive performance often diverge from measures of real-world downstream impact. Existing approaches incorporate the downstream view by relying on multiple task-specific metrics, which can be burdensome to analyze, or by formulating cost-sensitive evaluations that require an explicit cost structure, typically assumed to be known a priori. We frame this mismatch as an evaluation alignment problem and propose a data-driven method to learn a proxy evaluation function aligned with the downstream evaluation. Building on the theory of proper scoring rules, we explore transformations of scoring rules that ensure the preservation of propriety. Our approach leverages weighted scoring rules…
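To make the idea of a weighted scoring rule concrete, here is a minimal sketch of one well-known instance, the threshold-weighted CRPS, which scores a predictive CDF F against an outcome y as the integral of w(z) (F(z) - 1{y <= z})^2 over z and remains proper for any nonnegative weight w. This is an illustration only and not necessarily the construction used in the paper: the Gaussian forecast, the hand-picked weight functions, the integration bounds, and all function names are assumptions made for the example, whereas the paper learns its proxy evaluation function from data.

```python
import math


def normal_cdf(z, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2) evaluated at z."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))


def weighted_crps(mu, sigma, y, weight, lo=-10.0, hi=10.0, n=4000):
    """Midpoint-rule approximation of the threshold-weighted CRPS
    integral of weight(z) * (F(z) - 1{y <= z})^2 over [lo, hi],
    where F is the Gaussian predictive CDF. Lower is better.
    The bounds lo/hi and grid size n are illustrative assumptions."""
    dz = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * dz
        indicator = 1.0 if y <= z else 0.0
        total += weight(z) * (normal_cdf(z, mu, sigma) - indicator) ** 2 * dz
    return total


def uniform(z):
    """Weight 1 everywhere: recovers the ordinary CRPS."""
    return 1.0


def upper_tail(z):
    """Hypothetical downstream focus: only thresholds above 1 matter."""
    return 1.0 if z >= 1.0 else 0.0


y = 0.0  # observed outcome
crps_centered = weighted_crps(0.0, 1.0, y, uniform)    # forecast N(0, 1)
crps_shifted = weighted_crps(1.0, 1.0, y, uniform)     # forecast N(1, 1)
tail_centered = weighted_crps(0.0, 1.0, y, upper_tail)
tail_shifted = weighted_crps(1.0, 1.0, y, upper_tail)

print(f"CRPS        centered={crps_centered:.4f} shifted={crps_shifted:.4f}")
print(f"tail-weight centered={tail_centered:.4f} shifted={tail_shifted:.4f}")
```

Changing the weight changes which forecast errors are penalized: with the upper-tail weight, only probability mass misplaced above the threshold contributes to the score. Choosing that emphasis by hand requires knowing the downstream task in advance, which is the kind of prior knowledge a learned proxy is meant to avoid.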
