
TL;DR
This paper examines the reliability of Bayesian evidence ratios for model comparison, highlighting the variability in evidence and advocating for decision thresholds that consider error probabilities, with historical insights from Turing.
Contribution
It analyzes the statistical properties of evidence ratios, emphasizing the importance of decision thresholds and error trade-offs in Bayesian model selection.
Findings
Evidence ratios can vary widely across data realizations.
Relying solely on Jeffreys' scale may be misleading for decisions.
Considering error probabilities improves decision reliability.
Abstract
Bayesian evidence ratios give a very attractive way of comparing models, and being able to quote the odds on a particular model seems a very clear motivation for making a choice. Jeffreys' scale of evidence is often used in the interpretation of evidence ratios. A natural question is, how often will you get it right when you choose on the basis of some threshold value of the evidence ratio? The evidence ratio will be different in different realizations of the data, and its utility can be examined in a Neyman-Pearson-like way to see what the trade-offs are between statistical power (the chance of "getting it right") and the false alarm rate, that is, picking the alternative hypothesis when the null is actually true. I will show some simple examples which show that there can be a surprisingly large range for an evidence ratio under different realizations of the data. It seems best not to…
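The trade-off described in the abstract can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not the paper's own example: the toy models (M0: Gaussian data with mean fixed at 0; M1: mean drawn from a N(0, tau^2) prior; unit noise variance), the function names, the sample size, and the thresholds are all hypothetical choices made for illustration. It draws many data realizations under each model, computes the log evidence ratio for each, and reports the false alarm rate and power at a few thresholds.

```python
import numpy as np

def log_evidence_ratio(xbar, n, tau):
    """ln(Z1/Z0) for the assumed toy models, as a function of the sample mean.

    M0: mu = 0 exactly.  M1: mu ~ N(0, tau^2).  Data have unit variance.
    """
    s0 = 1.0 / n            # variance of the sample mean under M0
    s1 = tau**2 + 1.0 / n   # marginal variance of the sample mean under M1
    return 0.5 * np.log(s0 / s1) + 0.5 * xbar**2 * (1.0 / s0 - 1.0 / s1)

def simulate(n=20, tau=1.0, n_real=100_000, seed=0):
    """Return ln evidence ratios for realizations drawn under M0 and under M1."""
    rng = np.random.default_rng(seed)
    # Realizations where the null (M0) is true.
    xbar_null = rng.normal(0.0, np.sqrt(1.0 / n), n_real)
    # Realizations where the alternative (M1) is true: draw mu, then the data mean.
    mu = rng.normal(0.0, tau, n_real)
    xbar_alt = rng.normal(mu, np.sqrt(1.0 / n))
    return log_evidence_ratio(xbar_null, n, tau), log_evidence_ratio(xbar_alt, n, tau)

def error_rates(lnb_null, lnb_alt, threshold):
    """False alarm rate and power when M1 is chosen if ln(Z1/Z0) > threshold."""
    false_alarm = np.mean(lnb_null > threshold)  # chose M1 although M0 was true
    power = np.mean(lnb_alt > threshold)         # chose M1 and M1 was true
    return false_alarm, power

lnb_null, lnb_alt = simulate()
# Spread of the evidence ratio across realizations when M1 is true.
print("ln B percentiles under M1 (5, 50, 95):", np.percentile(lnb_alt, [5, 50, 95]))
for t in (1.0, 2.5, 5.0):  # illustrative thresholds only
    fa, pw = error_rates(lnb_null, lnb_alt, t)
    print(f"threshold ln B = {t}: false alarm = {fa:.3f}, power = {pw:.3f}")
```

Raising the threshold lowers the false alarm rate but also lowers the power, which is the Neyman-Pearson-style trade-off the abstract refers to. The percentile line gives a rough sense of how widely the ratio can range between realizations in this toy setting.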
