BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models
Yu Feng, Ben Zhou, Weidong Lin, Dan Roth

TL;DR
BIRD is a new probabilistic inference framework that leverages large language models to produce more accurate and trustworthy probability estimates, improving decision-making in real-world tasks.
Contribution
BIRD introduces a novel method combining Bayesian networks with LLM abductions to enhance probabilistic estimation accuracy.
Findings
BIRD achieves 30% better probability estimation accuracy than LLM baselines.
The framework improves the trustworthiness of decision-making processes.
It effectively aligns Bayesian networks with LLM-generated abductive factors.
Abstract
Predictive models often need to work with incomplete information in real-world tasks. Consequently, they must provide reliable probability or confidence estimation, especially in large-scale decision-making and planning tasks. Current large language models (LLMs) are insufficient for accurate estimations, but they can generate relevant factors that may affect the probabilities, produce coarse-grained probabilities when the information is more complete, and help determine which factors are relevant to specific downstream contexts. In this paper, we make use of these capabilities of LLMs to provide a significantly more accurate probabilistic estimation. We propose BIRD, a novel probabilistic inference framework that aligns a Bayesian network with LLM abductions and then estimates more accurate probabilities in a deduction step. We show BIRD provides reliable probability estimations that…
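As a rough illustration of the deduction step the abstract describes, the sketch below combines per-factor probabilities into one estimate for a binary outcome. It is not the paper's algorithm. It assumes a uniform prior over the two outcomes and factors that are conditionally independent given the outcome, and the function name `combine_factor_probs` is hypothetical.

```python
from math import prod


def combine_factor_probs(per_factor_probs):
    """Combine P(outcome | f_j) values into a single P(outcome | f_1..f_n).

    Assumptions (ours, for illustration only): binary outcome, uniform
    prior, and factors conditionally independent given the outcome.
    Under these assumptions the posterior is proportional to the product
    of the per-factor probabilities.
    """
    if not per_factor_probs:
        # With no observed factors we fall back to the uniform prior.
        return 0.5
    for p in per_factor_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")

    support = prod(per_factor_probs)                    # evidence for the outcome
    against = prod(1.0 - p for p in per_factor_probs)   # evidence for its complement
    if support + against == 0.0:
        # Happens only when some factors say 0 and others say 1.
        raise ValueError("factors give contradictory certain evidence")
    return support / (support + against)


if __name__ == "__main__":
    # Two factors that each favor the outcome reinforce one another.
    print(round(combine_factor_probs([0.8, 0.7]), 4))
```

With per-factor estimates of 0.8 and 0.7, the combined value is 0.56 / (0.56 + 0.06), roughly 0.903, so agreeing factors push the estimate further from 0.5 than either does alone.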
Peer Reviews
Decision: ICLR 2025 Oral
1. The proposed Bayesian inference framework is conceptually interesting. Factoring the overall decision-making into multiple fine-grained factors is intuitive and useful for mitigating LLMs' limitations in providing accurate fine-grained probability/confidence estimation.
2. The experiments performed are carefully designed, and the overall writing is clear and sufficiently detailed.
3. The experimental results demonstrate the effectiveness of BIRD, showing its value in multiple application scenar
1. The training setting for "Learning algorithm for constrained optimization to estimate $P(O_i|f)$" (Line 266) should be more clearly described. The current description does not clarify (1) the number of learnable parameters ($P(O_i|f_j)$); (2) whether different instances share the same factors $f_j$; (3) whether there are $f_j$ in the test data that are unseen in the training data.
2. In the experiment for "Applying BIRD’s Probabilities in Decision-Making" (Line 430), the test instances on wh
This paper presented a very novel idea, a detailed method for achieving this, and a convincing evaluation of the utility of the technique. Specific strengths include:
- mechanism for prompting an LLM to extract factors & potential outcomes of a complex decision problem
- means to convert the factors into a binary Bayes network
- incorporation of model uncertainty estimates into the parameterisation of the Bayes network
- evaluation against human judgements of probability
- strong baselines
The paper was quite dense to start, and took several pages to make the application scenario clear. On my first pass it wasn't apparent to me what elements of the data were part of the problem definition, what parts were computed by the LLM, and what was computed by inference in a Bayes net. Fig 1 is reasonably easy to follow, but Fig 2 is less clear - why have the conditions changed between Fig 1 and Fig 2, and what parts of the central box are given as part of the data versus inferred from LLM i
1) The proposed framework and optimization settings, which account for weak ordering and non-interaction in the set of factors, are valid.
2) The authors' assumptions about enhancing the validity of the proposed Bayesian Inference Framework are reasonable.
3) Including unobserved factors in the calculations strengthens the effectiveness and utility of the authors' proposed method.
4) Besides precise probability estimation, the presence of follow-up generation and cross-domain experiments demonst
1) In Line 147 and Line 155, the explanation of the outcomes is separated. It seems possible to consolidate these explanations into one part. Additionally, the explanation of $F$ in the same paragraph could be made more concise. The space saved from these revisions could then be used to more clearly elaborate on the connection between Equation (5) and Equation (6) and the content in Algorithm 1.
2) Ablation settings could be expanded to cover more aspects. For example, experiments could be conduc
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam
