DeLLMa: Decision Making Under Uncertainty with Large Language Models
Ollie Liu, Deqing Fu, Dani Yogatama, Willie Neiswanger

TL;DR
DeLLMa is a framework that improves decision-making accuracy of large language models under uncertainty by using multi-step reasoning and principles from decision theory, achieving up to 40% better accuracy in complex tasks.
Contribution
We introduce DeLLMa, a novel multi-step reasoning framework that enhances LLM decision-making under uncertainty, integrating decision and utility theories for improved accuracy.
Findings
DeLLMa achieves up to 40% accuracy improvement over existing methods.
Performance improves as test-time compute increases.
Human evaluations validate component effectiveness.
Abstract
The potential of large language models (LLMs) as decision support tools is increasingly being explored in fields such as business, engineering, and medicine, which often face challenging tasks of decision-making under uncertainty. In this paper, we show that directly prompting LLMs on these types of decision-making problems can yield poor results, especially as the problem complexity increases. To aid in these tasks, we propose DeLLMa (Decision-making Large Language Model assistant), a framework designed to enhance decision-making accuracy in uncertain environments. DeLLMa involves a multi-step reasoning procedure that integrates recent best practices in scaling inference-time reasoning, drawing upon principles from decision theory and utility theory, to provide an accurate and human-auditable decision-making process. We validate our procedure on multiple realistic decision-making…
Peer Reviews
Decision: ICLR 2025 Spotlight
- The exposition and writing are in good condition overall. (There is some room for improvement in organisation and clarification.)
- The whole framework is written with careful formal detail, thanks to its decision-theoretic framing, which is a great starting point.
- The reported improvements are strong, and the work has high impact potential given the popularity of prompt engineering for LLMs. The idea is novel and justifiable (with some inherent difficulties and pitfalls).
- The depth
- Content-wise: Unfortunately, there are too many references to the appendix at crucial points. This is of course due to the tight page limit, but the authors should seriously consider rearranging the text to make it more complete. For instance, the explanations of Tables 2, 3, and 4 are not sufficient (page 10 only names them without saying what they are). The paper might be better served as a journal publication or at a venue with a larger page budget (e.g., ICML). Related: Say also that Figure 11
- Clearly explains the main concept of DeLLMa, as illustrated in Figure 1.
- Captures LLM decision-making with a triplet P = (G, A, C) and four steps, where G is the user goal inferred from the description, A is a list of actions, and C is the contextual information.
- Includes two ranking types: pairwise and top-1.
- Provides clear prompts and code in the appendix.
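To make the triplet P = (G, A, C) concrete, the following is a minimal sketch of how such a problem could be represented and how an action could be picked by maximizing expected utility. It is an illustration, not the authors' code: the class `DecisionProblem`, the functions `expected_utility` and `best_action`, and all probabilities and payoffs are hypothetical placeholders. In DeLLMa, as described above, the state forecast and the utilities would come from the LLM rather than from hand-written tables.

```python
from dataclasses import dataclass


@dataclass
class DecisionProblem:
    """Hypothetical container for the triplet P = (G, A, C)."""
    goal: str      # G: the user goal
    actions: list  # A: the candidate actions
    context: str   # C: the contextual information


def expected_utility(action, state_probs, utility):
    """Sum of utility(state, action) weighted by each state's probability."""
    return sum(p * utility(state, action) for state, p in state_probs.items())


def best_action(problem, state_probs, utility):
    """Return the action with the highest expected utility."""
    return max(
        problem.actions,
        key=lambda a: expected_utility(a, state_probs, utility),
    )


# Toy example; every number below is made up for illustration.
problem = DecisionProblem(
    goal="maximize profit from planting one fruit",
    actions=["apple", "avocado"],
    context="hypothetical market report",
)
state_probs = {"high_demand": 0.6, "low_demand": 0.4}
payoffs = {
    ("high_demand", "apple"): 10.0,
    ("low_demand", "apple"): 4.0,
    ("high_demand", "avocado"): 14.0,
    ("low_demand", "avocado"): 1.0,
}


def utility(state, action):
    return payoffs[(state, action)]


choice = best_action(problem, state_probs, utility)
print(choice)
```

With these made-up numbers, avocado has the higher expected utility (about 8.8 versus 7.6 for apple), so it is selected. Keeping the forecast and the utility as separate inputs means each can be inspected on its own.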
- Inferring the user’s goal from the context is done by the LLM, which requires the user to have a clear goal in mind. For example, if the farmer has not decided to plant only one type of fruit, DeLLMa may not know how to handle an implicit multiple-fruit situation.
- The applicable scope of DeLLMa is limited. The experiments include only agriculture and stocks, which may mislead readers into thinking DeLLMa generalises across all decision-making tasks.
- There is little information about human
Originality - The paper is quite original, and the proposed pipeline is creative. I have not seen this multistep approach used with LLMs anywhere before.
Quality - I found the method to be quite well motivated and well presented. The paper is reproducible, and the experiments seem cohesive and well-motivated.
Clarity - The paper is well articulated and clearly structured.
Significance - The paper is definitely an important step in the right direction. Uncertainty quantification in ML has bee
1. The context window of LLMs clearly limits the state and action spaces, which limits the method's real-world applicability.
2. It appears that the method is very sensitive to the first state forecasting step. While the method is robust in general, I would like to understand whether this is indeed a big bottleneck.
3. The proposed utility elicitation method (pairwise ranking) seems a bit too simple for real-world scenarios. I would like to know if other methods were tried for this.
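As a rough illustration of the pairwise-ranking idea raised in point 3, the sketch below turns a list of pairwise preferences into per-item scores using a simple win-rate heuristic. This is a stand-in chosen for brevity, not necessarily the elicitation procedure used in the paper; the function `win_rate_scores` and the toy comparisons are hypothetical.

```python
from collections import defaultdict


def win_rate_scores(comparisons):
    """Score each item by the fraction of its comparisons that it won.

    comparisons: iterable of (winner, loser) pairs.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {item: wins[item] / games[item] for item in games}


# Toy preferences over three hypothetical outcomes.
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]
scores = win_rate_scores(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Here A wins both of its comparisons, B wins one of two, and C wins none, giving the ranking A, B, C. A richer preference model could be substituted behind the same interface if this heuristic proves too coarse.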
Taxonomy
Topics: Semantic Web and Ontologies · AI-based Problem Solving and Planning
