Rational Decision-Making Agent with Internalized Utility Judgment
Yining Ye, Xin Cong, Shizuo Tian, Yujia Qin, Chong Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

TL;DR
RadAgent is a novel decision-making framework for large language models that internalizes utility judgments through iterative learning, enabling autonomous, rational decisions without relying on external metrics, and demonstrating superior performance on diverse tasks.
Contribution
The paper introduces RadAgent, which develops internal utility judgments via Elo-based scoring, improving autonomous decision-making in LLMs beyond external metric guidance.
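To make the idea of Elo-based scoring concrete, here is a minimal sketch of how pairwise preferences can be turned into numerical utilities. It uses the standard Elo update, which is an assumption: the paper may use a different variant. The names `expected_score`, `elo_update`, `rank_candidates`, and `toy_judge` are hypothetical, as are the constants `k=32`, `scale=400`, and the initial rating of 1000. `toy_judge` merely stands in for the LLM comparison and is not the paper's judging prompt.

```python
from typing import Callable, Dict, List, Tuple


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rank_candidates(
    candidates: List[str],
    judge: Callable[[str, str], float],
    initial: float = 1000.0,
    rounds: int = 3,
) -> Dict[str, float]:
    """Run round-robin pairwise comparisons and return a rating per candidate."""
    ratings = {c: initial for c in candidates}
    for _ in range(rounds):
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                score_a = judge(a, b)
                ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return ratings


def toy_judge(a: str, b: str) -> float:
    """Hypothetical stand-in for an LLM judge: prefers the shorter plan."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) < len(b) else 0.0


candidates = ["plan with many redundant steps", "short plan", "medium length plan"]
ratings = rank_candidates(candidates, toy_judge)
best = max(ratings, key=ratings.get)
```

In this sketch every candidate starts from the same rating, so the ordering emerges only from the comparisons. Replacing `toy_judge` with a call that asks a model which of two decision sequences is better would be one way to approximate the setting the summary describes.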
Findings
RadAgent outperforms baselines with over 10% higher Pass Rate.
It achieves higher-quality solutions and reduces API call costs.
Experimental results validate the effectiveness of internal utility learning.
Abstract
Large language models (LLMs) have demonstrated remarkable advancements and have attracted significant efforts to develop LLMs into agents capable of executing intricate multi-step decision-making tasks beyond traditional NLP applications. Existing approaches to LLM-based decision-making predominantly build upon the manually-designed external performance metrics to guide the decision-making process. However, reliance on the external performance metrics as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision making, it is imperative for the agent to develop its rationality from its posterior experiences to judge decisions independently. Central to the development of rationality is the construction of an internalized utility judgment, capable of assigning numerical utilities to each decision. This…
Peer Reviews
Decision · Submitted to ICLR 2024
I am not aware of any existing approaches resembling the one proposed in the paper. The key idea of backing out Elo rankings using pairwise evaluations from an LLM is intuitively appealing, but subtle enough that the submission merits credit for originality. The submission is mostly clear and well written. LLM decision-making is an important problem, and the submission's results are strong relative to previous work.
### Clarity Issues

Below I list a couple of places I had trouble following along with some explanations.

---

> In contrast, RADAGENT assigns lower scores to fewer potential decision steps, displaying a trend for exploring novel avenues, which exemplifies a scenario demanding diversity in exploration.

I wasn't able to parse this sentence.

---

In RQ5, I feel there could be some more guidance to the reader. It seems like RaDAgent has the highest or tied for highest incidence ratio for both Ha
1. The paper builds on known principles on sequential decision making
2. The results presented have a sizeable performance lead over the baselines
I think that the paper has a few weaknesses in some key areas that might limit its applicability. Also, the empirical evaluation section was a bit confusing to read. A bit of reorganization here might help.
1. The paper mentions (Sec 3.2) that the initial Elo scores for a decision sequence (iteration 1) are fixed. How are comparisons performed? That is, how is a "win" determined? I assume it is binary (task completed or not). In this case, are two decision sequences where both are "wins" but one is long
1. The paper is articulate, presenting high-quality content that is easy to follow.
2. The methodology proposed is novel. By leveraging LLM’s inherent capability for value assessment, it pioneers a way to guide decision-making without the need for manually tailored prompts for value evaluation.
3. The experimental results robustly corroborate the efficacy of the proposed method. RaDAgent consistently surpasses the benchmarks in various tasks. Moreover, the correlation observed between Elo scores
1. The revelation of the experimental details is inadequate. Notably, not all prompts used are revealed, and no examples are provided. Such omissions pose a challenge for reproducibility and make it difficult to pinpoint the source of the observed improvements.
2. There's a conspicuous absence of an ablation study concerning the Elo-based value evaluation. By not contrasting it with a manual value evaluation prompt, it remains ambiguous whether the observed performance boost arises from the Elo-
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
