Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang

TL;DR
This paper introduces step-level Q-value models for LLM agents. Learned value estimates guide action selection at each decision step, which improves decision-making performance, and the approach applies across different agents and tasks.
Contribution
The paper proposes a novel method to estimate step-level Q-values for LLM agents using Monte Carlo Tree Search and direct policy optimization, enhancing multi-step decision-making.
Findings
Q-value models improved agent performance by over 100% on WebShop.
Q-value models increased HotPotQA performance by 75%.
Method generalizes across different LLM agents and tasks.
Abstract
Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Policy Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before…
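The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `margin` parameter, the best-versus-worst pairing rule, and the toy scoring function are all assumptions made for illustration. The sketch shows two pieces: turning Q-annotated candidate actions for one state into a preference pair, and picking the highest-scoring action at inference time.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases; the paper does not prescribe these.
State = str
Action = str
PreferencePair = Tuple[State, Action, Action]  # (state, chosen, rejected)


def build_preference_pairs(
    state: State,
    candidates: Sequence[Tuple[Action, float]],
    margin: float = 0.0,
) -> List[PreferencePair]:
    """Pair the best and worst candidate actions for one state.

    `candidates` holds (action, q_value) tuples, e.g. Q values
    annotated by a tree search. Pairing best against worst is an
    assumed rule, not necessarily the paper's.
    """
    if len(candidates) < 2:
        return []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_action, best_q = ranked[0]
    worst_action, worst_q = ranked[-1]
    # Skip states where the search found no meaningful value gap.
    if best_q - worst_q <= margin:
        return []
    return [(state, best_action, worst_action)]


def select_action(
    state: State,
    candidate_actions: Sequence[Action],
    q_model: Callable[[State, Action], float],
) -> Action:
    """Return the candidate with the highest estimated Q value.

    Raises ValueError if `candidate_actions` is empty.
    """
    return max(candidate_actions, key=lambda action: q_model(state, action))


# Toy stand-in for a trained Q-value model (assumption for the demo).
def toy_q_model(state: State, action: Action) -> float:
    scores = {"search[red shoes]": 0.2, "click[buy now]": 0.9}
    return scores.get(action, 0.0)


pairs = build_preference_pairs(
    "product page",
    [("click[back]", 0.1), ("click[buy now]", 0.7), ("search[red shoes]", 0.4)],
)
chosen = select_action(
    "product page", ["search[red shoes]", "click[buy now]"], toy_q_model
)
```

In a real setup, `toy_q_model` would be replaced by the trained Q-value model, and the candidate actions would be sampled from the LLM agent at each step.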
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Service-Oriented Architecture and Web Services
