Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta; Edmund Mills; Naman Garg; Sumeet Motwani; Chelsea Finn; Divyansh Garg; Rafael Rafailov

arXiv:2408.07199 · cs.AI · August 15, 2024 · 5 cites

TL;DR

This paper introduces a novel framework combining guided Monte Carlo Tree Search, self-critique, and iterative fine-tuning with DPO to enhance LLM-based autonomous agents in complex, multi-step reasoning tasks, significantly improving their performance in dynamic environments.

Contribution

The paper presents a new method that integrates MCTS, self-critique, and off-policy fine-tuning to enable LLM agents to learn from diverse trajectories, outperforming existing approaches in interactive environments.
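One way to picture how search results could become training signal is sketched below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Candidate` class, the `blended_value` and `preference_pairs` functions, the linear blend controlled by `alpha`, and the `margin` threshold are all hypothetical names and choices introduced here. The idea shown is that sibling actions at one decision point are compared by a value that mixes a search-derived estimate with a self-critique score, and only clearly separated pairs are kept as (chosen, rejected) examples.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    """One candidate action at a single decision point (hypothetical structure)."""
    action: str
    q_search: float    # assumed: average return observed in tree-search rollouts
    q_critique: float  # assumed: self-critique score for the action, in [0, 1]


def blended_value(c: Candidate, alpha: float = 0.5) -> float:
    # Assumed linear blend of the two signals; alpha weights the search estimate.
    return alpha * c.q_search + (1 - alpha) * c.q_critique


def preference_pairs(candidates, alpha=0.5, margin=0.2):
    """Return (chosen, rejected) action pairs whose blended values differ by >= margin."""
    pairs = []
    for a, b in combinations(candidates, 2):
        va, vb = blended_value(a, alpha), blended_value(b, alpha)
        if abs(va - vb) >= margin:
            chosen, rejected = (a, b) if va > vb else (b, a)
            pairs.append((chosen.action, rejected.action))
    return pairs


# Toy data, invented for illustration only.
candidates = [
    Candidate("click[search]", q_search=0.9, q_critique=0.8),
    Candidate("click[back]", q_search=0.1, q_critique=0.2),
    Candidate("type[query]", q_search=0.8, q_critique=0.7),
]
pairs = preference_pairs(candidates)
```

With this toy data the two strong actions are too close to be ranked against each other, so only their comparisons with the weak action survive. Because a failed branch still yields a rejected example, unsuccessful trajectories contribute to training rather than being discarded.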

Findings

01. Outperforms behavior cloning and reinforced fine-tuning baselines.

02. Achieves 81.7% success rate in web navigation after one day of data collection.

03. Boosts Llama-3 70B zero-shot performance from 18.6% to 81.7%.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap, through supervised fine-tuning on curated expert demonstrations, often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization…
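For readers unfamiliar with Direct Preference Optimization, the per-pair objective can be sketched as follows. This is the standard DPO loss for a single (chosen, rejected) pair, not the off-policy variant the abstract refers to, whose details are not given on this page. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration; the inputs are summed log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch)."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on log-probability ratios against the reference model, which is why no separate reward model is needed and why preference pairs collected earlier can be reused for training.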

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper proposes a method for online fine-tuning a web navigation agent and the empirical results show that the proposed method outperforms the baseline methods.
- The paper evaluates Agent Q on both simulated and real-world tasks, demonstrating its effectiveness in various scenarios.

Weaknesses

- The idea is not novel enough since Agent Q uses a combination of DPO, MCTS, and process supervision for web navigation task.
- Lack of experimental details, such as hyperparameters, and the significance of the results. Also, the code is not provided.
- Lack of ablation studies to analyze the effects of number of iterations in Agent Q.

Format and writing issues:
- The paper cites papers using wrong LaTeX commands. For example, "(Zhou et al., 2024c) has shown success formulating the RL problem

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The proposed approach is mostly sound (except for the open-loop planning component). MCTS can be an effective method for gathering trajectory data for agent training, and preference-based learning could be useful in web agent training scenarios.
- Overall, the manuscript is clear and easy to follow. The figures are made well for fair overviews. The writing is clear in general.

Weaknesses

- The proposed approach generates a plan at the beginning of each trajectory, which seems to stay frozen for the rest. This open-loop planning can be a bottleneck for improving agents' capabilities in more general, complex scenarios.
- Despite the expressions such as "real-world booking scenarios" (Abstract) and "Scaling To Real World Websites" (Section 6), the proposed OpenTable experiment doesn't seem ideal for testing the proposed approach's "real-world" capabilities. The set of tasks are gen

Reviewer 03 · Rating 5 · Confidence 5

Strengths

The proposed method is successfully applied to the real-world booking scenario and achieves a $95.4\%$ success rate after a single day of data collection.

Weaknesses

The experimental section is incomplete. The ablation study is entirely missing, providing no concrete indications of the strength of each component or design choice, such as the self-critique mechanism, the four types of actions, the effects of training sample sizes, and the weighted coefficient $\alpha$. While I understand the challenges of conducting experiments in real scenarios, ablations in the WebShop context are essential to validate the method's effectiveness. Furthermore, there is no an

Taxonomy

Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Logic, Reasoning, and Knowledge