AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards
Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He

TL;DR
This paper introduces AWPO, a reinforcement learning framework that adaptively integrates reasoning rewards into advantage estimation when training large language models for tool use, achieving state-of-the-art results with high parameter efficiency.
Contribution
We propose AWPO, a novel RL method that effectively combines reasoning and outcome rewards through adaptive advantage estimation for improved tool utilization in LLMs.
Findings
AWPO outperforms strong baselines on standard benchmarks.
A 4B model with AWPO surpasses Grok-4 by 16% in multi-turn accuracy.
AWPO maintains generalization on out-of-distribution tasks.
Abstract
While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves…
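The abstract names three ingredients (variance-aware gating, difficulty-aware weighting, and clipping of the reasoning signal) but does not give their formulas here. The following is a minimal illustrative sketch of how such a mix could look on one group of rollouts, not the paper's actual method. Every concrete choice is an assumption: the function names `group_relative` and `mixed_advantages`, the hard gate on outcome-reward variance, the `var_floor` and `clip` parameters, the difficulty weight defined as one minus the mean outcome reward, and outcome rewards lying in [0, 1].

```python
from statistics import mean, pvariance


def group_relative(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pvariance(rewards) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_advantages(outcome, reasoning, var_floor=0.05, clip=1.0):
    """Hypothetical mixing rule (an assumption, not the paper's formula).

    outcome:   per-rollout outcome rewards, assumed to lie in [0, 1]
    reasoning: per-rollout reasoning-quality rewards
    """
    adv_out = group_relative(outcome)
    adv_reason = group_relative(reasoning)

    # Assumed variance-aware gate: use the reasoning signal only when the
    # outcome rewards barely differ within the group.
    gate = 1.0 if pvariance(outcome) < var_floor else 0.0

    # Assumed difficulty-aware weight: a lower mean outcome reward
    # (a harder prompt) gives the reasoning signal more weight.
    difficulty = 1.0 - mean(outcome)
    weight = gate * difficulty

    # Assumed clipping: bound the reasoning advantage before adding it.
    return [
        a + weight * max(-clip, min(clip, b))
        for a, b in zip(adv_out, adv_reason)
    ]


if __name__ == "__main__":
    # All rollouts fail, so only the reasoning rewards separate them.
    print(mixed_advantages([0, 0, 0, 0], [0.2, 0.4, 0.6, 0.8]))
    # Outcomes already differ, so the gate closes and reasoning is ignored.
    print(mixed_advantages([1, 0, 1, 0], [0.2, 0.4, 0.6, 0.8]))
```

In this sketch, a group in which every rollout fails still receives a nonzero learning signal from the reasoning rewards, while a group whose outcomes already differ is driven by the outcome advantage alone.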
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
