AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards

Zihan Lin; Xiaohan Wang; Hexiong Yang; Jiajun Chai; Jie Cao; Guojun Yin; Wei Lin; Ran He

arXiv:2512.19126·cs.CL·January 16, 2026

TL;DR

This paper introduces AWPO, a reinforcement learning framework that adaptively integrates reasoning rewards into the training of large language models to improve their tool use, achieving state-of-the-art results with high parameter efficiency.

Contribution

We propose AWPO, a novel RL method that effectively combines reasoning and outcome rewards through adaptive advantage estimation for improved tool utilization in LLMs.

Findings

1. AWPO outperforms strong baselines on standard benchmarks.

2. A 4B-parameter model trained with AWPO surpasses Grok-4 by 16% in multi-turn accuracy.

3. AWPO maintains its generalization ability on out-of-distribution tasks.

Abstract

While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves…
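The abstract describes the mechanism only at a high level, so the following is a hypothetical sketch, not the authors' formulas. It computes group-relative (GRPO-style) advantages for outcome and reasoning rewards over one prompt's rollouts, scales the reasoning term by an assumed variance gate and an assumed difficulty weight, and feeds the result into the standard PPO clipped surrogate, used here as a stand-in for AWPO's tailored clipping, whose details are not given. The names `mixed_advantages`, `beta`, and `max_var`, and the specific gate and weight formulas, are illustrative assumptions.

```python
import math
from statistics import fmean, pstdev


def group_normalize(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_advantages(outcome, reasoning, beta=0.5, max_var=0.25):
    """Hypothetical mix of outcome and reasoning advantages for one prompt group.

    ASSUMPTIONS (not AWPO's published formulas):
      * outcome rewards lie in [0, 1], so their variance is at most 0.25;
      * the variance gate is 1 when all outcome rewards agree and 0 when they
        are maximally split (variance == max_var);
      * the difficulty weight is the group's failure rate, 1 - mean(outcome).
    """
    a_out = group_normalize(outcome)
    a_rsn = group_normalize(reasoning)
    gate = min(1.0, max(0.0, 1.0 - pstdev(outcome) ** 2 / max_var))
    difficulty = 1.0 - fmean(outcome)
    return [ao + beta * gate * difficulty * ar for ao, ar in zip(a_out, a_rsn)]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated so it can be minimized."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))  # pessimistic bound
    return -fmean(terms)
```

For a group where every rollout fails, `mixed_advantages([0.0] * 4, [0.2, 0.8, 0.5, 0.5])` yields roughly `[-0.71, 0.71, 0.0, 0.0]`: the outcome advantages are all zero, so only the reasoning reward separates the rollouts. For an evenly split group such as `[1.0, 0.0, 1.0, 0.0]`, the assumed gate closes and the result reduces to the outcome advantages, so the reasoning signal cannot override a clear outcome signal.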


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications