Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs

Gabriel Bo; Koa Chang; Justin Gu

arXiv:2507.11371 · cs.LG · July 16, 2025

TL;DR

SPaRK introduces a reinforcement learning method that encourages large language models to explore diverse tool usage, improving reasoning diversity and maintaining high answer quality.

Contribution

SPaRK proposes a dual-objective offline RL framework that rewards both answer quality and tool diversity, paired with a rarity-first selection strategy that favors less-frequently used but still viable tools.

Findings

01 Achieves competitive accuracy across 14 MMLU-Pro categories.

02 Exhibits higher entropy in tool selection compared to baselines.

03 Enhances reasoning diversity without sacrificing performance.
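
Finding 02 refers to entropy in tool selection. This page does not say which estimator the authors use, so the following is only a minimal sketch, assuming plain Shannon entropy (in bits) over the empirical frequency of tool calls. The function name and the tool labels are hypothetical.

```python
import math
from collections import Counter


def tool_selection_entropy(tool_calls):
    """Shannon entropy (bits) of the empirical tool-usage distribution.

    tool_calls: iterable of tool names, one entry per call made by the policy.
    Assumption: this is an illustrative metric, not the paper's exact one.
    """
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# A policy that always picks one tool has zero entropy;
# one that spreads calls evenly over four tools has two bits.
collapsed = tool_selection_entropy(["calculator"] * 4)
uniform = tool_selection_entropy(["calculator", "web_search", "python", "wiki"])
```

Under this reading, "higher entropy" means the policy's tool calls are spread more evenly across the available tools rather than concentrated on one or two.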

Abstract

We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, training a Llama-3.1 8B model through offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach uniquely employs a rarity-first exploitation strategy where a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools to encourage systematic exploration. Empirical results demonstrate that SPaRK achieves competitive performance across 14 MMLU-Pro…
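
The rarity-first idea in the abstract (a judge scores candidate actions, and the policy favors less-frequently used but still viable ones) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the viability threshold, the additive reward form, the `1 / (1 + count)` rarity bonus, and all function and tool names are hypothetical, and the judge scores are hard-coded rather than produced by GPT-4o.

```python
from collections import Counter


def pick_rarity_first(judge_scores, usage_counts, viability_threshold=0.7):
    """Pick the least-used action among those the judge rates as viable.

    judge_scores: dict mapping action name -> judge score (assumed in [0, 1]).
    usage_counts: mapping of action name -> times chosen so far.
    viability_threshold: hypothetical cutoff, not taken from the paper.
    """
    viable = [a for a, s in judge_scores.items() if s >= viability_threshold]
    if not viable:
        # Nothing clears the bar: fall back to the highest-scoring action.
        return max(judge_scores, key=judge_scores.get)
    # Rarest first; ties are broken by the higher judge score.
    return min(viable, key=lambda a: (usage_counts.get(a, 0), -judge_scores[a]))


def dual_objective_reward(quality, action, usage_counts, diversity_weight=0.5):
    """Combine answer quality with a bonus that shrinks as an action is reused.

    Assumption: additive combination with a 1 / (1 + count) rarity bonus.
    """
    rarity_bonus = 1.0 / (1.0 + usage_counts.get(action, 0))
    return quality + diversity_weight * rarity_bonus


# Hypothetical usage history and judge scores for one reasoning step.
usage = Counter({"chain_of_thought": 50, "calculator": 3, "web_search": 1})
scores = {"chain_of_thought": 0.9, "calculator": 0.8, "web_search": 0.4}

choice = pick_rarity_first(scores, usage)
usage[choice] += 1
```

Here `web_search` is the rarest action but falls below the assumed threshold, so the rarest viable action is chosen instead of the top-scoring one. That is the sense in which such a scheme trades a little judged quality for broader tool coverage.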

Code & Models

Repositories

gabrielkmbo/explore-rl
PyTorch · Official

Taxonomy

Topics: Library Science and Information Systems · Digital Rights Management and Security · Wikis in Education and Collaboration

Methods: Proximal Policy Optimization