Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs
Gabriel Bo, Koa Chang, Justin Gu

TL;DR
SPaRK is a reinforcement learning method that encourages large language models to explore diverse tool usage, increasing reasoning diversity while maintaining high answer quality.
Contribution
The paper proposes a dual-objective offline RL framework with a rarity-first exploration strategy to enhance tool diversity in LLMs.
Findings
Achieves competitive accuracy across 14 MMLU-Pro categories.
Exhibits higher entropy in tool selection compared to baselines.
Enhances reasoning diversity without sacrificing performance.
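One common way to quantify "entropy in tool selection" is the Shannon entropy of the empirical distribution over tools a model chose. The sketch below assumes that definition; the page does not state the exact metric used in the paper, and the function name and tool labels are hypothetical.

```python
import math
from collections import Counter


def tool_selection_entropy(tool_calls):
    """Shannon entropy (in bits) of the empirical tool-usage distribution.

    `tool_calls` is a list of tool names, one per step. This is an
    illustrative metric, not necessarily the one reported in the paper.
    """
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical tool names, for illustration only.
uniform = ["calculator", "web_search", "python", "cot"]
collapsed = ["cot", "cot", "cot", "cot"]
print(tool_selection_entropy(uniform))    # spread evenly over 4 tools
print(tool_selection_entropy(collapsed))  # always the same tool
```

Under this definition, a policy that always picks the same tool scores zero, and a policy that spreads its choices evenly over k tools scores log2(k) bits.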
Abstract
We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, training a Llama-3.1 8B model through offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach uniquely employs a rarity-first exploitation strategy where a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools to encourage systematic exploration. Empirical results demonstrate that SPaRK achieves competitive performance across 14 MMLU-Pro…
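The rarity-first idea described in the abstract can be sketched roughly as follows: among candidate actions the judge considers viable, prefer the tool that has been used least so far, and reward both answer quality and diversity. Everything specific here is an assumption for illustration, not the authors' implementation: the viability threshold `min_score`, the `diversity_weight`, the additive reward form, and the tool names are all hypothetical.

```python
from collections import Counter


def rarity_first_pick(judge_scores, usage_counts, min_score=0.5):
    """Pick the least-used tool among those the judge rates as viable.

    judge_scores: dict mapping tool name -> judge score (assumed in [0, 1]).
    usage_counts: mapping tool name -> how often it was chosen so far.
    min_score:    hypothetical viability threshold (an assumption).
    """
    viable = {t: s for t, s in judge_scores.items() if s >= min_score}
    if not viable:
        # Nothing passes the threshold: fall back to the best-scored action.
        return max(judge_scores, key=judge_scores.get)
    # Rarest first; ties broken by higher judge score.
    return min(viable, key=lambda t: (usage_counts.get(t, 0), -viable[t]))


def combined_reward(quality, tool, usage_counts, diversity_weight=0.2):
    """Illustrative dual-objective reward: quality plus a rarity bonus."""
    total = sum(usage_counts.values())
    freq = usage_counts.get(tool, 0) / total if total else 0.0
    return quality + diversity_weight * (1.0 - freq)


# Hypothetical scores and usage history.
scores = {"cot": 0.9, "calculator": 0.7, "web_search": 0.3}
usage = Counter({"cot": 10, "calculator": 2})
choice = rarity_first_pick(scores, usage)
print(choice, combined_reward(1.0, choice, usage))
```

In this sketch the threshold is what keeps the rarity preference from selecting tools that are rare but unhelpful: a tool must first be judged viable before its low usage count can make it the preferred action.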
Taxonomy
Topics: Library Science and Information Systems · Digital Rights Management and Security · Wikis in Education and Collaboration
Methods: Proximal Policy Optimization
