MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards

ChangSu Choi; Hoyun Song; Dongyeon Kim; WooHyeon Jung; Minkyung Cho; Sunjin Park; NohHyeob Bae; Seona Yu; KyungTae Lim

arXiv:2510.18383 · cs.CL · October 29, 2025

TL;DR

MENTOR is a reinforcement learning framework that improves tool use in small language models by combining exploration with teacher-guided dense rewards, yielding better cross-domain generalization and strategic competence.

Contribution

The paper introduces an RL-based method that uses teacher-guided dense rewards to make small models' tool use more robust and more capable.

Findings

01. Significant improvement in cross-domain generalization.

02. Enhanced strategic competence of small models.

03. Outperforms supervised fine-tuning and standard RL baselines.

Abstract

Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), generalizes poorly because it trains models to imitate a static set of teacher trajectories rather than to learn a robust methodology. Reinforcement learning (RL) offers an alternative, but standard RL with sparse rewards fails to guide SLMs effectively: they explore inefficiently and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward…
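
The abstract contrasts sparse outcome rewards with teacher-guided dense rewards. The sketch below is only an illustration of that contrast, not the paper's reward formulation, which this excerpt does not give. All names (`sparse_reward`, `dense_reward`, `alpha`) and the choice of longest-common-subsequence overlap with a teacher's tool-call sequence are assumptions made for the example.

```python
from typing import List


def sparse_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two tool-call sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def dense_reward(
    student_calls: List[str],
    teacher_calls: List[str],
    final_answer: str,
    gold_answer: str,
    alpha: float = 0.5,  # hypothetical weight on the teacher-guided term
) -> float:
    """Blend a teacher-overlap term with the sparse outcome reward.

    The overlap term gives partial credit for following the teacher's
    tool-call order even when the final answer is wrong (an assumed design,
    not taken from the paper).
    """
    if teacher_calls:
        process = lcs_len(student_calls, teacher_calls) / len(teacher_calls)
    else:
        process = 0.0
    outcome = sparse_reward(final_answer, gold_answer)
    return alpha * process + (1.0 - alpha) * outcome


if __name__ == "__main__":
    teacher = ["search", "lookup", "calculator"]
    student = ["search", "calculator"]
    # Wrong final answer: the sparse reward gives no signal,
    # while the dense reward still credits the partially correct tool use.
    print(sparse_reward("41", "42"))
    print(dense_reward(student, teacher, "41", "42"))
```

Under these assumptions, a rollout with a wrong final answer but mostly correct tool calls still receives a nonzero learning signal, which is the kind of guidance a sparse outcome reward cannot provide.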
