STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models

Jiliang Ni; Jiachen Pu; Zhongyi Yang; Jingfeng Luo; Conggang Hu

arXiv:2602.03022 · cs.AI · February 25, 2026

TL;DR

STAR introduces a novel training framework that effectively transfers large language models' capabilities into super-tiny models for function calling, achieving state-of-the-art performance with stability and efficiency.

Contribution

The paper presents Constrained Knowledge Distillation and Similarity-guided RL, innovative techniques that enhance training stability and policy optimization in tiny models.

Findings

1. 0.6B STAR model outperforms all open models under 1B.
2. Achieves state-of-the-art results on function calling benchmarks.
3. Demonstrates effective knowledge transfer from large to tiny models.

Abstract

The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones. However, existing paradigms are often plagued by overfitting, training instability, ineffective binary rewards for multi-solution tasks, and the difficulty of synergizing techniques. We introduce STAR: Similarity-guided Teacher-Assisted Refinement, a novel holistic framework that effectively transfers LLMs' capabilities to super-tiny models. STAR consists of two core technical innovations: (1) Constrained Knowledge Distillation (CKD), a training objective that augments top-k forward KL divergence to suppress confidently incorrect predictions, ensuring training stability while preserving exploration capacity for downstream RL. STAR holistically synergizes these…
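
As an illustration of the top-k forward KL term mentioned in the abstract, the following Python sketch computes KL(teacher || student) restricted to the teacher's k most likely tokens. It is a minimal sketch under stated assumptions, not the paper's CKD objective: the function names, the absence of renormalization over the top-k set, and the `off_support_mass` diagnostic (student probability placed outside the teacher's top-k, one possible way to quantify confidently incorrect predictions) are all introduced here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_top_k(teacher_probs, k):
    """Indices of the k tokens the teacher considers most likely."""
    order = sorted(range(len(teacher_probs)),
                   key=lambda i: teacher_probs[i], reverse=True)
    return order[:k]


def topk_forward_kl(teacher_logits, student_logits, k):
    """Forward KL(teacher || student), summed only over the teacher's top-k.

    Assumption: probabilities are NOT renormalized over the top-k set.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    top = teacher_top_k(p, k)
    return sum(p[i] * (math.log(p[i]) - math.log(q[i])) for i in top)


def off_support_mass(teacher_logits, student_logits, k):
    """Hypothetical diagnostic: student probability outside the teacher's top-k."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    top = set(teacher_top_k(p, k))
    return sum(q[i] for i in range(len(q)) if i not in top)


teacher = [2.0, 1.0, 0.0, -1.0]
student = [-1.0, 0.0, 1.0, 2.0]  # student prefers the tokens the teacher ranks last

print(topk_forward_kl(teacher, student, k=2))
print(off_support_mass(teacher, student, k=2))
```

In this toy case the student concentrates most of its probability on tokens outside the teacher's top-2, which is the kind of behavior a constraint on top of the truncated KL would aim to suppress.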

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper provides targeted and practical improvements for enhancing tool-calling capability, offering insights that could be useful for others working in this area. The performance gains are solid and well-demonstrated through experiments.

Weaknesses

* The framework is sound and the empirical results are solid; however, the methodological contribution is not particularly significant.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The paper is generally well-written and well motivated.
- Experimental results are promising and show strong generalization compared to baselines.
- The method seems simple and effective at mitigating the overfitting problem, especially when compared to the conventional approach of SFT+RL.
- The results in Table 3 on closing the performance gap with much stronger models are quite interesting.
- I also appreciate the additional theoretical explanation for their Top-K truncation with FK

Weaknesses

- There is no direct comparison with existing RL-based methods. It is not clear how the proposed reward differs from those proposed in related prior work, for example the one in [1]. In general, more comparison with existing RL rewards would be nice. (see more in questions)

[1] Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D Manning. Synthetic data generation and multi-step reinforcement learning for reasoning and tool use.
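
To make the contrast between a binary reward and a graded, similarity-based reward concrete, here is a minimal Python sketch that scores a single predicted function call against a reference call. The scoring rule (fraction of matching arguments when the function name matches), the dict representation of a call, and all names are hypothetical choices made for illustration; this is not the Sim-RL reward defined in the paper.

```python
def binary_reward(pred, gold):
    """1.0 only when the predicted call matches the reference exactly."""
    return 1.0 if pred == gold else 0.0


def similarity_reward(pred, gold):
    """Hypothetical graded reward in [0, 1].

    Assumption: a call is a dict with a "name" string and an "arguments" dict.
    A wrong function name scores 0; otherwise the score is the fraction of
    argument keys (union of both calls) whose values agree.
    """
    if pred.get("name") != gold.get("name"):
        return 0.0
    pred_args = pred.get("arguments", {})
    gold_args = gold.get("arguments", {})
    keys = set(pred_args) | set(gold_args)
    if not keys:
        return 1.0  # same name, no arguments on either side
    matched = sum(
        1 for key in keys
        if key in pred_args and key in gold_args
        and pred_args[key] == gold_args[key]
    )
    return matched / len(keys)


gold = {"name": "get_weather",
        "arguments": {"city": "Paris", "unit": "celsius"}}
near_miss = {"name": "get_weather",
             "arguments": {"city": "Paris", "unit": "fahrenheit"}}

print(binary_reward(near_miss, gold))      # 0.0
print(similarity_reward(near_miss, gold))  # 0.5
```

A binary reward treats the near miss the same as a completely wrong call, while a graded reward still gives the policy a signal to improve on.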

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper is relatively well-written and easy to follow.
- The performance gains at the 0.6B model scale are consistent over most metrics in the two evaluation benchmarks, and the performance gap with larger models is significantly reduced.
- Besides the empirical results, the paper includes a discussion comparing KD+RL and SFT+RL that might be insightful for future work.
- The paper includes an analysis comparing different KD methods.

Weaknesses

- Sim-RL looks highly dependent on the format of the generated function calls, which the authors define based on the Qwen tool calling template. It might be important to show the generalizability of this method to alternative formats.
- STAR requires RL training on the teacher model, which introduces additional non-trivial compute cost compared to SFT+RL.
- More analysis studies could improve the persuasiveness of the paper in showing the advantages of CKD over SFT. For example, a comparison

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications