Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Yixin Chen, Ying Xiong, Shangyu Wu, Yufei Cui, Xue Liu, Nan Guan, Chun Jason Xue

TL;DR
This paper introduces a behavior-aligned retriever (BAR) that improves tool-augmented LLMs by providing behaviorally consistent demonstrations, reducing unnecessary API calls, and lowering costs without sacrificing task performance.
Contribution
The paper presents a novel BAR trained with contrastive learning to ensure behaviorally consistent demonstrations, addressing limitations of existing fine-tuning and prompting methods.
Findings
Significantly reduces erroneous function calls
Maintains high task performance
Offers a cost-effective solution for tool-augmented LLMs
Abstract
Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we train a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of…
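The abstract does not spell out the dual-negative contrastive loss, so the following is only a minimal sketch of what such a loss could look like, not the authors' formulation. It assumes an InfoNCE-style objective over cosine similarities with two hypothetical negative groups: `behavior_negs` (queries that look similar but need the opposite calling behavior) and `other_negs` (unrelated queries). The names `dual_negative_loss`, `behavior_negs`, `other_negs`, and the way `alpha` and `tau` are used here are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_negative_loss(anchor, positive, behavior_negs, other_negs,
                       tau=0.1, alpha=0.5):
    """InfoNCE-style loss with two weighted negative groups.

    Assumption: alpha balances the two negative groups and tau is a
    softmax temperature. The paper's actual loss may differ.
    """
    pos = math.exp(cosine(anchor, positive) / tau)
    # Negatives with the opposite function-calling behavior (hypothetical group).
    beh = sum(math.exp(cosine(anchor, n) / tau) for n in behavior_negs)
    # Remaining, unrelated negatives (hypothetical group).
    oth = sum(math.exp(cosine(anchor, n) / tau) for n in other_negs)
    return -math.log(pos / (pos + alpha * beh + (1.0 - alpha) * oth))

# Toy 2-D embeddings: the anchor sits next to its behavior-consistent positive.
anchor = [1.0, 0.0]
aligned = dual_negative_loss(anchor, [1.0, 0.0], [[0.0, 1.0]], [[-1.0, 0.0]])
# Same anchor, but the opposite-behavior example is now the closest one.
misaligned = dual_negative_loss(anchor, [0.0, 1.0], [[1.0, 0.0]], [[-1.0, 0.0]])
```

In this toy setup the loss is small when the behavior-consistent example is the anchor's nearest neighbor and large when an opposite-behavior example is closer, which is the property a behavior-aligned retriever would be trained to exploit.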
Peer Reviews
Decision · Submitted to ICLR 2026
1. Clear problem formulation and motivation. The paper convincingly shows that semantic-only retrieval can mislead tool decisions and empirically correlates behavior-consistency with downstream quality. 2. Methodological simplicity with practical impact. BAR is a lightweight retriever that can be dropped into existing RAG-for-tools pipelines without re-training the backbone LLM; the contrastive formulation is clean and reproducible (hyperparameters, alpha, tau, batch/epochs are provided). 3.
1. Evaluation reliance on LLM judges / proxies. Parts of Harmlessness/Helpfulness rely on GPT-4-style judgment prompts or indirect metrics; robustness to judge variance and prompt framing is not analyzed. Statistical significance and inter-rater reliability are missing. 2. Cost & latency not quantified. The core pitch is "reducing unnecessary API calls -> efficiency", but there is no end-to-end accounting of latency, monetary cost, or energy savings across workloads. Gains are reported as rate
1. The paper identifies an underexplored but important issue in retrieval-augmented function-calling that semantically close examples can have opposite tool behaviors. 2. The proposed dual-negative contrastive loss to enforce behavioral consistency is a solid and interesting technical contribution. 3. The experiments show consistent gains across multiple datasets and models, indicating robustness and general applicability.
1. Clarity of Figure 1 and Motivation. The description around Figure 1 is currently confusing. The text says that “Many queries that are lexically or topically close require opposite tool-invocation behavior. Figure 1 illustrates such a clash…”, but the figure only shows a general retrieval pipeline rather than the clash itself. The authors should explicitly annotate Figure 1 (or provide an additional subfigure) to illustrate a concrete example of the mismatch: e.g., “What’s the weather in Paris?
1. Augmenting tool-use agents with a plug-in retriever is elegant and efficient. The LLMs can be improved without the cost- and time-intensive post-training. Compared with LLM post-training, fine-tuning a smaller retriever is both cheaper and faster. 2. This paper is well-organized with clear formulation and writing structure, which makes it easy for the reader to understand.
1. In-context example selection is a well-known technique. Although this paper adopts it to augment tool-use agents and achieves good performance, novelty concerns remain. Could the authors explain the main technical contributions? 2. It seems that a customized example corpus is built first so that the retriever can retrieve from it. However, the details of building this corpus are unclear. I suggest that the authors provide more explanation. 3. The author only
Taxonomy
Topics: Software Engineering Research · Service-Oriented Architecture and Web Services · Advanced Text Analysis Techniques
