Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Ashutosh Hathidara; Julien Yu; Sebastian Schreiber

arXiv:2507.03336·cs.AI·April 14, 2026

TL;DR

This paper presents DiaFORGE, a three-stage pipeline for enterprise tool-calling LLMs: it synthesizes disambiguation dialogues, fine-tunes open-source models on them, and evaluates the results in a live agentic loop, raising tool-invocation success by up to 49 percentage points.

Contribution

The paper introduces DiaFORGE, a disambiguation-centric fine-tuning framework, together with a benchmark (DiaBENCH) and a dataset for enterprise tool invocation by LLMs.

Findings

01. Models trained with DiaFORGE improve tool-invocation success by up to 49 percentage points.

02. The framework enhances the realism and reliability of enterprise tool-calling LLMs.

03. A new dataset of 5,000 enterprise API dialogues supports further research.

Abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B to 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by…
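The two failure modes the abstract names (near-duplicate tools competing for one intent, and required arguments left unspecified) can be illustrated with a toy decision rule. The sketch below is not the paper's method: the `Tool` and `Decision` types, the `decide` function, the keyword-overlap scoring, and the two example tools are all hypothetical, chosen only to show when an assistant should ask a clarifying question instead of calling a tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool description: a name, required arguments, and intent keywords."""
    name: str
    required: tuple[str, ...]
    keywords: frozenset[str]


@dataclass
class Decision:
    action: str  # "ask_tool", "ask_args", or "call"
    tool: str | None = None
    missing: list[str] = field(default_factory=list)
    candidates: list[str] = field(default_factory=list)


def decide(utterance: str, known_args: dict[str, str], tools: list[Tool]) -> Decision:
    """Toy rule: clarify the tool first, then the arguments, and only then call."""
    words = set(utterance.lower().split())
    # Assumption: keyword overlap stands in for whatever intent matching a real system uses.
    scores = {t.name: len(words & t.keywords) for t in tools}
    best = max(scores.values(), default=0)
    top = [t for t in tools if best > 0 and scores[t.name] == best]

    # Zero or several equally good tools: ask the user which one they mean.
    if len(top) != 1:
        names = [t.name for t in top] or [t.name for t in tools]
        return Decision("ask_tool", candidates=names)

    tool = top[0]
    # Exactly one tool matches, but some required arguments are still unknown.
    missing = [p for p in tool.required if p not in known_args]
    if missing:
        return Decision("ask_args", tool=tool.name, missing=missing)

    return Decision("call", tool=tool.name)


# Two deliberately similar, made-up enterprise tools.
TOOLS = [
    Tool("create_purchase_order", ("supplier_id", "amount"),
         frozenset({"create", "purchase", "order"})),
    Tool("create_purchase_requisition", ("requester_id", "amount"),
         frozenset({"create", "purchase", "requisition"})),
]

if __name__ == "__main__":
    print(decide("create a purchase for 500", {}, TOOLS))
    print(decide("create a purchase order", {"amount": "500"}, TOOLS))
```

In this sketch the first request matches both tools equally, so the rule asks which tool is meant; the second request selects one tool but still lacks `supplier_id`, so the rule asks for the missing argument. A fine-tuned model would make these choices from learned behavior, not from a keyword table.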

Code & Models

Datasets

SAP/diaforge-utc-r-0725
