Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

TL;DR
This paper presents DiaFORGE, a three-stage pipeline that improves enterprise tool-calling LLMs by synthesizing disambiguation dialogues, fine-tuning models on them, and evaluating the results in a live agentic loop. Reported tool-invocation success rises by up to 49 percentage points.
Contribution
It introduces DiaFORGE, a disambiguation-centric fine-tuning framework, together with a dynamic benchmark (DiaBENCH) and a dataset for enterprise tool invocation in LLMs.
Findings
Models trained with DiaFORGE improve tool-invocation success by up to 49 percentage points.
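The dynamic evaluation redeploys each model in a live agentic loop and reports end-to-end goal completion alongside static metrics, giving a more realistic picture of enterprise tool-calling reliability.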
A new dataset of 5000 enterprise API dialogues supports further research.
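Gains in the findings above are stated in percentage points (pp), which is an absolute difference between two success rates and not a relative improvement. A minimal sketch of the distinction follows; the function names and the two rates are invented for illustration and are not results from the paper.

```python
def pp_gain(baseline_rate: float, new_rate: float) -> float:
    """Absolute gain in percentage points between two success rates in [0, 1]."""
    return (new_rate - baseline_rate) * 100.0


def relative_gain(baseline_rate: float, new_rate: float) -> float:
    """Relative gain in percent, measured against the baseline rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical success rates, NOT taken from the paper.
baseline, tuned = 0.50, 0.75

print(f"absolute gain: {pp_gain(baseline, tuned):.1f} pp")
print(f"relative gain: {relative_gain(baseline, tuned):.1f} %")
```

With these made-up rates the absolute gain is 25 pp while the relative gain is 50 %, so the two figures should not be read interchangeably.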
Abstract
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models, ranging from 3B to 70B parameters, with reasoning traces, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by…
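The two failure modes named in the abstract (near-duplicate tools and underspecified arguments) suggest a simple decision rule for an assistant turn: ask before calling. The sketch below is only an illustration of that idea, not the authors' implementation. The `Tool` schema, the `next_action` function, and the example tool names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tool:
    """Hypothetical tool schema: a name, a description, and required argument names."""
    name: str
    description: str
    required: list = field(default_factory=list)


def next_action(candidates: list, chosen: Optional[str], args: dict) -> dict:
    """Return either a clarifying question or a tool call.

    candidates: near-duplicate tools that plausibly match the user's request.
    chosen:     tool name the user has confirmed so far, or None if still ambiguous.
    args:       argument values collected from the dialogue so far.
    """
    if not candidates:
        raise ValueError("no candidate tools")
    by_name = {t.name: t for t in candidates}

    # Step 1: tool disambiguation. Several plausible tools and no confirmed choice: ask.
    if chosen not in by_name:
        if len(candidates) > 1:
            options = ", ".join(sorted(by_name))
            return {"type": "ask", "question": f"Which operation do you mean: {options}?"}
        chosen = candidates[0].name

    # Step 2: argument completion. Ask for any required argument still missing.
    tool = by_name[chosen]
    missing = [a for a in tool.required if a not in args]
    if missing:
        return {"type": "ask", "question": f"Please provide: {', '.join(missing)}."}

    # Step 3: everything is resolved, so emit the call.
    return {"type": "call", "tool": tool.name,
            "arguments": {a: args[a] for a in tool.required}}


# Hypothetical near-duplicate enterprise tools.
tools = [
    Tool("create_purchase_order", "Create a new purchase order", ["supplier_id", "amount"]),
    Tool("create_purchase_requisition", "Create a purchase requisition", ["requester_id", "amount"]),
]
step1 = next_action(tools, None, {})
step2 = next_action(tools, "create_purchase_order", {"amount": "100"})
step3 = next_action(tools, "create_purchase_order", {"amount": "100", "supplier_id": "S1"})
```

Here `step1` asks which tool is meant, `step2` asks for the missing `supplier_id`, and `step3` emits the call. A fine-tuned model would learn this behaviour from dialogues instead of following hard-coded rules; the rules only make the target behaviour explicit.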
