TL;DR
SynAE is a framework for evaluating the quality of synthetic datasets used to test tool-calling agents. It measures, across multiple metrics, how well those datasets replicate the characteristics of real data.
Contribution
This work introduces SynAE, a multi-metric framework for evaluating the quality of synthetic data used to assess tool-calling agents, addressing the limitations of single-metric approaches.
Findings
SynAE effectively detects variations in data validity, fidelity, and diversity.
No single metric suffices to fully characterize synthetic data quality.
Multi-axis evaluation provides a more comprehensive assessment of synthetic data.
Abstract
Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data…
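To make the three axes concrete, here is a minimal Python sketch of a multi-axis comparison between real and synthetic traces. It is an illustration under stated assumptions, not SynAE's implementation: the abstract does not specify its metrics, so the `Turn` and `ToolCall` types and the `validity`, `fidelity`, and `diversity` functions below are hypothetical stand-ins (a known-tool check, one minus the total variation distance between tool-usage distributions, and normalized Shannon entropy of tool usage).

```python
import math
from collections import Counter
from dataclasses import dataclass, field


# Hypothetical trace schema: the abstract only says traces contain input
# commands, agent responses, and associated tool calls.
@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Turn:
    user_input: str
    agent_response: str
    tool_calls: list = field(default_factory=list)


def tool_names(traces):
    """Flatten a list of traces (each a list of Turn) into tool-call names."""
    return [call.name for trace in traces for turn in trace for call in turn.tool_calls]


def validity(traces, known_tools):
    """Toy validity: fraction of tool calls that name a known tool."""
    names = tool_names(traces)
    if not names:
        return 1.0
    return sum(name in known_tools for name in names) / len(names)


def fidelity(real, synthetic):
    """Toy fidelity: 1 minus total variation distance of tool-usage frequencies."""
    p, q = Counter(tool_names(real)), Counter(tool_names(synthetic))
    n_p, n_q = sum(p.values()), sum(q.values())
    if n_p == 0 or n_q == 0:
        return 0.0
    tv = 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))
    return 1.0 - tv


def diversity(traces):
    """Toy diversity: Shannon entropy of tool usage, normalized to [0, 1]."""
    counts = Counter(tool_names(traces))
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def evaluate(real, synthetic, known_tools):
    """Report all three axes side by side instead of a single score."""
    return {
        "validity": validity(synthetic, known_tools),
        "fidelity": fidelity(real, synthetic),
        "diversity": diversity(synthetic),
    }
```

In this sketch, a synthetic set that calls only one valid tool scores perfectly on validity while scoring zero on diversity, which is the kind of gap a single-metric evaluation would miss.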
