Generate, Evaluate, Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges

Hyo Jin Do; Zahra Ashktorab; Jasmina Gajcin; Erik Miehling; Martín Santillán Cooper; Qian Pan; Elizabeth M. Daly; Werner Geyer

arXiv:2511.04478 · cs.HC · November 7, 2025

Open Access

TL;DR

This paper introduces a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, enabling efficient, customizable, and transparent creation of test cases to improve evaluation and alignment of language models.

Contribution

The paper presents a novel tool that combines synthetic data generation with user-controlled customization and transparency, enhancing the effectiveness of LLM-based evaluation methods.

Findings

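01. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, because it let them rapidly generate diverse synthetic data without additional workload.

02. Synthetic data was as effective as hand-crafted data for evaluation.

03. The approach improves scalability and efficiency in model assessment.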

Abstract

The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but its effectiveness is often limited by the scarcity of diverse, representative data for refining criteria. We present a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, empowering users to create tailored and challenging test cases with configurable domains, personas, lengths, and desired outcomes, including borderline cases. The tool also supports AI-assisted inline editing of existing test cases. To enhance transparency and interpretability, it reveals the prompts and explanations behind each generation. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, as it allowed them to rapidly generate diverse synthetic data without additional workload. The generated synthetic data proved as effective as hand-crafted data for both…
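To make the configurable generation described in the abstract more concrete, here is a minimal sketch of how a test-case request with a domain, persona, length, and desired outcome might be turned into a generation prompt that is shown to the user. Every name here (`GenerationConfig`, `build_prompt`, `ALLOWED_OUTCOMES`, the field names, and the prompt wording) is a hypothetical illustration; this page does not describe the tool's actual API or prompts, and no model is called.

```python
from dataclasses import dataclass

# Hypothetical outcome labels; the abstract only says that desired outcomes
# are configurable and that borderline cases are supported.
ALLOWED_OUTCOMES = ("pass", "fail", "borderline")


@dataclass(frozen=True)
class GenerationConfig:
    """Assumed shape of one synthetic test-case request."""
    criterion: str        # the judge criterion being refined
    domain: str           # e.g. "customer support"
    persona: str          # e.g. "frustrated customer"
    length: str           # e.g. "short", "medium", "long"
    desired_outcome: str  # one of ALLOWED_OUTCOMES

    def __post_init__(self):
        # Reject outcome labels outside the assumed set.
        if self.desired_outcome not in ALLOWED_OUTCOMES:
            raise ValueError(
                f"desired_outcome must be one of {ALLOWED_OUTCOMES}, "
                f"got {self.desired_outcome!r}"
            )


def build_prompt(config: GenerationConfig) -> str:
    """Assemble the generation prompt as plain text.

    Returning the prompt (rather than hiding it) mirrors the transparency
    goal: the user can read exactly what would be sent to the generator.
    """
    lines = [
        "Write one test case for evaluating the criterion below.",
        f"Criterion: {config.criterion}",
        f"Domain: {config.domain}",
        f"Persona: {config.persona}",
        f"Length: {config.length}",
        f"Desired outcome under the criterion: {config.desired_outcome}",
    ]
    if config.desired_outcome == "borderline":
        # Borderline cases should be hard for a judge to classify.
        lines.append(
            "Make the case ambiguous: it should neither clearly satisfy "
            "nor clearly violate the criterion."
        )
    lines.append("Also give a one-sentence explanation of your choices.")
    return "\n".join(lines)


if __name__ == "__main__":
    config = GenerationConfig(
        criterion="The response is concise.",
        domain="customer support",
        persona="frustrated customer",
        length="short",
        desired_outcome="borderline",
    )
    print(build_prompt(config))
```

Keeping the prompt as an inspectable string is one simple way to support the transparency the abstract describes; a real system would additionally send it to a generator model and return the explanation alongside the generated case.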

Taxonomy

Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Topic Modeling