TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains

Wanying Wang; Zeyu Ma; Xuhong Wang; Yangchun Zhang; Pengfei Liu; Mingang Chen

arXiv:2410.11507 · cs.AI · September 26, 2025

TL;DR

TestAgent introduces an automated framework for dynamic, multi-turn evaluation of LLMs in specialized domains, overcoming static dataset limitations and providing deeper insights into model behavior.

Contribution

The paper presents a scalable approach that combines retrieval-augmented question generation with reinforcement-learning-guided interactions to evaluate LLMs in vertical domains.
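To make the retrieval-augmented question generation idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy keyword-overlap retriever over user-provided passages and a plain prompt template. The names `tokenize`, `retrieve`, and `build_question_prompt` are hypothetical, and the step that sends the prompt to an LLM is omitted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(topic, passages, k=2):
    """Toy retriever (an assumption): rank passages by tokens shared with the topic."""
    topic_tokens = Counter(tokenize(topic))
    scored = []
    for idx, passage in enumerate(passages):
        # Multiset intersection counts how many topic tokens the passage contains.
        overlap = sum((Counter(tokenize(passage)) & topic_tokens).values())
        # -idx keeps earlier passages first when overlaps tie.
        scored.append((overlap, -idx, passage))
    scored.sort(reverse=True)
    # Keep the top-k passages that share at least one token with the topic.
    return [passage for overlap, _, passage in scored[:k] if overlap > 0]


def build_question_prompt(topic, passages, k=2):
    """Build a question-generation prompt grounded in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(topic, passages, k))
    return (
        f"Using only the context below, write one exam-style question about {topic}.\n"
        f"Context:\n{context}"
    )
```

In this sketch the domain knowledge enters only through `passages`, so switching domains means supplying a different document collection rather than writing new questions by hand.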

Findings

01. Enables efficient cross-domain benchmark creation

02. Provides deeper insights into model stability and knowledge boundaries

03. Demonstrates effectiveness in medical, legal, and governmental domains

Abstract

As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, the evaluation of their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets, which present two key limitations: (i) manual data construction is costly and must be repeated for each new domain, and (ii) static single-turn evaluations are misaligned with the dynamic multi-turn interactions in real-world applications, limiting the assessment of professionalism and stability. To address these, we propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains. TestAgent leverages retrieval-augmented generation to create domain-specific questions from user-provided knowledge sources, combined with a two-stage criteria…

Taxonomy

Topics: Innovation and Knowledge Management · Organizational Management and Leadership · Complex Systems and Decision Making