TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains
Wanying Wang, Zeyu Ma, Xuhong Wang, Yangchun Zhang, Pengfei Liu, Mingang Chen

TL;DR
TestAgent introduces an automated framework for dynamic, multi-turn evaluation of LLMs in specialized domains, overcoming static dataset limitations and providing deeper insights into model behavior.
Contribution
The paper presents a scalable approach to vertical-domain evaluation that combines retrieval-augmented question generation with reinforcement-learning-guided multi-turn interaction.
Findings
Enables efficient cross-domain benchmark creation
Provides deeper insights into model stability and knowledge boundaries
Demonstrates effectiveness in medical, legal, and governmental domains
Abstract
As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, the evaluation of their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets, which present two key limitations: (i) manual data construction is costly and must be repeated for each new domain, and (ii) static single-turn evaluations are misaligned with the dynamic multi-turn interactions in real-world applications, limiting the assessment of professionalism and stability. To address these, we propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains. TestAgent leverages retrieval-augmented generation to create domain-specific questions from user-provided knowledge sources, combined with a two-stage criteria…
Taxonomy
Topics: Innovation and Knowledge Management · Organizational Management and Leadership · Complex Systems and Decision Making
