One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Chengyu Shen; Yanheng Hou; Minghui Pan; Runming He; Zhen Hao Wong; Meiyi Qiang; Zhou Liu; Hao Liang; Peichao Lai; Zeang Sheng; Wentao Zhang

arXiv:2603.09821·cs.CL·March 11, 2026

Open Access

TL;DR

One-Eval is an agentic system that automates LLM evaluation and makes it traceable by converting natural-language requests into customizable, executable workflows with integrated benchmarks, metrics, and human review.

Contribution

The paper introduces a system that turns natural-language requests into automated LLM evaluation workflows, improving reproducibility and traceability in industrial settings.

Findings

01. Supports end-to-end evaluation with minimal effort
02. Enables reproducible and customizable evaluation workflows
03. Includes human-in-the-loop checkpoints for review

Abstract

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop…
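The three stages named in the abstract can be pictured as a small pipeline that logs every intermediate result. The following is only a toy sketch under stated assumptions, not One-Eval's actual code or API: every name in it (`plan_benchmarks`, `resolve_benchmark`, `exact_match`, `run_evaluation`, `Trace`, the toy benchmark names and data) is hypothetical, the planner is a plain keyword lookup rather than an agent, and exact match stands in for task-aware metric selection.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Ordered log of what each stage produced (hypothetical structure)."""
    steps: list = field(default_factory=list)

    def record(self, stage: str, output) -> None:
        self.steps.append({"stage": stage, "output": output})


def plan_benchmarks(request: str) -> list[str]:
    # Stand-in for intent structuring and benchmark planning:
    # a keyword lookup over an invented catalogue.
    catalogue = {"math": ["toy_arithmetic"], "reasoning": ["toy_logic"]}
    lowered = request.lower()
    return [name for key, names in catalogue.items() if key in lowered for name in names]


def resolve_benchmark(name: str) -> list[dict]:
    # Stand-in for dataset acquisition and schema normalization:
    # differently named fields are mapped onto {"input", "target"}.
    raw = {
        "toy_arithmetic": [{"question": "2+2", "answer": "4"},
                           {"question": "3*3", "answer": "9"}],
        "toy_logic": [{"prompt": "not True", "label": "False"}],
    }[name]
    field_map = {"question": "input", "prompt": "input",
                 "answer": "target", "label": "target"}
    return [{field_map[k]: v for k, v in row.items()} for row in raw]


def exact_match(model: Callable[[str], str], rows: list[dict]) -> float:
    # Fraction of rows where the model output equals the target exactly.
    hits = sum(model(row["input"]).strip() == row["target"] for row in rows)
    return hits / len(rows)


def run_evaluation(request: str, model: Callable[[str], str],
                   approve: Callable[[list], bool] = lambda plan: True):
    trace = Trace()
    plan = plan_benchmarks(request)
    trace.record("plan", plan)
    # Human-in-the-loop checkpoint: a reviewer may reject the plan.
    if not approve(plan):
        trace.record("review", "rejected")
        return {}, trace
    trace.record("review", "approved")
    scores = {}
    for name in plan:
        rows = resolve_benchmark(name)
        trace.record(f"resolve:{name}", len(rows))
        scores[name] = exact_match(model, rows)
        trace.record(f"score:{name}", scores[name])
    return scores, trace


# Toy "model": a lookup table that gets one of the two arithmetic items wrong.
answers = {"2+2": "4", "3*3": "8"}
toy_model = lambda prompt: answers.get(prompt, "")

scores, trace = run_evaluation("Evaluate math skills", toy_model)
print(scores)
for step in trace.steps:
    print(step)
```

Returning the trace alongside the scores is the point of the sketch: each score can be followed back to the plan, the review decision, and the normalized dataset that produced it.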

Taxonomy

Topics: Topic Modeling · Scientific Computing and Data Management · Natural Language Processing Techniques