One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

TL;DR
One-Eval is an agentic system that makes LLM evaluation automated and traceable by converting natural-language requests into customizable, executable workflows with integrated benchmarks, metrics, and human review.
Contribution
It introduces a system that turns natural-language requests into automated LLM evaluation workflows, improving reproducibility and traceability in industrial settings.
Findings
Supports end-to-end evaluation with minimal effort
Enables reproducible and customizable evaluation workflows
Includes human-in-the-loop checkpoints for review
Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop…
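To make the schema-normalization step mentioned in the abstract concrete, here is a minimal Python sketch of what mapping heterogeneous benchmark records onto one shared schema could look like. It is an illustration only and not One-Eval's implementation: the benchmark names (`bench_a`, `bench_b`), the shared field names (`prompt`, `reference`), and the `FIELD_MAPS` and `normalize` names are all hypothetical.

```python
from typing import Any

# Hypothetical per-benchmark field mappings (raw field -> shared field).
# These names are assumptions for illustration, not taken from One-Eval.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "bench_a": {"question": "prompt", "answer": "reference"},
    "bench_b": {"input": "prompt", "target": "reference"},
}


def normalize(benchmark: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename a raw record's fields to the shared {prompt, reference} schema."""
    mapping = FIELD_MAPS[benchmark]
    # Fail loudly if the raw record lacks a field the mapping expects,
    # so a bad mapping is caught before any evaluation runs.
    missing = [src for src in mapping if src not in record]
    if missing:
        raise KeyError(f"{benchmark}: missing fields {missing}")
    return {dst: record[src] for src, dst in mapping.items()}


if __name__ == "__main__":
    print(normalize("bench_a", {"question": "2+2?", "answer": "4"}))
    print(normalize("bench_b", {"input": "Capital of France?", "target": "Paris"}))
```

Once every record shares the same field names, a single evaluation loop can consume any benchmark without per-dataset glue code, which is the kind of manual effort the abstract describes.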
Taxonomy
Topics: Topic Modeling · Scientific Computing and Data Management · Natural Language Processing Techniques
