Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

Siyuan Wang; Zhuohan Long; Zhihao Fan; Zhongyu Wei; Xuanjing Huang

arXiv:2402.11443 · cs.CL · February 20, 2024 · 2 citations

TL;DR

This paper introduces a multi-agent framework for dynamically evaluating large language models by creating evolving benchmark instances, revealing more nuanced performance insights and limitations.

Contribution

The framework uses a multi-agent system to reframe existing benchmark instances into new, evolving ones, enabling scalable, robust, and fine-grained evaluation of LLM capabilities.

Findings

01 Most LLMs perform worse on the evolved instances than on the original benchmarks.

02 The evolved benchmarks reveal wider performance gaps between models and across tasks.

03 The framework provides a more accurate and more detailed assessment of each model.

Abstract

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more…
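The comparison the abstract describes, scoring a model on original instances and again on their evolved counterparts, can be sketched as follows. This is only an illustration under stated assumptions: `Instance`, `add_context_noise`, `accuracy`, and `toy_model` are hypothetical names, the noise step is a hand-written rule standing in for the paper's LLM-agent reframing, and the toy model is deliberately brittle so that the decline is visible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    context: str
    question: str
    answer: str


def add_context_noise(inst: Instance, distractor: str) -> Instance:
    """Hypothetical reframing step: append an irrelevant sentence to the context.

    The question and gold answer are left unchanged, so a robust model
    should score the same on the evolved instance.
    """
    return Instance(inst.context + " " + distractor, inst.question, inst.answer)


def accuracy(model: Callable[[str, str], str], data: List[Instance]) -> float:
    """Exact-match accuracy of `model` over `data` (case-insensitive)."""
    if not data:
        return 0.0
    correct = sum(
        model(i.context, i.question).strip().lower() == i.answer.strip().lower()
        for i in data
    )
    return correct / len(data)


def toy_model(context: str, question: str) -> str:
    """Stand-in for an LLM (assumption): answers with the last word of the context."""
    return context.rstrip(".").split()[-1]


original = [
    Instance("The capital of France is Paris.", "What is the capital of France?", "Paris"),
    Instance("Water boils at 100 degrees Celsius.", "Which temperature scale is used?", "Celsius"),
]
evolved = [add_context_noise(i, "The weather was mild.") for i in original]

orig_acc = accuracy(toy_model, original)
evol_acc = accuracy(toy_model, evolved)
decline = orig_acc - evol_acc
print(f"original={orig_acc:.2f} evolved={evol_acc:.2f} decline={decline:.2f}")
```

Keeping the gold answer fixed while perturbing only the context is what makes the accuracy gap attributable to robustness rather than to a change in task difficulty.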

Code & Models

Repositories

nanshineloong/self-evolving-benchmark (Official)

Taxonomy

Topics: Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation