StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

Boxi Cao; Mengjie Ren; Hongyu Lin; Xianpei Han; Feng Zhang; Junfeng Zhan; Le Sun

arXiv:2408.03281·cs.CL·August 8, 2024

TL;DR

StructEval introduces a comprehensive evaluation framework for large language models that assesses multiple cognitive levels and concepts, improving reliability and robustness over traditional single-item tests.

Contribution

The paper presents StructEval, a structured evaluation framework that assesses each atomic test objective across multiple cognitive levels and critical concepts, which reduces the interference of potential biases and makes results more robust to data contamination.

Findings

01. StructEval provides more reliable evaluation results.
02. It demonstrates robustness against data contamination.
03. It offers insights for designing future evaluation protocols.

Abstract

Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and…
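
The contrast between single-item and structured assessment can be illustrated with a minimal sketch. It is not the authors' implementation: the `ItemResult` type, the `structured_scores` function, the objective names, and the cognitive-level labels are all hypothetical, and averaging item correctness per objective is an assumed aggregation chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    objective: str  # atomic test objective the item was derived from (hypothetical names)
    level: str      # cognitive-level label (illustrative, not taken from the paper)
    correct: bool   # whether the model answered this item correctly


def structured_scores(results):
    """Return per-objective accuracy, averaged over every item of that objective.

    Assumption: a plain mean over items; the paper may aggregate differently.
    """
    by_objective = defaultdict(list)
    for r in results:
        by_objective[r.objective].append(r.correct)
    return {obj: sum(flags) / len(flags) for obj, flags in by_objective.items()}


# Toy data: the model gets the recall-style item right for both objectives,
# but fails the deeper items for the first one.
results = [
    ItemResult("photosynthesis", "remember", True),
    ItemResult("photosynthesis", "understand", True),
    ItemResult("photosynthesis", "apply", False),
    ItemResult("photosynthesis", "analyze", False),
    ItemResult("newton_second_law", "remember", True),
    ItemResult("newton_second_law", "apply", True),
]
scores = structured_scores(results)
```

In this toy data a single recall-style item per objective would mark both objectives as passed, while the per-objective average separates an objective answered consistently from one answered correctly only at the shallowest level.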

Code & Models

Repositories

c-box/structeval
Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling