The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Seungone Kim; Juyoung Suk; Ji Yong Cho; Shayne Longpre; Chaeeun Kim; Dongkeun Yoon; Guijin Son; Yejin Cho; Sheikh Shafayat; Jinheon Baek; Sue Hyun Park; Hyeonbin Hwang; Jinkyung Jo; Hyowon Cho; Haebin Shin; Seongyun Lee; Hanseok Oh; Noah Lee; Namgyu Ho; Se June Joo; Miyoung Ko; Yoonjoo Lee; Hyungjoo Chae; Jamin Shin; Joel Jang; Seonghyeon Ye; Bill Yuchen Lin; Sean Welleck; Graham Neubig; Moontae Lee; Kyungjae Lee; Minjoon Seo

arXiv:2406.05761·cs.CL·March 26, 2025

TL;DR

The BiGGen Bench provides a detailed, capability-specific evaluation framework for language models, addressing limitations of existing benchmarks by using instance-specific criteria across diverse tasks.

Contribution

The paper introduces a principled benchmark that uses instance-specific evaluation criteria for comprehensive, fine-grained assessment of language models' capabilities.

Findings

01. Evaluated 103 frontier language models on the benchmark, scored by five evaluator LMs.

02. Demonstrated the effectiveness of instance-specific evaluation criteria.

03. Released the code and data publicly for reproducibility.

Abstract

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code,…
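To make the idea of instance-specific evaluation criteria concrete, here is a minimal sketch of how each benchmark item might carry its own criterion instead of sharing one abstract rubric such as "helpfulness". Everything in it is an assumption for illustration: the `Instance` schema, the function names, the prompt wording, and the 1 to 5 scale are hypothetical and are not taken from the BiGGen Bench code or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Instance:
    """One benchmark item with its own criterion (hypothetical schema)."""
    capability: str  # one of the benchmark's capabilities (placeholder label)
    task: str        # task name within that capability (placeholder label)
    prompt: str      # the input given to the model under evaluation
    criterion: str   # what a good answer to THIS prompt must do


def build_judge_prompt(instance: Instance, response: str) -> str:
    """Format the text an evaluator LM could be asked to score.

    The wording and the 1-5 scale are assumptions, not the paper's template.
    """
    return (
        f"Instruction:\n{instance.prompt}\n\n"
        f"Response:\n{response}\n\n"
        f"Criterion for this instance:\n{instance.criterion}\n\n"
        "Give an integer score from 1 (poor) to 5 (excellent)."
    )


def mean_score_by_capability(scored):
    """Average per-instance scores within each capability.

    `scored` is an iterable of (Instance, score) pairs.
    """
    buckets = defaultdict(list)
    for instance, score in scored:
        buckets[instance.capability].append(score)
    return {cap: mean(scores) for cap, scores in buckets.items()}
```

In this sketch the evaluator LM call itself is left out: the judge prompt would be sent to whichever evaluator is in use, and the returned scores fed to `mean_score_by_capability` to get a per-capability breakdown rather than a single overall number.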

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus