Benchmarking Foundation Models with Language-Model-as-an-Examiner

Yushi Bai; Jiahao Ying; Yixin Cao; Xin Lv; Yuze He; Xiaozhi Wang; Jifan Yu; Kaisheng Zeng; Yijia Xiao; Haozhe Lyu; Jiayin Zhang; Juanzi Li; Lei Hou

arXiv:2306.04181·cs.CL·November 7, 2023·21 cites

Open Access

TL;DR

This paper introduces a benchmarking framework in which a language model acts as the examiner: it generates questions across diverse domains and evaluates the responses of foundation models, with the aim of assessing their understanding and language capabilities more reliably and comprehensively.

Contribution

The paper proposes a flexible, reference-free benchmarking framework using LMs as examiners, with strategies for diverse questioning, combined evaluation metrics, and decentralized peer review.

Findings

01 Effective in broad domain coverage and in-depth assessment.

02 Aligns closely with human evaluation results.

03 Addresses biases through peer-examination.

Abstract

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility, as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise…
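The examiner loop described in the abstract can be illustrated with a minimal sketch. It covers only what the visible text states: the examiner LM writes a question from a trigger topic, the candidate model answers, and the examiner scores the answer without a reference. Everything named here is an assumption made for illustration, not the authors' implementation: the `LM` callable type, `run_exam`, `ExamRecord`, the prompt wording, the 1 to 5 scale, and the toy models that stand in for real LMs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# Assumption: a model is any text-in, text-out callable.
LM = Callable[[str], str]


@dataclass
class ExamRecord:
    topic: str
    question: str
    answer: str
    score: float


def run_exam(examiner: LM, candidate: LM, topics: List[str]) -> List[ExamRecord]:
    """Ask one examiner-written question per topic and score each answer."""
    records = []
    for topic in topics:
        # The examiner formulates the question from the trigger topic.
        question = examiner(f"Write one question about: {topic}")
        # The model under test answers it.
        answer = candidate(question)
        # Reference-free scoring: the examiner sees only question and answer.
        # The prompt wording and the 1-5 scale are hypothetical.
        raw = examiner(
            "Rate this answer from 1 to 5.\n"
            f"Question: {question}\nAnswer: {answer}\nScore:"
        )
        records.append(ExamRecord(topic, question, answer, float(raw.strip())))
    return records


def toy_examiner(prompt: str) -> str:
    """Deterministic stand-in for an examiner LM (hypothetical)."""
    if prompt.startswith("Rate"):
        return "4"
    topic = prompt.split(": ", 1)[1]
    return f"What is {topic}?"


def toy_candidate(question: str) -> str:
    """Deterministic stand-in for the model under test (hypothetical)."""
    return f"An answer to: {question}"


records = run_exam(toy_examiner, toy_candidate, ["photosynthesis", "TCP"])
average_score = mean(r.score for r in records)
```

Because the examiner is just a callable, swapping in a different LM or a new list of trigger topics requires no change to the loop, which is the kind of extensibility the abstract refers to.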

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus