Benchmarking Foundation Models with Language-Model-as-an-Examiner
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou

TL;DR
This paper introduces a benchmarking framework in which a language model acts as the examiner: it generates questions across diverse domains and evaluates the answers, so that foundation models' understanding and language capabilities can be assessed more reliably and comprehensively.
Contribution
The paper proposes a flexible, reference-free benchmarking framework using LMs as examiners, with strategies for diverse questioning, combined evaluation metrics, and decentralized peer review.
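The peer-examination idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `ask`/`grade`/`answer` callables, the `peer_examine` helper, the toy word-count grader, and the plain mean used to combine examiners are all assumptions chosen for illustration. In practice each callable would wrap a call to a language model.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (not from the paper):
AskFn = Callable[[str], List[str]]     # topic -> questions posed by an examiner
GradeFn = Callable[[str, str], float]  # (question, answer) -> reference-free score
AnswerFn = Callable[[str], str]        # question -> a candidate model's answer


def peer_examine(
    examiners: Dict[str, Tuple[AskFn, GradeFn]],
    candidates: Dict[str, AnswerFn],
    topics: List[str],
) -> Dict[str, float]:
    """Score every candidate under every examiner and average the scores.

    Assumption: examiners are combined with an unweighted mean.
    """
    results: Dict[str, float] = {}
    for cand_name, answer in candidates.items():
        scores: List[float] = []
        for ask, grade in examiners.values():
            for topic in topics:
                for question in ask(topic):
                    # Each examiner grades answers to its own questions,
                    # without any reference answer.
                    scores.append(grade(question, answer(question)))
        results[cand_name] = mean(scores)
    return results


def make_toy_examiner(leniency: float) -> Tuple[AskFn, GradeFn]:
    """Toy stand-in for an examiner LM: two templated questions per topic,
    graded by answer length plus a per-examiner leniency offset."""

    def ask(topic: str) -> List[str]:
        return [f"What is {topic}?", f"Why does {topic} matter?"]

    def grade(question: str, answer: str) -> float:
        return len(answer.split()) + leniency

    return ask, grade


examiners = {"A": make_toy_examiner(0.0), "B": make_toy_examiner(1.0)}
candidates = {
    "short": lambda q: "yes",
    "long": lambda q: "it depends on context",
}
scores = peer_examine(examiners, candidates, ["photosynthesis"])
print(scores)
```

Examiner B is systematically more lenient than examiner A in this toy setup. Averaging over both means that neither examiner's offset alone determines a candidate's final score, which is the intuition behind using several examiners rather than one.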
Findings
Effective in broad domain coverage and in-depth assessment.
Aligns closely with human evaluation results.
Addresses biases through peer-examination.
Abstract
Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
