Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

TL;DR
This paper shows that answer matching, which uses a language model to grade free-form responses against reference answers, aligns with human judgment better than traditional multiple choice evaluation.
Contribution
The paper introduces answer matching as a scalable, more accurate alternative to multiple choice evaluation, showing its near-perfect agreement with human grading.
Findings
Answer matching achieves near-perfect agreement with human grading.
Multiple choice evaluation aligns poorly with human judgment.
Model rankings change significantly when free-form responses are graded with answer matching instead of multiple choice.
Abstract
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show that multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice, but we show that this has changed. We consider generative evaluation via what we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we…
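The answer matching procedure described in the abstract can be sketched as a short grading loop. This is a minimal illustration, not the authors' implementation: the function name `answer_matching_accuracy`, the `candidate` and `matcher` callables, and the toy data are all assumptions made for this example. In particular, `toy_matcher` is a substring check standing in for the language model that the paper uses as the matcher.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interfaces (not from the paper):
#   candidate(question) -> free-form response, given the question WITHOUT options
#   matcher(question, reference, response) -> True if the response matches the reference
Candidate = Callable[[str], str]
Matcher = Callable[[str, str, str], bool]


def answer_matching_accuracy(
    items: Iterable[Tuple[str, str]],
    candidate: Candidate,
    matcher: Matcher,
) -> float:
    """Fraction of (question, reference) items whose response the matcher accepts."""
    correct = 0
    total = 0
    for question, reference in items:
        # Step 1: the candidate model sees only the question, never the options.
        response = candidate(question)
        # Step 2: the matcher compares the response against the reference answer.
        if matcher(question, reference, response):
            correct += 1
        total += 1
    return correct / total if total else 0.0


# Toy stand-ins so the sketch runs without any model access.
def toy_candidate(question: str) -> str:
    canned = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "5",
    }
    return canned.get(question, "I don't know.")


def toy_matcher(question: str, reference: str, response: str) -> bool:
    # Assumption: a case-insensitive substring check replaces the language model
    # matcher here; a real matcher would prompt a model with all three fields.
    return reference.lower() in response.lower()


ITEMS = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

accuracy = answer_matching_accuracy(ITEMS, toy_candidate, toy_matcher)
```

Passing the candidate and the matcher as plain callables keeps the loop independent of any particular model API, so the substring stand-in can be swapped for a call to a language model without changing the grading logic.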
