An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments
Naghmeh Farzi, Laura Dietz

TL;DR
This paper introduces a novel evaluation method for information retrieval and generation systems that relies on exam questions and answerability rather than traditional relevance judgments, enabling more flexible and ongoing assessment.
Contribution
It proposes the EXAM Answerability Metric and a new paradigm for IR evaluation that does not depend on relevance judgments, using exam questions and answerability as core concepts.
Findings
Developed the EXAM Answerability Metric for system evaluation.
Introduced two measures: EXAM Cover and EXAM Qrels.
Enabled post-hoc expansion and continuous evaluation of systems.
Abstract
Current IR evaluation is based on relevance judgments, created either manually or automatically, with decisions outsourced to Large Language Models (LLMs). We offer an alternative paradigm that never relies on relevance judgments in any form. Instead, a text is defined as relevant if it contains information that enables the answering of key questions. We use this idea to design the EXAM Answerability Metric to evaluate information retrieval/generation systems for their ability to provide topically relevant information. We envision the role of a human judge as editing and defining an exam question bank that will test for the presence of relevant information in text. We support this step by generating an initial set of exam questions. In the next phase, an LLM-based question answering system will automatically grade system responses by tracking which exam questions are answerable with which…
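The grading step described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the `ANSWER_HINTS` table, and the keyword-matching `toy_grader`, which merely stands in for the LLM-based question answering system. The coverage score (the fraction of exam questions answerable from a system's passages) is one plausible reading of the idea, not the paper's exact definition of EXAM Cover.

```python
from typing import Callable, Dict, List, Set

# A grader decides whether a passage lets us answer a question.
# In the paper this role is played by an LLM-based QA system;
# here it is an abstract callable (hypothetical interface).
Grader = Callable[[str, str], bool]


def answerable_questions(
    questions: Dict[str, str], passages: List[str], grader: Grader
) -> Set[str]:
    """Return the ids of exam questions answerable from at least one passage."""
    return {
        qid
        for qid, question in questions.items()
        if any(grader(question, passage) for passage in passages)
    }


def coverage(
    questions: Dict[str, str], passages: List[str], grader: Grader
) -> float:
    """Fraction of the question bank answerable from the system's passages."""
    if not questions:
        return 0.0
    return len(answerable_questions(questions, passages, grader)) / len(questions)


# Hypothetical stand-in for the LLM grader: a question counts as
# answerable if its expected answer phrase occurs in the passage.
ANSWER_HINTS = {
    "What gas do plants absorb?": "carbon dioxide",
    "What pigment captures light?": "chlorophyll",
    "Where does photosynthesis occur?": "chloroplast",
}


def toy_grader(question: str, passage: str) -> bool:
    return ANSWER_HINTS[question] in passage.lower()


# Toy exam question bank and one system's response passages.
questions = {
    "q1": "What gas do plants absorb?",
    "q2": "What pigment captures light?",
    "q3": "Where does photosynthesis occur?",
}
passages = [
    "Plants absorb carbon dioxide through their leaves.",
    "Chlorophyll is the green pigment that captures light.",
]

covered = answerable_questions(questions, passages, toy_grader)
score = coverage(questions, passages, toy_grader)
```

In this toy run, two of the three questions are answerable from the passages, so the score is 2/3. Because the score depends only on the question bank and the grader, new questions or new systems can be added later without re-judging anything, which is the property the summary calls post-hoc expansion.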
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment
Methods: Sparse Evolutionary Training · Focus
