Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries

Chaitanya Malaviya; Joseph Chee Chang; Dan Roth; Mohit Iyyer; Mark Yatskar; Kyle Lo

arXiv:2411.07237·cs.CL·May 27, 2025

TL;DR

This paper introduces a method called contextualized evaluations that constructs specific contexts around underspecified queries to improve the assessment of language model responses, revealing how context influences evaluation outcomes and model behavior.

Contribution

The paper proposes a novel evaluation protocol that incorporates synthetic context, enabling more accurate and insightful assessments of language models on underspecified queries.

Findings

01. Context significantly affects evaluation outcomes, sometimes reversing model rankings.

02. Contextualized evaluation reduces surface-level bias in judgments.

03. Models show varied sensitivity to different contextual cues.

Abstract

Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response to be useful -- is not explicit. For instance, a good response to a subjective query like "What book should I read next?" would depend on the user's preferences, and a good response to an open-ended query like "How do antibiotics work against bacteria?" would depend on the user's expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even…
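
As a rough illustration of the protocol described in the abstract, the sketch below builds a pairwise judging prompt for an underspecified query, once with synthetic context and once without. The representation of context as attribute/value pairs, the names `ContextualizedQuery` and `build_judge_prompt`, and the prompt wording are all assumptions made for this example; they are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field


@dataclass
class ContextualizedQuery:
    """An underspecified query plus optional synthetic context.

    Assumption: context is stored as attribute -> value pairs
    (e.g. "expertise" -> "high-school student"). This layout is
    hypothetical and chosen only for illustration.
    """
    query: str
    context: dict[str, str] = field(default_factory=dict)


def build_judge_prompt(item: ContextualizedQuery,
                       response_a: str,
                       response_b: str,
                       use_context: bool = True) -> str:
    """Assemble a pairwise-comparison prompt for a human or model judge."""
    lines = [f"Query: {item.query}"]
    # Only the contextualized setting shows the judge the synthetic context.
    if use_context and item.context:
        lines.append("Context:")
        for attribute, value in item.context.items():
            lines.append(f"- {attribute}: {value}")
    lines.append(f"Response A: {response_a}")
    lines.append(f"Response B: {response_b}")
    lines.append("Which response is more useful for this user? Answer A or B.")
    return "\n".join(lines)


# Example built from the abstract's antibiotics query; the context values are invented.
item = ContextualizedQuery(
    query="How do antibiotics work against bacteria?",
    context={"expertise": "high-school student", "desired length": "short"},
)
with_context = build_judge_prompt(item, "A technical answer.", "A simple answer.")
without_context = build_judge_prompt(item, "A technical answer.", "A simple answer.",
                                     use_context=False)
```

Comparing judgments collected with `with_context` against those collected with `without_context` is one way to observe whether the presence of context changes which response is preferred.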

Code & Models

Datasets

allenai/ContextEval
dataset · 59 dl

Taxonomy

Topics: Evaluation and Performance Assessment