Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Ruchira Dhar; Anders Søgaard

arXiv:2604.25923·cs.CL·April 30, 2026

TL;DR

This paper reviews and categorizes evaluation concerns in NLP, providing a taxonomy and practical checklist to improve evaluation practices, especially in the context of large language models.

Contribution

The paper offers a comprehensive taxonomy of evaluation concerns in NLP, combining historical perspective with practical tools for evaluation design and interpretation.

Findings

01. Developed a taxonomy of evaluation concerns in NLP.

02. Synthesized recurring positions and trade-offs in evaluation.

03. Provided a structured checklist for evaluation practices.

Abstract

Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
