Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
Ruchira Dhar, Anders Søgaard

TL;DR
This paper reviews and categorizes evaluation concerns in NLP, providing a taxonomy and practical checklist to improve evaluation practices, especially in the context of large language models.
Contribution
It offers a comprehensive taxonomy of evaluation concerns in NLP, integrating historical perspectives and practical tools for better evaluation design and interpretation.
Findings
Developed a taxonomy of evaluation concerns in NLP.
Synthesized recurring positions and trade-offs in evaluation.
Provided a structured checklist for evaluation practices.
Abstract
Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
