Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

Khloud AL Jallad; Nada Ghneim; Ghaida Rebdawi

arXiv:2507.20419·cs.CL·July 29, 2025

TL;DR

This survey reviews existing NLU benchmarks with diagnostics datasets across multiple languages, highlighting the lack of standardization in linguistic phenomena categorization and evaluation metrics, and advocates for unified diagnostics standards.

Contribution

It provides a comprehensive comparison of NLU benchmarks, identifies gaps in linguistic phenomena coverage, and proposes the development of standardized evaluation metrics and categories.

Findings

01 No standard naming convention for linguistic categories.

02 Lack of a unified set of linguistic phenomena in benchmarks.

03 Potential benefits of standardized diagnostics metrics.

Abstract

Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that has attracted researchers over the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets for evaluating the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks…
