Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding
Zeming Chen, Qiyue Gao

TL;DR
This paper introduces Curriculum, a comprehensive benchmark for evaluating language models on 36 linguistic phenomena, aiming to diagnose model capabilities and limitations in natural language understanding.
Contribution
It presents a new broad-coverage NLI benchmark with diverse datasets and an evaluation procedure to assess linguistic reasoning skills in language models.
Findings
Curriculum effectively diagnoses model strengths and weaknesses.
Existing benchmarks have limitations in covering diverse linguistic phenomena.
The benchmark reveals gaps in current models' understanding.
Abstract
In the age of large transformer language models, linguistic evaluation plays an important role in diagnosing models' abilities and limitations on natural language understanding. However, current evaluation methods show some significant shortcomings. In particular, they do not provide insight into how well a language model captures distinct linguistic skills essential for language understanding and reasoning. Thus they fail to effectively map out the aspects of language understanding that remain challenging to existing models, which makes it hard to discover potential limitations in models and datasets. In this paper, we introduce Curriculum as a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena. Curriculum contains a collection of datasets that covers 36 types of major linguistic phenomena and an evaluation procedure for diagnosing how well a language…
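To make the idea of a per-phenomenon diagnosis concrete, here is a minimal sketch of how accuracy on an NLI benchmark could be broken down by linguistic phenomenon. It is an illustration only, not the paper's evaluation procedure: the function name `accuracy_by_phenomenon`, the example fields, the phenomenon labels ("negation", "lexical"), and the toy sentences are all assumptions made up for this example.

```python
from collections import defaultdict

def accuracy_by_phenomenon(examples, predict):
    """Return {phenomenon: accuracy} for labelled NLI examples.

    `examples` is a list of dicts with hypothetical keys
    "phenomenon", "premise", "hypothesis", and "label".
    `predict` is any callable mapping (premise, hypothesis) to a label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["phenomenon"]] += 1
        if predict(ex["premise"], ex["hypothesis"]) == ex["label"]:
            correct[ex["phenomenon"]] += 1
    return {p: correct[p] / total[p] for p in total}

# Toy data: the phenomenon names and sentences are invented, not taken from Curriculum.
examples = [
    {"phenomenon": "negation", "premise": "The dog is not barking.",
     "hypothesis": "The dog is barking.", "label": "contradiction"},
    {"phenomenon": "negation", "premise": "No cat slept.",
     "hypothesis": "A cat slept.", "label": "contradiction"},
    {"phenomenon": "lexical", "premise": "A man bought a car.",
     "hypothesis": "A man bought a vehicle.", "label": "entailment"},
    {"phenomenon": "lexical", "premise": "A man bought a vehicle.",
     "hypothesis": "A man bought a car.", "label": "neutral"},
]

def always_entailment(premise, hypothesis):
    """Trivial baseline standing in for a real model."""
    return "entailment"

scores = accuracy_by_phenomenon(examples, always_entailment)
for phenomenon, acc in sorted(scores.items()):
    print(f"{phenomenon}: {acc:.2f}")
```

Reporting one score per phenomenon, rather than a single aggregate, is what lets a weakness in one skill show up even when overall accuracy looks acceptable.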
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
