A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
Elie Antoine (LIS, TALEP), Frédéric Béchet (LIS, TALEP), Géraldine Damnati, Philippe Langlais (DIRO)

TL;DR
This paper proposes a linguistically-motivated evaluation method for reading comprehension models, using semantic frame annotation to identify linguistic complexities that challenge models regardless of size or architecture.
Contribution
It introduces a novel evaluation approach based on semantic complexity factors, validated on French and English benchmarks, to better understand model limitations in reading comprehension.
Findings
Semantic complexity factors predict model failures.
Fine-grained automatic evaluation reveals specific linguistic challenges.
State-of-the-art models struggle with certain linguistic features.
Abstract
We introduce an evaluation methodology for reading comprehension tasks based on the intuition that certain examples, by virtue of their linguistic complexity, consistently yield lower scores regardless of model size or architecture. We capitalize on semantic frame annotation for characterizing this complexity, and study seven complexity factors that may account for models' difficulty. We first deploy this methodology on a carefully annotated French reading comprehension benchmark, showing that two of those complexity factors are indeed good predictors of models' failure, while others are less so. We further deploy our methodology on a well-studied English benchmark by using Chat-GPT as a proxy for semantic annotation. Our study reveals that fine-grained, linguistically-motivated automatic evaluation of a reading comprehension task is not only possible, but helps understand models'…
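To make the idea of a complexity factor "predicting" model failure concrete, here is a minimal sketch of one way such an analysis could be set up. It is not the authors' implementation. The function `factor_gap`, the factor names `factor_a` and `factor_b`, and the scores are all hypothetical and made up for illustration. The sketch assumes each example carries a score (for instance, correctness averaged over several models) and a set of binary complexity flags, and it compares the mean score of examples with and without a given factor.

```python
from statistics import mean


def factor_gap(examples, factor):
    """Return mean score without the factor minus mean score with it.

    A large positive gap suggests that examples carrying the factor are
    harder. Returns None when either group is empty.
    """
    with_f = [ex["score"] for ex in examples if ex["factors"].get(factor, False)]
    without_f = [ex["score"] for ex in examples if not ex["factors"].get(factor, False)]
    if not with_f or not without_f:
        return None
    return mean(without_f) - mean(with_f)


# Toy, invented data: "score" stands in for per-example correctness
# averaged across models; factor names are placeholders, not the
# seven factors studied in the paper.
examples = [
    {"score": 1.0, "factors": {"factor_a": False, "factor_b": True}},
    {"score": 0.0, "factors": {"factor_a": True, "factor_b": True}},
    {"score": 1.0, "factors": {"factor_a": False, "factor_b": False}},
    {"score": 0.0, "factors": {"factor_a": True, "factor_b": False}},
]

gaps = {f: factor_gap(examples, f) for f in ("factor_a", "factor_b")}
```

In this toy setup, a factor whose gap is close to zero would be a poor predictor of failure, while a factor with a large gap would be a candidate predictor. A real analysis would also need a significance test and a check that the factors are not confounded with one another.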
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
