GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Adrian-Marius Dumitran, Alexandra-Mihaela Danila, Angela-Liliana Dumitran

TL;DR
GRILE is a new benchmark for evaluating Romanian language models on grammar reasoning and explanations, revealing current limitations and guiding future educational NLP research in low-resource languages.
Contribution
Introduces GRILE, the first open benchmark with 1,151 Romanian exam questions to assess LLMs' answer accuracy and explanation quality, highlighting systematic weaknesses.
Findings
Gemini 2.5 Pro achieves 83% accuracy
Most open-weight models score below 65%
48% of explanations contain factual or pedagogical flaws
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3…
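The answer-selection ability in (i) is reported as plain accuracy over multiple-choice items. A minimal sketch of that metric follows. It assumes each item carries a single gold option label and that the model's reply has already been reduced to one label. The `Item` class, its field names, and the toy items are hypothetical illustrations, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not GRILE's own)."""
    question: str
    options: dict  # option label -> option text
    gold: str      # label of the correct option


def accuracy(items, predictions):
    """Return the fraction of items whose predicted label matches the gold label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1
        for item, pred in zip(items, predictions)
        # Compare labels case-insensitively and ignore surrounding whitespace.
        if pred.strip().upper() == item.gold.strip().upper()
    )
    return correct / len(items)


# Toy placeholder items, made up for illustration only.
items = [
    Item("toy question 1", {"A": "option 1", "B": "option 2", "C": "option 3"}, "A"),
    Item("toy question 2", {"A": "option 1", "B": "option 2", "C": "option 3"}, "C"),
    Item("toy question 3", {"A": "option 1", "B": "option 2", "C": "option 3"}, "C"),
]
predictions = ["a", "B", " C "]

score = accuracy(items, predictions)
print(f"accuracy = {score:.2%}")
```

Explanation quality, ability (ii), is judged by expert review in the paper, so a label match like this one does not cover it.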
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Legal Language and Interpretation
