Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi

TL;DR
This study develops interpretable machine learning models that classify Estonian learner texts by CEFR level using linguistically meaningful features. The models achieve high accuracy and provide insights into language development over time.
Contribution
The paper introduces a feature selection approach for language proficiency classification that improves interpretability and stability across text types, and applies it successfully to Estonian learner texts.
Findings
Best classifiers achieved around 90% accuracy.
Writings became more complex over 7-10 years.
Models maintained 80% accuracy on earlier exam data.
Abstract
Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar…
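The comparison the abstract describes, between models restricted to pre-selected proficiency features and models that may also use other features, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature names (`lemma_diversity`, `errors_per_word`, `task_word_share`), the toy values, and the nearest-centroid classifier are hypothetical and are not the paper's features, data, or models.

```python
from statistics import mean

# Hypothetical toy data: (feature dict, CEFR level). All values are invented.
TRAIN = [
    ({"lemma_diversity": 0.40, "errors_per_word": 0.20, "task_word_share": 0.9}, "A2"),
    ({"lemma_diversity": 0.45, "errors_per_word": 0.18, "task_word_share": 0.1}, "A2"),
    ({"lemma_diversity": 0.70, "errors_per_word": 0.05, "task_word_share": 0.8}, "C1"),
    ({"lemma_diversity": 0.75, "errors_per_word": 0.04, "task_word_share": 0.2}, "C1"),
]

# Assumed split: features tied to complexity/correctness vs. all features,
# including one that mostly reflects the writing task.
SELECTED = ["lemma_diversity", "errors_per_word"]
ALL_FEATURES = SELECTED + ["task_word_share"]


def centroids(rows, features):
    """Average each chosen feature per level (a nearest-centroid 'model')."""
    by_level = {}
    for feats, level in rows:
        by_level.setdefault(level, []).append([feats[f] for f in features])
    return {lvl: [mean(col) for col in zip(*vecs)] for lvl, vecs in by_level.items()}


def predict(feats, cents, features):
    """Return the level whose centroid is closest in squared Euclidean distance."""
    vec = [feats[f] for f in features]
    return min(cents, key=lambda lvl: sum((a - b) ** 2 for a, b in zip(vec, cents[lvl])))


def accuracy(rows, cents, features):
    """Share of rows whose predicted level matches the labelled level."""
    hits = sum(predict(feats, cents, features) == lvl for feats, lvl in rows)
    return hits / len(rows)


selected_model = centroids(TRAIN, SELECTED)
full_model = centroids(TRAIN, ALL_FEATURES)
```

Calling `accuracy(held_out_rows, selected_model, SELECTED)` and `accuracy(held_out_rows, full_model, ALL_FEATURES)` on the same held-out rows gives the kind of side-by-side comparison the abstract refers to; a restricted model whose accuracy stays close to the full model's is the easier one to explain.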
Taxonomy
Topics: Text Readability and Simplification · Psychometric Methodologies and Testing · Second Language Acquisition and Learning
