TL;DR
This study examines how effectively large language models, specifically GPT-4-Turbo, evaluate texts written by Korean students for educational feedback, across several writing quality criteria.
Contribution
The study shows that LLMs can reliably assess some aspects of human writing, such as grammaticality and fluency, in educational contexts, and it provides a new dataset for further research.
Findings
LLMs reliably evaluate grammaticality and fluency.
LLMs struggle with more subjective criteria such as coherence and relevance.
Public dataset and feedback released for future research.
Abstract
Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text. Extending this methodology to assess human-written text could significantly benefit educational settings by providing direct feedback to enhance writing skills, although this application is not straightforward. In this paper, we investigate whether LLMs can effectively assess human-written text for educational purposes. We collected 100 texts from 32 Korean students across 15 types of writing and employed GPT-4-Turbo to evaluate them using grammaticality, fluency, coherence, consistency, and relevance as criteria. Our analyses indicate that LLM evaluators can reliably assess grammaticality and fluency, as well as more objective types of writing, though they struggle with other criteria and types of writing. We publicly release our dataset and feedback.
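The evaluation setup described in the abstract (one text, five criteria, one LLM judgment per criterion) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the prompt wording, the 1 to 5 score scale, the JSON reply format, and the helper names `build_prompt` and `parse_scores` are all assumptions introduced here. Only the five criterion names come from the abstract, and the call to the model itself is deliberately left out.

```python
import json

# The five criteria named in the abstract.
CRITERIA = ["grammaticality", "fluency", "coherence", "consistency", "relevance"]


def build_prompt(text: str, writing_type: str) -> str:
    """Assemble a rubric-style prompt (wording and 1-5 scale are hypothetical)."""
    rubric = "\n".join(f"- {c}: integer score from 1 to 5" for c in CRITERIA)
    return (
        f"You are evaluating a student's {writing_type}.\n"
        "Score the text on each criterion below and reply with a JSON object "
        "mapping each criterion name to its score.\n"
        f"{rubric}\n\nText:\n{text}"
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Check an assumed JSON reply: every criterion present, scores in 1..5."""
    data = json.loads(reply)
    scores = {}
    for criterion in CRITERIA:
        if criterion not in data:
            raise ValueError(f"missing criterion: {criterion}")
        value = data[criterion]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"non-integer score for {criterion}: {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"score out of range for {criterion}: {value}")
        scores[criterion] = value
    return scores
```

In use, the string returned by `build_prompt` would be sent to whichever model is being tested, and the model's reply passed to `parse_scores`. Validating the reply before aggregating keeps malformed or out-of-range judgments from silently skewing per-criterion comparisons.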
