Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education

Seungyoon Kim; Seungone Kim

arXiv:2407.17022·cs.CL·July 25, 2024

TL;DR

This study examines how well large language models, specifically GPT-4-Turbo, can evaluate texts written by Korean students for educational feedback, using five criteria: grammaticality, fluency, coherence, consistency, and relevance.

Contribution

The paper shows that LLM evaluators can reliably assess some aspects of human writing, namely grammaticality and fluency, in an educational context, and it releases a new dataset for further research.

Findings

01

LLM evaluators reliably assess grammaticality and fluency.

02

LLM evaluators struggle with more subjective criteria such as coherence and relevance.

03

The dataset and the LLM-generated feedback are publicly released for future research.

Abstract

Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text. Extending this methodology to assess human-written text could significantly benefit educational settings by providing direct feedback to enhance writing skills, although this application is not straightforward. In this paper, we investigate whether LLMs can effectively assess human-written text for educational purposes. We collected 100 texts from 32 Korean students across 15 types of writing and employed GPT-4-Turbo to evaluate them using grammaticality, fluency, coherence, consistency, and relevance as criteria. Our analyses indicate that LLM evaluators can reliably assess grammaticality and fluency, as well as more objective types of writing, though they struggle with other criteria and types of writing. We publicly release our dataset and feedback.
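The evaluation setup described in the abstract, one judgment per text per criterion, can be illustrated with a minimal sketch. This is not the authors' pipeline: the prompt wording, the 1-5 score scale, the reply format, and the names `build_prompt`, `parse_score`, and `evaluate` are all assumptions made for illustration. The `judge` argument stands in for any function that sends a prompt to a model such as GPT-4-Turbo and returns its reply as a string.

```python
import re
from typing import Callable, Dict, Optional

# The five criteria named in the abstract.
CRITERIA = ["grammaticality", "fluency", "coherence", "consistency", "relevance"]


def build_prompt(text: str, criterion: str, writing_type: str) -> str:
    """Build a single-criterion rubric prompt (wording and 1-5 scale are assumed)."""
    return (
        f"You are evaluating a student's {writing_type}.\n"
        f"Rate its {criterion} on a 1-5 scale and give brief feedback.\n"
        "Answer as 'Feedback: <text>' followed by 'Score: <1-5>'.\n\n"
        f"Text:\n{text}"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract an integer score in 1..5 from the reply, or None if absent or malformed."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def evaluate(
    text: str, writing_type: str, judge: Callable[[str], str]
) -> Dict[str, Optional[int]]:
    """Query the judge once per criterion and collect the parsed scores."""
    return {
        criterion: parse_score(judge(build_prompt(text, criterion, writing_type)))
        for criterion in CRITERIA
    }


if __name__ == "__main__":
    # Stub judge for demonstration only; a real judge would call an LLM API.
    def stub_judge(prompt: str) -> str:
        return "Feedback: Clear and mostly correct.\nScore: 4"

    print(evaluate("Sample student essay.", "argumentative essay", stub_judge))
```

Scoring each criterion in a separate call keeps the judgments independent, which makes it easier to compare per-criterion reliability against human ratings; a single combined prompt would be cheaper but could let one criterion's score influence another.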

Code & Models

Repositories

seungyoon1/llm-as-a-judge-human-eval (Official)