EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Fan Gao; Dongyuan Li; Ding Xia; Fei Mi; Yasheng Wang; Lifeng Shang; Baojun Wang

arXiv:2506.02596 · cs.CL · June 4, 2025

TL;DR

This paper introduces EssayBench, a multi-genre benchmark for evaluating how well large language models write and assess Chinese essays, addressing gaps in existing evaluation methods.

Contribution

We propose a new multi-genre benchmark with authentic prompts and a detailed scoring framework for Chinese essays, and evaluate 15 large language models across genres and instruction types.

Findings

1. The benchmark reveals strengths and limitations of LLMs in Chinese essay writing.
2. Fine-grained scoring improves evaluation reliability.
3. The analysis highlights genre-specific differences in model performance.

Abstract

Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods