An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment
Xuanxin Wu, Yuki Arase

TL;DR
This paper evaluates large language models' sentence simplification abilities using an error-based human assessment framework, revealing the models' strengths and limitations as well as the inadequacy of current automatic metrics for assessing high-quality simplifications.
Contribution
It introduces an error-based human annotation framework for more reliable evaluation of LLMs in sentence simplification, addressing limitations of existing methods.
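As a rough illustration of what error-based annotation can yield, the sketch below tallies annotator-marked errors into per-system error rates. It is not the paper's framework: the `Annotation` record, the `error_rates` helper, and the error labels used with them are hypothetical placeholders, and the actual error taxonomy is not given in this summary.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    system: str       # which model produced the simplification
    sentence_id: int  # index of the source sentence
    error_type: str   # hypothetical label; "none" means no error was marked


def error_rates(annotations):
    """Return {system: {error_type: fraction of that system's sentences with the error}}."""
    sentences = {}  # system -> set of annotated sentence ids
    errors = {}     # system -> set of (sentence_id, error_type) pairs
    for a in annotations:
        sentences.setdefault(a.system, set()).add(a.sentence_id)
        if a.error_type != "none":
            # A set counts each error type at most once per sentence.
            errors.setdefault(a.system, set()).add((a.sentence_id, a.error_type))
    rates = {}
    for system, ids in sentences.items():
        counts = Counter(err for _, err in errors.get(system, set()))
        rates[system] = {err: n / len(ids) for err, n in counts.items()}
    return rates
```

Reporting a rate per error type, rather than a single overall score, is one way to keep the comparison between systems interpretable while keeping each annotation decision simple.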
Findings
LLMs generate fewer errors than previous models.
GPT-4 and Qwen2.5-72B struggle with lexical paraphrasing.
Automatic metrics lack sensitivity for high-quality LLM outputs.
Abstract
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We…
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
