Policy-based Sentence Simplification: Replacing Parallel Corpora with LLM-as-a-Judge

Xuanxin Wu; Yuki Arase; Masaaki Nagata

arXiv:2512.06228·cs.CL·December 9, 2025

Open Access 3 Reviews

TL;DR

This paper presents a novel approach using Large Language Model-as-a-Judge to automatically generate policy-aligned training data for sentence simplification, eliminating the need for parallel corpora and enabling adaptable, policy-driven simplification systems.

Contribution

It introduces a method that leverages LLMs as judges to create training data aligned with specific simplification policies, improving flexibility and reducing reliance on costly human annotations.
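The data-construction idea can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: `PreferencePair`, `build_preference_pair`, and `min_margin` are hypothetical names, and `toy_lexical_judge` is a crude stand-in for an LLM judge prompted with a lexical-simplification policy. The sketch scores each candidate simplification against the policy and keeps the best and worst candidates as a chosen/rejected pair for preference optimization.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    """One training example for preference optimization (hypothetical schema)."""
    source: str
    chosen: str
    rejected: str


def build_preference_pair(
    source: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> Optional[PreferencePair]:
    """Score every candidate with the judge; pair the best with the worst.

    Returns None when there are fewer than two candidates or when the judge
    does not separate them by more than `min_margin`.
    """
    if len(candidates) < 2:
        return None
    scored = sorted(((judge(source, c), c) for c in candidates), key=lambda sc: sc[0])
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
    if high_score - low_score <= min_margin:
        return None
    return PreferencePair(source=source, chosen=chosen, rejected=rejected)


def toy_lexical_judge(source: str, candidate: str) -> float:
    """Assumed stand-in for an LLM judge following a lexical policy.

    Rewards shorter words and penalizes changes in sentence length, as a
    rough proxy for "replace complex words, keep the structure".
    """
    src_tokens, cand_tokens = source.split(), candidate.split()
    avg_word_len = sum(len(w) for w in cand_tokens) / max(len(cand_tokens), 1)
    length_change = abs(len(src_tokens) - len(cand_tokens))
    return -avg_word_len - length_change
```

In a real system the judge would be an LLM call returning a policy-specific score or preference, and the resulting pairs would be passed to a preference-optimization trainer; swapping in a different judge prompt is what would adapt the data to a different policy.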

Findings

01

Small open-source LLMs outperform GPT-4o on lexical simplification.

02

The approach achieves comparable results to GPT-4o on sentence rewriting.

03

The method demonstrates robustness across different model sizes and families.

Abstract

Sentence simplification aims to modify a sentence to make it easier to read and understand while preserving the meaning. Different applications require distinct simplification policies, such as replacing only complex words at the lexical level or rewriting the entire sentence while trading off details for simplicity. However, achieving such policy-driven control remains an open challenge. In this work, we introduce a simple yet powerful approach that leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for costly human annotation or parallel corpora. Our method enables building simplification systems that adapt to diverse simplification policies. Remarkably, even small-scale open-source LLMs such as Phi-3-mini-3.8B surpass GPT-4o on lexical-oriented simplification, while achieving comparable…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- convincing motivation for the work, showing the need for an approach that goes beyond dedicated seq-to-seq models as well as beyond large-scale LLMs, and that allows for adapting to a specific simplification policy
- state-of-the-art preference optimisation approach combining ARPO and SimPO
- strong baselines/toplines
- good (and somehow reassuring) results

Weaknesses

- no real weakness for me, this is a good paper
- one detail: the fact that SARI has been shown to only weakly correlate with human judgement could be mentioned, and the use of SARI nevertheless could be motivated; this is one of the reasons why human evaluation is important, which the authors include in their paper
- another detail: the link between use cases and possible policies could be discussed a bit more (e.g. what about simplification targeted towards people with cognitive disabilities? w

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The paper is clear

Weaknesses

The whole pipeline of constructing preference pairs and applying preference learning algorithms has been in use for the past two years, since RLHF came out; see, for example, "Self-Rewarding Language Models" (ICML 2024). The paper just applies it to a specific task. The paper uses GPT-4o as an upper bound, but it is quite an old model; the paper should evaluate the current best proprietary model (GPT-5) on sentence simplification, which is likely a solved task. And GPT-5 is also

Reviewer 03 · Rating 2 · Confidence 3

Strengths

In this manuscript, the authors aim to address the challenges of policy-driven control in sentence simplification, including the lack of policy-specific parallel corpora and the poor policy alignment of small-scale open-source LLMs. The proposed LLM-as-a-Judge automatically constructs policy-aligned preference data and uses preference optimization to fine-tune open-source LLMs. Experimental results across automatic metrics and human evaluations show that the method outperforms baselines. Notably

Weaknesses

There are some concerns about the manuscript, as follows:
1. The proposed LLM-as-a-Judge selects high-quality preference data for the overall-rewriting and lexical-paraphrasing policies. This means that the candidate simplifications will be generated by various LLMs, so what is the computational cost of this method? This is an important issue, but it is not discussed in the experiments.
2. The proposed LLM-as-a-Judge selects preference data and fine-tunes open-source LLMs, the important preference optim

Taxonomy

Topics: Text Readability and Simplification · Artificial Intelligence in Healthcare and Education · Second Language Acquisition and Learning