German4All -- A Dataset and Model for Readability-Controlled Paraphrasing in German

Miriam Anschütz; Thanh Mai Pham; Eslam Nasrallah; Maximilian Müller; Cristian-George Craciun; Georg Groh

arXiv:2508.17973 · cs.CL · September 1, 2025

TL;DR

German4All is a large-scale dataset and model for generating paraphrases in German at different readability levels, facilitating accessible and tailored texts for diverse readers.

Contribution

Introduces the first large-scale German dataset of readability-controlled paraphrases and trains a state-of-the-art model for German text simplification.

Findings

01 The dataset spans five readability levels and comprises over 25,000 samples.

02 The model achieves state-of-the-art performance in German text simplification.

03 Both the dataset and the model are open-sourced for further research.

Abstract

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned, readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
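To make "aligned" concrete, here is a minimal sketch of how one sample with a paraphrase at each of the five readability levels could be represented, and how it could be expanded into level-conditioned training pairs. The schema, the field names (`sample_id`, `paraphrases`, `target_level`), the 1-to-5 level numbering, and the all-pairs strategy are illustrative assumptions, not the published corpus format or the authors' training setup.

```python
from dataclasses import dataclass
from itertools import permutations

# The abstract states five readability levels; numbering them 1..5 is an assumption.
NUM_LEVELS = 5


@dataclass(frozen=True)
class AlignedSample:
    """One paragraph with a paraphrase at every readability level (hypothetical schema)."""
    sample_id: int
    paraphrases: dict[int, str]  # readability level -> paraphrase text


@dataclass(frozen=True)
class TrainingPair:
    """A source text, the requested target level, and the aligned target text."""
    source_text: str
    target_level: int
    target_text: str


def make_pairs(sample: AlignedSample) -> list[TrainingPair]:
    """Build every ordered (source level -> target level) pair from one aligned sample."""
    levels = sorted(sample.paraphrases)
    return [
        TrainingPair(sample.paraphrases[src], tgt, sample.paraphrases[tgt])
        for src, tgt in permutations(levels, 2)
    ]


# Placeholder texts stand in for real paraphrases; no corpus data is reproduced here.
sample = AlignedSample(
    sample_id=0,
    paraphrases={level: f"<Paraphrase auf Stufe {level}>" for level in range(1, NUM_LEVELS + 1)},
)
pairs = make_pairs(sample)
```

Under these assumptions, one aligned sample yields 5 × 4 = 20 ordered pairs, covering both simplification (higher to lower level) and the reverse direction.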

Code & Models

Datasets

tum-nlp/German4All-Corpus
dataset · 202 downloads