DEPLAIN: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification
Regina Stodden, Omar Momen, Laura Kallmeyer

TL;DR
DEplain is a new German parallel corpus of professionally written, manually aligned simplifications in plain German. It is designed to support training and evaluation of automatic sentence and document simplification systems, and it is accompanied by tools for harvesting and aligning additional data.
Contribution
The paper introduces DEplain, a high-quality German parallel corpus for text simplification, along with alignment methods and tools to facilitate corpus expansion and model training.
Findings
Transformer-based models trained on DEplain show promising results.
The corpus includes approximately 15,000 sentence pairs: approx. 13k from the news domain and approx. 2k from the web domain.
Tools for automatic alignment and web harvesting are developed and shared.
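To make the idea of automatic sentence alignment concrete, here is a minimal sketch in Python. It is not one of the alignment methods evaluated in the paper, which this summary does not describe. It is a toy word-overlap (Jaccard) baseline, and the function names, the threshold of 0.3, and the example sentences are all assumptions made for illustration.

```python
import re


def tokens(sentence):
    """Lowercased word tokens; \\w is Unicode-aware, so umlauts are kept."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(complex_sents, simple_sents, threshold=0.3):
    """Pair each simple sentence with its most similar complex sentence.

    Returns (complex_index, simple_index, score) tuples for every pair whose
    score reaches the (assumed, illustrative) threshold. Several simple
    sentences may map to one complex sentence, which models sentence splits.
    """
    pairs = []
    if not complex_sents:
        return pairs
    for j, simple in enumerate(simple_sents):
        scored = [(jaccard(c, simple), i) for i, c in enumerate(complex_sents)]
        # Highest score wins; ties go to the earlier complex sentence.
        score, i = max(scored, key=lambda x: (x[0], -x[1]))
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical example sentences, not taken from the corpus.
complex_doc = [
    "Die Regierung hat gestern ein neues Gesetz zur Senkung der Steuern beschlossen.",
    "Das Wetter bleibt in den kommenden Tagen wechselhaft.",
]
simple_doc = [
    "Die Regierung hat ein neues Gesetz beschlossen.",
    "Die Steuern werden gesenkt.",
    "Das Wetter bleibt wechselhaft.",
]

for i, j, score in align(complex_doc, simple_doc):
    print(f"complex[{i}] <-> simple[{j}]  (Jaccard {score:.2f})")
```

In this example the second simple sentence ("Die Steuern werden gesenkt.") stays unaligned, because its wording barely overlaps with the source sentence it paraphrases. Cases like this are why purely lexical baselines tend to fall short for simplification data, and why manually aligned corpora are valuable as a reference.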
Abstract
Text simplification is an intralingual translation task in which documents, or sentences of a complex source text are simplified for a target audience. The success of automatic text simplification systems is highly dependent on the quality of parallel data used for training and evaluation. To advance sentence simplification and document simplification in German, this paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE" or in German: "Einfache Sprache"). DEplain consists of a news domain (approx. 500 document pairs, approx. 13k sentence pairs) and a web-domain corpus (approx. 150 aligned documents, approx. 2k aligned sentence pairs). In addition, we are building a web harvester and experimenting with automatic alignment methods to facilitate the integration of non-aligned and to be published parallel…
Code & Models
- 🤗 DEplain/trimmed_longmbart_docs_apa (model) · 4 downloads
- 🤗 DEplain/trimmed_mbart_sents_apa (model) · 3 downloads
- 🤗 DEplain/trimmed_mbart_sents_apa_web (model) · 4 downloads · ♡ 1
- 🤗 vera-8/mT5-large-trimmed_deplain-apa (model)
- 🤗 vera-8/mT5-xl-trimmed_deplain-apa (model)
- 🤗 vera-8/mT5-small-VT-span-mlm_deplain-apa (model)
- 🤗 vera-8/mT5-base-trimmed_deplain-apa (model)
- 🤗 vera-8/mT5-small-VT_deplain-apa (model)
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
Methods: mBART · Longformer · Sequence to Sequence
