TL;DR
This paper introduces DETECT, a German-specific metric for evaluating automatic text simplification (ATS). It is trained entirely on synthetic LLM responses and aligns closely with human judgments of simplicity, meaning preservation, and fluency.
Contribution
DETECT is presented as the first German-specific metric to evaluate ATS quality holistically across simplicity, meaning preservation, and fluency. It adapts the LENS framework to German, is trained on synthetic LLM responses, and is validated on the largest German human evaluation dataset.
Findings
DETECT correlates more strongly with human judgments than existing metrics.
Synthetic data generation enables large-scale dataset creation without human annotation.
The approach reveals strengths and limitations of LLMs in automatic evaluation.
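The first finding rests on measuring how well a metric's scores track human ratings. A minimal sketch of that kind of check follows, assuming Pearson correlation as the statistic; this summary does not say which correlation measure the paper uses. The function name `pearson` and all numbers are hypothetical and only illustrative, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy, made-up values for illustration only (assumption, not paper data):
# one human rating and one metric score per simplified sentence.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
r = pearson(human_ratings, metric_scores)
print(f"Pearson r = {r:.3f}")
```

A metric whose scores yield a higher correlation with the human ratings than a competing metric on the same items is said to align better with human judgment. Rank-based statistics such as Kendall's tau are a common alternative when only the ordering of systems matters.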
Abstract
Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement…