Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Eduard Barbu; Meeri-Ly Muru; Sten Marcus Malva

arXiv:2501.15624 · cs.CL · January 26, 2026

Open Access

TL;DR

This paper fine-tunes LLaMA on a custom Estonian text simplification dataset and finds that it outperforms an OpenNMT baseline in grammaticality, readability, and meaning preservation, a result relevant to other low-resource languages.

Contribution

The paper builds a new dataset that combines manually translated corpora with GPT-4.0-generated simplifications, fine-tunes LLaMA on it, and shows that the fine-tuned model outperforms OpenNMT, chosen as a representative NMT system, on Estonian text simplification.

Findings

01 The fine-tuned LLaMA model outperforms OpenNMT in grammaticality, readability, and meaning preservation.

02 The dataset, fine-tuning scripts, and evaluation pipeline are publicly available in a supplementary package.

03 The results underscore the effectiveness of large language models for text simplification in low-resource language settings.

Abstract

This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.
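The dataset construction described in the abstract, where manually produced pairs are merged with GPT-4.0-generated ones before fine-tuning, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the prompt template, the record fields, the deduplication rule, and the function names `build_dataset` and `to_jsonl` are all hypothetical, and the example sentences are placeholders rather than data from the paper.

```python
import json
import random

# Hypothetical prompt template; the paper's actual prompt is not given on this page.
PROMPT = "Simplify the following Estonian sentence:\n{complex}\n"


def build_dataset(manual_pairs, generated_pairs, seed=0):
    """Merge (complex, simple) pairs from two sources into instruction-style records."""
    seen = set()
    records = []
    for source, pairs in (("manual", manual_pairs), ("generated", generated_pairs)):
        for complex_text, simple_text in pairs:
            key = complex_text.strip()
            # Skip empty rows and duplicates; manual pairs are processed first,
            # so they take precedence over generated ones (an assumed rule).
            if not key or not simple_text.strip() or key in seen:
                continue
            seen.add(key)
            records.append({
                "prompt": PROMPT.format(complex=key),
                "completion": simple_text.strip(),
                "source": source,
            })
    # Shuffle deterministically so the two sources are interleaved.
    random.Random(seed).shuffle(records)
    return records


def to_jsonl(records):
    # ensure_ascii=False keeps Estonian characters (õ, ä, ö, ü) readable.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Placeholder pairs ("keeruline lause" = "complex sentence", "lihtne lause" = "simple sentence").
manual = [("keeruline lause A", "lihtne lause A")]
generated = [("keeruline lause A", "teine lihtsustus"), ("keeruline lause B", "lihtne lause B")]
print(to_jsonl(build_dataset(manual, generated)))
```

Keeping a `source` field on every record makes it possible to check later whether the manual or the generated portion of the data contributes more to the fine-tuned model's quality.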

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing