OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

Hannah Liu; Muxin Tian; Iqra Ali; Haonan Gao; Qiaoyiwen Wu; Blair Yang; Uthayasanker Thayasivam; En-Shiun Annie Lee; Pakawat Nakwijit; Surangika Ranathunga; Ravi Shekhar

arXiv:2603.14111 · cs.CL · March 17, 2026

TL;DR

OasisSimp introduces a multilingual sentence simplification dataset covering five languages, including three with no prior datasets, and evaluates LLM performance, revealing challenges in low-resource language simplification.

Contribution

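The paper presents a multilingual sentence simplification dataset covering five languages (English, Sinhala, Tamil, Pashto, and Thai), three of which (Thai, Pashto, and Tamil) had no prior sentence simplification dataset. It also benchmarks eight open-weight multilingual LLMs on the dataset and highlights their limitations.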

Findings

01. Significant performance gaps between high-resource and low-resource languages.

02. Current LLMs struggle with low-resource language simplification.

03. OasisSimp serves as a new benchmark for multilingual sentence simplification.

Abstract

Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe…

Taxonomy

Topics: Text Readability and Simplification · Artificial Intelligence in Healthcare and Education · Authorship Attribution and Profiling