Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations

Stefan Bott; Horacio Saggion; Nelson Pérez Rojas; Martin Solis Salazar; Saul Calderon Ramirez

arXiv:2404.07814·cs.CL·February 21, 2025·1 cites

TL;DR

This paper introduces new lexical simplification datasets for Catalan and Spanish, including scalar difficulty ratings, and assesses the data's appropriateness for the task and its ethical dimensions.

Contribution

It provides the first Catalan lexical simplification dataset and a Spanish dataset with scalar ratings, along with an analysis of data quality and ethical issues.

Findings

01

First Catalan lexical simplification dataset created.

02

Spanish dataset includes scalar ratings of difficulty.

03

Assessment of data quality and ethical considerations conducted.

Abstract

Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, it is the first dataset for Spanish that includes scalar ratings of how difficult lexical items are to understand. In addition, we present a detailed analysis that assesses the appropriateness and ethical dimensions of the data for the lexical simplification task.

Code & Models

Datasets

MLSP2024/MLSP2024
dataset· 12 dl

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques