ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
Kai North, Marcos Zampieri, Tharindu Ranasinghe

TL;DR
ALEXSIS-PT is a multi-candidate dataset for Brazilian Portuguese lexical simplification. It is built from real newspaper text, offers multiple substitution options per complex word, and supports model training and cross-lingual research.
Contribution
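It introduces a multi-candidate LS dataset for Brazilian Portuguese, the first such dataset to contain Brazilian newspaper articles. The dataset was compiled following the ALEXSIS protocol previously used for Spanish, which makes it directly usable for cross-lingual model development.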
Findings
BERTimbau outperformed the other models evaluated for substitute generation.
The dataset contains 9,605 candidate substitutions for 387 complex words, roughly 25 candidates per word on average.
ALEXSIS-PT enables cross-lingual LS research and model evaluation.
Abstract
Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely…
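To make the idea of "complex words in context along with their candidate substitutions" concrete, here is a minimal Python sketch of what one multi-candidate instance might look like and how a system's generated substitutes could be checked against it. The schema (`LSInstance` and its field names), the helper functions, and the example sentence are all hypothetical illustrations; they are not taken from ALEXSIS-PT or its release format, and the top-k check is a simplified stand-in rather than the evaluation used in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class LSInstance:
    """One multi-candidate LS instance (hypothetical schema, not the real file format)."""
    context: str            # sentence containing the complex word
    complex_word: str       # the word to be simplified
    candidates: List[str]   # annotator suggestions; duplicates are allowed


def ranked_gold(instance: LSInstance) -> List[str]:
    """Rank distinct gold candidates by how many annotators suggested each."""
    counts = Counter(c.lower() for c in instance.candidates)
    return [word for word, _ in counts.most_common()]


def top_k_hit(predictions: List[str], instance: LSInstance, k: int = 1) -> bool:
    """Return True if any of the first k predictions is a gold candidate."""
    gold = {c.lower() for c in instance.candidates}
    return any(p.lower() in gold for p in predictions[:k])


def substitute(instance: LSInstance, replacement: str) -> str:
    """Replace the first occurrence of the complex word in its context."""
    return instance.context.replace(instance.complex_word, replacement, 1)


# Invented example for illustration only; not an ALEXSIS-PT record.
example = LSInstance(
    context="O governo anunciou medidas para mitigar a crise.",
    complex_word="mitigar",
    candidates=["reduzir", "diminuir", "reduzir", "aliviar"],
)

if __name__ == "__main__":
    print(ranked_gold(example))
    print(top_k_hit(["atenuar", "reduzir"], example, k=2))
    print(substitute(example, "reduzir"))
```

Keeping duplicate suggestions in `candidates` preserves how often each substitute was proposed, which is what lets a multi-candidate dataset rank alternatives instead of recording a single replacement.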
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques
Methods: Test · XLM-R · mBERT
