Japanese Lexical Complexity for Non-Native Readers: A New Dataset

Yusuke Ide; Masato Mita; Adam Nohejl; Hiroki Ouchi; Taro Watanabe

arXiv:2306.17399·cs.CL·July 3, 2023

TL;DR

This paper introduces the first Japanese lexical complexity prediction (LCP) dataset, intended to support simplifying or annotating complex words for non-native readers, and reports a baseline experiment in which a BERT-based system proves effective on the task.

Contribution

The paper constructs a Japanese LCP dataset with separate complexity scores for Chinese/Korean annotators and for all other annotators, and evaluates a BERT-based system as a baseline for lexical complexity prediction.

Findings

01

BERT-based system effectively predicts Japanese lexical complexity.

02

Dataset includes L1-specific complexity scores.

03

First Japanese LCP dataset developed.

Abstract

Lexical complexity prediction (LCP) is the task of predicting the complexity of words in a text on a continuous scale. It plays a vital role in simplifying or annotating complex words to assist readers. To study lexical complexity in Japanese, we construct the first Japanese LCP dataset. Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers' L1-specific needs. In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.
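To make the idea of separate, L1-specific scores on a continuous scale concrete, the following is a minimal sketch and not the authors' procedure. It assumes each annotator rating is already a number in [0, 1] and that a word's score for a group is the mean of that group's ratings; the rating scale, the language codes, the group labels `CK` and `non-CK`, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumption: annotators whose L1 is Chinese or Korean form one group,
# everyone else forms the other. The codes and labels are illustrative only.
CK_L1S = {"zh", "ko"}


def group_of(l1):
    """Map an annotator's L1 code to a hypothetical group label."""
    return "CK" if l1 in CK_L1S else "non-CK"


def complexity_scores(ratings):
    """Average ratings per (word, group).

    ratings: iterable of (word, annotator_l1, rating), with rating assumed
    to lie in [0, 1], where higher means more complex.
    """
    buckets = defaultdict(list)
    for word, l1, rating in ratings:
        buckets[(word, group_of(l1))].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up ratings for a single word, purely for illustration.
ratings = [
    ("薔薇", "zh", 0.0),
    ("薔薇", "ko", 0.5),
    ("薔薇", "en", 1.0),
    ("薔薇", "fr", 1.0),
]
scores = complexity_scores(ratings)
```

In this toy input the same word receives a lower score for the `CK` group than for the `non-CK` group, which is the kind of difference that separate scores are meant to capture.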

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling