User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

Shohei Higashiyama; Masao Utiyama; Taro Watanabe; Eiichiro Sumita

arXiv:2104.03523 · cs.CL · April 9, 2021

TL;DR

This paper introduces a new Japanese user-generated text corpus for evaluating morphological analysis and lexical normalization, highlighting the challenges faced by current methods on non-standard language forms.

Contribution

The authors created a publicly available annotated corpus for Japanese UGT, providing a benchmark for future research in morphological analysis and normalization tasks.

Findings

1. Existing MA/LN methods perform poorly on non-general words and non-standard forms.
2. The corpus reveals significant challenges in analyzing UGT.
3. Benchmark results indicate room for improvement in current systems.

Abstract

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.
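As an illustration of how MA output could be scored against gold annotations, the following is a minimal sketch of span-based word segmentation precision, recall, and F1. It is an assumption about a common evaluation style, not the authors' evaluation script; the function names `word_spans` and `segmentation_prf` and the example sentence are hypothetical and not taken from the corpus.

```python
def word_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) over exactly matching word spans.

    Hypothetical helper: a predicted word counts as correct only if both
    of its boundaries match a gold word.
    """
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: a colloquial form that a system over-segments.
gold = ["すごい", "ねー"]
pred = ["すご", "い", "ねー"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here only the span for "ねー" matches, so one of three predicted words and one of two gold words are counted as correct. Scoring normalization would additionally require comparing predicted and gold standard forms, which this sketch does not cover.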

Code & Models

Repositories

shigashiyama/jlexnorm (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification