fugashi, a Tool for Tokenizing Japanese in Python

Paul McCann

arXiv:2010.06858·cs.CL·October 15, 2020

TL;DR

This paper introduces fugashi, a Python wrapper for MeCab, simplifying Japanese tokenization for NLP projects and addressing usability and documentation issues of existing tools.

Contribution

The paper presents fugashi, a new Python tool that makes Japanese tokenization easier and more accessible for NLP applications.

Findings

01. fugashi simplifies the process of tokenizing Japanese.

02. It is easier to use than existing tools.

03. It offers better documentation and easier integration with Python NLP projects.

Abstract

Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.
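
To make the "written without spaces" point concrete, here is a minimal sketch that enumerates every way a short string can be split into known words. It is an illustration only: the function name `segmentations` and the five-entry vocabulary are assumptions made for this example, and it does not reflect how MeCab or fugashi work internally. A real tokenizer uses a far larger dictionary and a scoring model to pick one candidate.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into words drawn from `vocab`."""
    if not text:
        # An empty string has exactly one segmentation: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this prefix and recurse on whatever follows it.
            for rest in segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy vocabulary, chosen only to show the ambiguity.
VOCAB = {"東", "京", "都", "東京", "京都"}

for candidate in segmentations("東京都", VOCAB):
    print("/".join(candidate))
```

Even with five vocabulary entries, the three-character string 東京都 has three valid splits (東/京/都, 東/京都, 東京/都), so a tokenizer needs some way to decide which one is intended.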

Code & Models

Repositories

polm/fugashi (official)
