TL;DR
This paper introduces fugashi, a Python wrapper for MeCab that simplifies Japanese tokenization for NLP projects and addresses the usability and documentation problems of existing tools.
Contribution
The paper presents fugashi, a new Python tool that makes Japanese tokenization easier and more accessible for NLP applications.
Findings
fugashi simplifies the Japanese tokenization process.
It is easier to use than existing tokenizers, which the paper describes as hard to use.
It provides better documentation and integration; existing tools lack English documentation.
Abstract
Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.
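As a rough illustration of the tokenization workflow the abstract describes, here is a minimal sketch. It assumes fugashi and a dictionary package such as unidic-lite are installed (for example via `pip install fugashi unidic-lite`). The helper names `make_tagger` and `surfaces` are hypothetical and not part of fugashi; only `Tagger`, calling the tagger on a string, and the `surface` attribute come from the library.

```python
def make_tagger():
    """Build a fugashi Tagger using whichever dictionary is installed."""
    # Imported lazily so this module can be loaded even without fugashi.
    from fugashi import Tagger
    return Tagger()


def surfaces(text, tagger=None):
    """Return the surface form (the text as written) of each token.

    `tagger` is any callable that maps a string to a sequence of objects
    with a `surface` attribute; a fugashi Tagger satisfies this.
    """
    if tagger is None:
        tagger = make_tagger()
    return [word.surface for word in tagger(text)]


if __name__ == "__main__":
    # Japanese has no spaces between words, so the tagger finds the boundaries.
    for token in surfaces("麩菓子は、麩を主材料とした日本の菓子。"):
        print(token)
```

Accepting the tagger as a parameter lets one `Tagger` instance be reused across many calls, which avoids reloading the dictionary each time.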
