On the Definition of Japanese Word
Yugo Murawaki

TL;DR
This paper examines how words are defined in Japanese for Universal Dependencies (UD), arguing that the Short Unit Words currently used as annotation units are not syntactic words in the sense of the UD guidelines, and discussing the costs and benefits of adopting linguistically motivated alternatives for corpus annotation.
Contribution
It critically analyzes the concept of syntactic words in Japanese for dependency parsing and evaluates the use of Short Unit Words versus traditional bunsetsu units.
Findings
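Current UD Japanese treebanks use Short Unit Words, which the paper argues are not syntactic words as specified by the UD guidelines.
Phrasal units called bunsetsu have a long tradition in Japanese dependency parsing, and the UD treebanks depart from it.
Linguistically motivated definitions of Japanese words exist but are non-mainstream and have never been applied to corpus annotation, so adopting them carries both costs and benefits.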
Abstract
The annotation guidelines for Universal Dependencies (UD) stipulate that the basic units of dependency annotation are syntactic words, but it is not clear what counts as a syntactic word in Japanese. Departing from the long tradition of using phrasal units called bunsetsu for dependency parsing, the current UD Japanese treebanks adopt Short Unit Words. However, we argue that these are not syntactic words as specified by the annotation guidelines. Although we find non-mainstream attempts to define Japanese words linguistically, such definitions have never been applied to corpus annotation. We discuss the costs and benefits of adopting these rather unfamiliar criteria.
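To make the difference in granularity between the two units concrete, the following is a minimal sketch, not the paper's method. The sentence 太郎は本を読んだ ("Taro read a book") is segmented by hand into short units, and the `is_function` flags, the `SUW_TOKENS` list, and the `group_into_bunsetsu` helper are all hypothetical names introduced for illustration. The grouping rule (attach each function word to the preceding content word) is a toy heuristic; real bunsetsu chunking has to handle compounds, prefixes, and other cases this ignores.

```python
# Hand-segmented toy data (an assumption for illustration, not treebank output).
# Each token is (surface form, is_function_word).
SUW_TOKENS = [
    ("太郎", False),  # noun "Taro"
    ("は", True),     # topic particle
    ("本", False),    # noun "book"
    ("を", True),     # accusative particle
    ("読ん", False),  # verb stem of "read"
    ("だ", True),     # past-tense auxiliary
]


def group_into_bunsetsu(tokens):
    """Toy heuristic: attach each function word to the preceding chunk."""
    chunks = []
    for surface, is_function in tokens:
        if is_function and chunks:
            chunks[-1].append(surface)  # extend the current phrasal unit
        else:
            chunks.append([surface])    # a content word opens a new unit
    return ["".join(chunk) for chunk in chunks]


bunsetsu = group_into_bunsetsu(SUW_TOKENS)
print(len(SUW_TOKENS), "short units:", [s for s, _ in SUW_TOKENS])
print(len(bunsetsu), "bunsetsu:", bunsetsu)
```

Under these assumptions, the same sentence yields six dependency nodes at the short-unit level but only three at the bunsetsu level, which is why the choice of unit changes the shape of the resulting dependency tree.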
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
