Segmentation-free Compositional $n$-gram Embedding
Geewook Kim, Kazuki Fukui, Hidetoshi Shimodaira

TL;DR
This paper introduces a segmentation-free embedding method that models character n-grams directly from raw, unsegmented text, effectively handling noisy corpora in languages like Chinese and Japanese.
Contribution
It presents a novel segmentation-free approach to learn embeddings for all character n-grams without relying on word boundaries or annotated resources.
Findings
Effective on noisy, unsegmented corpora
Outperforms segmentation-based methods on benchmarks
Applicable to Chinese and Japanese text
Abstract
We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation or any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free), and construct representations for all character $n$-grams in a raw corpus with embeddings of compositional sub-$n$-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.
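The core idea in the abstract, enumerating every character $n$-gram in raw text and building its representation from the embeddings of its sub-$n$-grams, can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the abstract: composition is done by summing sub-$n$-gram vectors, the embeddings are random placeholders instead of trained parameters, and the names `char_ngrams`, `make_table`, `compose`, `DIM` and `n_max` are hypothetical.

```python
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def char_ngrams(text, n_max):
    """Return every contiguous character n-gram (1 <= n <= n_max) of text.

    No word boundaries are used: all start positions and lengths are kept.
    """
    out = []
    for i in range(len(text)):
        for n in range(1, n_max + 1):
            if i + n <= len(text):
                out.append(text[i:i + n])
    return out


def make_table(corpus, n_max, seed=0):
    """Assign a placeholder vector to each distinct n-gram in the corpus.

    Assumption: a real model would learn these vectors; here they are random.
    """
    rng = random.Random(seed)
    table = {}
    for gram in char_ngrams(corpus, n_max):
        if gram not in table:
            table[gram] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return table


def compose(ngram, table, n_max):
    """Represent ngram as the sum of the vectors of its sub-n-grams.

    Assumption: summation is used as the composition function. Sub-n-grams
    missing from the table are skipped; repeated ones count once per occurrence.
    """
    vec = [0.0] * DIM
    for sub in char_ngrams(ngram, n_max):
        z = table.get(sub)
        if z is not None:
            vec = [a + b for a, b in zip(vec, z)]
    return vec


corpus = "東京都に住む"  # raw, unsegmented text
table = make_table(corpus, n_max=3)
phrase_vec = compose("東京都", table, n_max=3)
```

Because `compose` only needs the sub-$n$-gram table, the same function can be applied to a word, a phrase, or a whole sentence without first deciding where the word boundaries are.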
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
