Segmentation-free Compositional $n$-gram Embedding

Geewook Kim; Kazuki Fukui; Hidetoshi Shimodaira

arXiv:1809.00918·cs.CL·May 30, 2019·1 cites

TL;DR

This paper introduces a segmentation-free embedding method that models character n-grams directly from raw, unsegmented text, effectively handling noisy corpora in languages like Chinese and Japanese.

Contribution

It presents a novel segmentation-free approach to learn embeddings for all character n-grams without relying on word boundaries or annotated resources.

Findings

01 Effective on noisy, unsegmented corpora

02 Outperforms segmentation-based methods on benchmarks

03 Applicable to Chinese and Japanese text

Abstract

We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation or any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free), and construct representations for all character $n$-grams in a raw corpus with embeddings of compositional sub-$n$-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.
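To make the core idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It enumerates every character $n$-gram of a raw string without consulting word boundaries, then builds a representation for an $n$-gram from the embeddings of its sub-$n$-grams. The abstract does not spell out the composition rule, so summation is an assumption here. The helper names (`char_ngrams`, `sub_ngrams`, `init_table`, `compose`), the maximum length `n_max`, and the random initialization are also illustrative assumptions. Training the embeddings is out of scope for this sketch.

```python
import random
from typing import Dict, List


def char_ngrams(text: str, n_max: int) -> List[str]:
    """Return every contiguous character n-gram of length 1..n_max.

    No segmentation is applied: n-grams may span what a tokenizer
    would treat as a word boundary.
    """
    return [
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def sub_ngrams(ngram: str) -> List[str]:
    """Return all contiguous substrings of an n-gram, including itself.

    A substring that occurs twice (e.g. "a" in "aa") is listed twice.
    """
    return char_ngrams(ngram, len(ngram))


def init_table(corpus: str, n_max: int, dim: int, seed: int = 0) -> Dict[str, List[float]]:
    """Assign a random vector to each distinct n-gram in the corpus (hypothetical init)."""
    rng = random.Random(seed)
    table: Dict[str, List[float]] = {}
    for gram in char_ngrams(corpus, n_max):
        if gram not in table:
            table[gram] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    return table


def compose(ngram: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Represent an n-gram as the sum of its sub-n-gram embeddings.

    Summation is an assumed composition rule; sub-n-grams missing
    from the table are skipped.
    """
    vec = [0.0] * dim
    for sub in sub_ngrams(ngram):
        emb = table.get(sub)
        if emb is not None:
            vec = [v + e for v, e in zip(vec, emb)]
    return vec
```

For example, `compose("東京都", init_table("東京都に住む", n_max=3, dim=8), dim=8)` returns an 8-dimensional vector built from the embeddings of 東, 京, 都, 東京, 京都, and 東京都, with no dictionary or segmenter involved.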

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies