Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure

Yifan Zhang; Maohua Wang; Yongjian Huang; Qianrong Gu

arXiv:2007.02342·cs.CL·July 8, 2020

TL;DR

This paper introduces an unsupervised association measure called PATI to improve segmentation-free Chinese word embedding by filtering noisy n-grams, leading to better embedding quality and downstream task performance.

Contribution

It proposes a novel PATI measure for n-gram selection, enhancing segmentation-free embedding quality over traditional frequency and PMI methods.

Findings

01. PATI outperforms frequency and PMI in selecting meaningful n-grams.
02. Improved embeddings enhance downstream task accuracy.
03. Model shows robustness on Chinese SNS data.

Abstract

Recent work on segmentation-free word embedding (sembei) developed a new word-embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, the embedding vocabulary contains too many noisy n-grams whose characters are only weakly associated, which limits the quality of the learned word embeddings. To address this problem, a new version of the segmentation-free word embedding model is proposed that collects its n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). Compared with commonly used n-gram filtering methods, such as the frequency used in sembei and pointwise mutual information (PMI), the proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams that have stronger cohesion as embedding targets in…
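
To make the two baseline filters named in the abstract concrete, the following is a minimal Python sketch of selecting character n-grams by raw frequency and by PMI. It is not the paper's implementation and does not implement PATI, whose definition is not given in this excerpt. The function names (`count_ngrams`, `pmi`, `top_by_frequency`, `top_by_pmi`) are hypothetical, and scoring n-grams longer than two characters by the minimum PMI over all binary splits is an assumption made here for illustration.

```python
import math
from collections import Counter


def count_ngrams(corpus, max_n):
    """Count every character n-gram of length 1..max_n in each line."""
    counts = Counter()
    for line in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def pmi(ngram, counts, totals):
    """Minimum PMI over all binary splits of the n-gram.

    Taking the minimum over splits is an assumed extension of
    bigram PMI to longer n-grams, not a definition from the paper.
    """
    p_whole = counts[ngram] / totals[len(ngram)]
    scores = []
    for k in range(1, len(ngram)):
        left, right = ngram[:k], ngram[k:]
        p_left = counts[left] / totals[len(left)]
        p_right = counts[right] / totals[len(right)]
        scores.append(math.log(p_whole / (p_left * p_right)))
    return min(scores)


def top_by_frequency(counts, k, min_n=2):
    """Keep the k most frequent n-grams of length >= min_n."""
    cands = [(g, c) for g, c in counts.items() if len(g) >= min_n]
    return [g for g, _ in sorted(cands, key=lambda x: (-x[1], x[0]))[:k]]


def top_by_pmi(counts, k, min_n=2):
    """Keep the k n-grams of length >= min_n with the highest PMI."""
    totals = Counter()  # total occurrences per n-gram length
    for g, c in counts.items():
        totals[len(g)] += c
    cands = [g for g in counts if len(g) >= min_n]
    return sorted(cands, key=lambda g: (-pmi(g, counts, totals), g))[:k]


if __name__ == "__main__":
    toy_corpus = ["abab", "abc"]  # toy data, not from the paper
    counts = count_ngrams(toy_corpus, max_n=2)
    print(top_by_frequency(counts, k=2))
    print(top_by_pmi(counts, k=2))
```

Frequency favors n-grams that are merely common, while PMI favors n-grams whose parts co-occur more often than chance would predict; the abstract positions PATI as an alternative to both.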

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis