Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure
Yifan Zhang, Maohua Wang, Yongjian Huang, Qianrong Gu

TL;DR
This paper introduces an unsupervised association measure called PATI to improve segmentation-free Chinese word embedding by filtering noisy n-grams, leading to better embedding quality and downstream task performance.
Contribution
It proposes a novel PATI measure for n-gram selection, enhancing segmentation-free embedding quality over traditional frequency and PMI methods.
Findings
PATI outperforms frequency and PMI in selecting meaningful n-grams.
Improved embeddings enhance downstream task accuracy.
Model shows robustness on Chinese SNS data.
Abstract
Recent work on segmentation-free word embedding (sembei) developed a new word embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, the embedding vocabulary contains too many noisy n-grams whose characters lack strong association strength, which limits the quality of the learned word embeddings. To address this problem, a new version of the segmentation-free word embedding model is proposed that collects its n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). Compared with commonly used n-gram filtering methods, such as the frequency criterion used in sembei and pointwise mutual information (PMI), the proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams that have stronger cohesion as embedding targets in…
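The PMI baseline mentioned in the abstract can be illustrated with a minimal sketch. It scores adjacent character pairs in raw, unsegmented text and keeps those above a threshold. The PATI formula is not given in this summary, so it is not reproduced here; the function names `bigram_pmi` and `filter_ngrams`, the restriction to bigrams, and the threshold value are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def bigram_pmi(corpus):
    """Return {bigram: PMI} for adjacent character pairs.

    corpus is an iterable of unsegmented strings. PMI is computed as
    log(p(xy) / (p(x) * p(y))) with maximum-likelihood estimates.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        bigram_counts.update(
            sentence[i:i + 2] for i in range(len(sentence) - 1)
        )

    n_chars = sum(char_counts.values())
    n_bigrams = sum(bigram_counts.values())

    scores = {}
    for bigram, count in bigram_counts.items():
        p_xy = count / n_bigrams
        p_x = char_counts[bigram[0]] / n_chars
        p_y = char_counts[bigram[1]] / n_chars
        scores[bigram] = math.log(p_xy / (p_x * p_y))
    return scores


def filter_ngrams(scores, threshold):
    """Keep only the n-grams whose score is at least threshold (hypothetical cutoff)."""
    return {ngram for ngram, score in scores.items() if score >= threshold}


if __name__ == "__main__":
    toy_corpus = ["abab", "xaby"]
    pmi = bigram_pmi(toy_corpus)
    for ngram in sorted(pmi, key=pmi.get, reverse=True):
        print(ngram, round(pmi[ngram], 3))
```

On this toy corpus the pair `xa`, seen once, scores about as high as `ab`, seen three times. This reflects the known tendency of PMI to favor rare pairs, which is one reason a plain PMI or frequency cutoff can admit noisy n-grams.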
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
