More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Jin Cheevaprawatdomrong; Alexandra Schofield; Attapol T. Rutherford

arXiv:2108.10755·cs.CL·August 25, 2021·1 cites

TL;DR

This paper investigates collocation tokenization based on Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) as input to LDA topic models for languages without marked word boundaries, such as Chinese and Thai. Merging words into collocation tokens yields clearer and more coherent topics.

Contribution

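It applies chi-squared, t-statistic, and WPE tokenizers, trained on Wikipedia text, to group words such as compound nouns, proper nouns, and complex event verbs before LDA training on Chinese and Thai, and it proposes a new metric for measuring clustering quality when the models' vocabularies differ.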

Findings

01

Merged tokens lead to clearer, more coherent topics

02

Statistical and encoding methods improve tokenization quality

03

Enhanced topic distinction in non-segmented languages

Abstract

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The Chi-squared, t, and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more…
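The statistical tests named in the abstract can be illustrated with a minimal sketch. It assumes the text has already been split into word tokens by some segmenter, scores each adjacent pair with the textbook 2x2 Pearson chi-squared statistic and the common t-statistic approximation for collocations, and merges pairs whose score reaches a threshold. The function names `bigram_scores` and `merge_collocations`, the `_` separator, the greedy left-to-right merge, and the threshold value are all hypothetical choices for illustration; the paper's actual training procedure and WPE are not reproduced here.

```python
from collections import Counter
from math import sqrt


def bigram_scores(tokens):
    """Return {(w1, w2): (chi2, t)} for every adjacent pair in `tokens`."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)  # total number of bigram positions
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)   # w1 in first position
    second_counts = Counter(w2 for _, w2 in pairs)  # w2 in second position

    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        c1, c2 = first_counts[w1], second_counts[w2]
        # Observed cells of the 2x2 contingency table.
        o12 = c1 - o11            # w1 followed by something other than w2
        o21 = c2 - o11            # w2 preceded by something other than w1
        o22 = n - c1 - c2 + o11   # neither w1 first nor w2 second
        # The four marginal totals are c1, c2, n - c1, and n - c2.
        denom = c1 * c2 * (n - c1) * (n - c2)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        # t-statistic with the usual approximation s^2 ~ observed mean.
        t = (o11 - c1 * c2 / n) / sqrt(o11)
        scores[(w1, w2)] = (chi2, t)
    return scores


def merge_collocations(tokens, pair_score, threshold, sep="_"):
    """Greedily merge adjacent pairs whose score is >= threshold (illustrative)."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair_score.get(pair, float("-inf")) >= threshold:
            merged.append(sep.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["new", "york", "a", "b", "new", "york", "c", "d"]
scores = bigram_scores(tokens)
t_scores = {pair: s[1] for pair, s in scores.items()}
print(merge_collocations(tokens, t_scores, threshold=1.0))
```

On a toy input this small, the chi-squared score does not separate a repeated pair from a pair seen once, which is why the usage lines threshold on the t-statistic instead; on a realistic corpus either score, usually with a minimum-count filter, could be used to decide which pairs become single tokens for LDA.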

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: Latent Dirichlet Allocation