Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling
Leonard K. M. Poon, Nevin L. Zhang, Haoran Xie, Gary Cheng

TL;DR
This paper adds a collocation extraction and selection step that runs before hierarchical latent tree analysis (HLTA) for topic modeling. Selected collocations are replaced with single tokens so that the tree-structured model can represent multiword expressions, which improved HLTA's performance on three of the four data sets tested.
Contribution
The paper proposes a collocation extraction and replacement method as a preprocessing step for HLTA, addressing its limitation in representing multiword expressions.
Findings
Improved HLTA performance on three out of four datasets
Effective collocation extraction and replacement method
Enhanced representation of multiword expressions in topic models
Abstract
Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) has been recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot represent the different meanings of multiword expressions sharing the same word appropriately. Therefore, we propose a method for extracting and selecting collocations as a preprocessing step for HLTA. The selected collocations are replaced with single tokens in the bag-of-words model before running HLTA. Our empirical evaluation shows that the proposed method led to better performance of HLTA on three of the four data sets tested.
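The preprocessing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how collocations are scored or selected, so this sketch assumes adjacent word pairs ranked by pointwise mutual information (PMI) with a frequency cutoff. The function names, the thresholds, and the underscore used to join words are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def find_collocations(docs, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs with count >= min_count and PMI >= min_pmi.

    Assumption: PMI over adjacent pairs stands in for the paper's
    (unspecified here) collocation selection criterion.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    selected = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        if math.log2(p_pair / p_independent) >= min_pmi:
            selected.add((w1, w2))
    return selected


def replace_collocations(tokens, collocations):
    """Merge each selected adjacent pair into one token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            # Hypothetical convention: join the two words with an underscore.
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bag_of_words(tokens):
    """Count token occurrences; this is the representation passed on to HLTA."""
    return Counter(tokens)
```

Given a small corpus in which "neural network" and "social network" each occur twice, both pairs would be selected and merged, so the shared word "network" no longer has to carry both meanings in a single variable of the tree. The resulting bag-of-words counts would then be handed to HLTA unchanged.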
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Data Mining Algorithms and Applications
