A Topological Approach to Compare Document Semantics Based on a New Variant of Syntactic N-grams
Fanchao Meng

TL;DR
This paper introduces a topological method using a new variant of syntactic n-grams called generalized phrases (GPs) to improve document semantic comparison, outperforming existing embedding-based techniques.
Contribution
The paper proposes a novel variant of syntactic n-grams (GPs) and a topological approach (DSCoH) for better document semantic similarity measurement.
Findings
DSCoH outperforms state-of-the-art embedding methods in experiments.
GPs address key issues of traditional sn-grams, such as lack of significance and sensitivity to word order.
The approach is effective in document clustering tasks.
Abstract
This paper offers a new perspective on thinking about and using syntactic n-grams (sn-grams). Sn-grams are a type of non-linear n-gram that has played a critical role in many NLP tasks. Applying sn-grams to the comparison of document semantics is therefore appealing, yet few studies have reported progress on it. While pursuing this application, we found three major issues with sn-grams: lack of significance, sensitivity to word order, and failure to capture indirect syntactic relations. To address these issues, we propose a new variant of sn-grams named generalized phrases (GPs). Based on GPs, we then propose a topological approach, named DSCoH, to compute document semantic similarities. DSCoH has been extensively tested on the document semantics comparison and the document clustering tasks. The experimental results show that DSCoH can outperform…
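To make the "non-linear" nature of sn-grams concrete, here is a minimal sketch of ordinary sn-grams in the usual sense: word sequences obtained by following head-to-dependent arcs in a dependency tree rather than surface order. It does not implement GPs or DSCoH, whose construction is not described in this summary. The sentence, the hand-written `heads` array, and the function names `linear_ngrams` and `sn_grams` are all assumptions made for illustration.

```python
from typing import List, Tuple


def linear_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Ordinary n-grams: windows of n adjacent tokens in surface order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sn_grams(tokens: List[str], heads: List[int], n: int) -> List[Tuple[str, ...]]:
    """Syntactic n-grams: paths of n tokens along head -> dependent arcs.

    heads[i] is the index of token i's head, or -1 for the root.
    (Hypothetical input format; a real parser would supply this.)
    """
    # Build a head -> dependents adjacency list from the head array.
    children = {i: [] for i in range(len(tokens))}
    for dep, head in enumerate(heads):
        if head >= 0:
            children[head].append(dep)

    def paths(node: int, length: int) -> List[List[int]]:
        # All downward paths containing exactly `length` nodes, starting at `node`.
        if length == 1:
            return [[node]]
        return [[node] + rest
                for child in children[node]
                for rest in paths(child, length - 1)]

    return [tuple(tokens[i] for i in path)
            for start in range(len(tokens))
            for path in paths(start, n)]


# Hand-annotated toy parse (an assumption, not parser output):
# "chased" is the root; "cat" and "mouse" depend on it.
tokens = ["the", "cat", "chased", "a", "small", "mouse"]
heads = [1, 2, -1, 5, 5, 2]

bigrams = linear_ngrams(tokens, 2)
sn_bigrams = sn_grams(tokens, heads, 2)
sn_trigrams = sn_grams(tokens, heads, 3)
```

In this toy parse, the pair ("chased", "mouse") is an sn-gram even though the two words are not adjacent in the sentence, while the adjacent pair ("chased", "a") is a linear bigram but not an sn-gram because no dependency arc links them.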
Taxonomy
Topics: Advanced Graph Neural Networks · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Greedy Policy Search
