Improving Unsupervised Constituency Parsing via Maximizing Semantic Information
Junjie Chen, Xiangheng He, Yusuke Miyao, Danushka Bollegala

TL;DR
This paper proposes a new training objective for unsupervised constituency parsers that maximizes semantic information, leading to significant improvements in parsing accuracy across multiple languages and models.
Contribution
It introduces SemInfo, a semantic information-based objective, and demonstrates its effectiveness in enhancing unsupervised constituency parsing performance.
Findings
SemInfo correlates more strongly with parsing accuracy than log-likelihood
Improves parsing accuracy by an average of 7.85 F1 points across models and languages
Attains state-of-the-art results in three out of four languages
Abstract
Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective that trains parsers by maximizing SemInfo, the semantic information encoded in constituent structures. We introduce a bag-of-substrings model to represent the semantics and estimate the SemInfo value using the probability-weighted information metric. We apply the SemInfo maximization objective to training Probabilistic Context-Free Grammar (PCFG) parsers and develop a Tree Conditional Random Field (TreeCRF)-based…
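The abstract names a bag-of-substrings model but the excerpt above is cut off before giving details, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes a "substring" is a contiguous token span, and it uses a TF-IDF-style weight as a stand-in for the probability-weighted information metric, whose exact form is not stated here. The function names `bag_of_substrings` and `substring_weights` and the smoothing in the weight are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bag_of_substrings(tokens, max_len=None):
    """Count every contiguous token span (substring) of a sentence.

    Assumption: "substring" means a contiguous span of tokens.
    """
    n = len(tokens)
    max_len = n if max_len is None else max_len
    bag = Counter()
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            bag[tuple(tokens[start:end])] += 1
    return bag


def substring_weights(target, corpus):
    """Give each substring of `target` a TF-IDF-style weight.

    Illustrative stand-in only: the paper's actual estimator is not
    specified in this summary. Substrings that occur in every corpus
    sentence receive weight 0; rarer ones receive a positive weight.
    """
    corpus_bags = [bag_of_substrings(sentence) for sentence in corpus]
    target_bag = bag_of_substrings(target)
    total = sum(target_bag.values())
    weights = {}
    for sub, count in target_bag.items():
        doc_freq = sum(1 for bag in corpus_bags if sub in bag)
        term_freq = count / total
        inv_doc_freq = math.log((1 + len(corpus_bags)) / (1 + doc_freq))
        weights[sub] = term_freq * inv_doc_freq
    return weights


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
weights = substring_weights(corpus[0], corpus)
# Spans shared by every sentence ("the", "sat") carry no weight in this
# sketch, while spans specific to the target ("cat", "the cat") do.
```

Under these assumptions, spans that distinguish one sentence from the rest of the corpus score higher, which is one loose way to read "semantic information encoded in constituent structures": a parser would be rewarded for choosing such spans as constituents.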
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
