Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

Junjie Chen; Xiangheng He; Yusuke Miyao; Danushka Bollegala

arXiv:2410.02558·cs.CL·April 7, 2025

TL;DR

This paper proposes a new training objective for unsupervised constituency parsers that maximizes semantic information, leading to significant improvements in parsing accuracy across multiple languages and models.

Contribution

It introduces SemInfo, a semantic information-based objective, and demonstrates its effectiveness in enhancing unsupervised constituency parsing performance.

Findings

01. SemInfo correlates more strongly with parsing accuracy than log-likelihood (LL) does.

02. Training with SemInfo improves F1 by an average of 7.85 points across models and languages.

03. The resulting parsers attain state-of-the-art results in three of the four languages evaluated.
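
For readers unfamiliar with the metric behind these findings: accuracy in constituency parsing is commonly reported as F1 over constituent spans. The sketch below is a minimal illustration, assuming unlabeled spans given as (start, end) token offsets. The function name `span_f1` and the span encoding are illustrative choices, not taken from the paper, whose exact evaluation protocol may differ.

```python
def span_f1(predicted, gold):
    """F1 between two collections of (start, end) constituent spans.

    Assumption: spans are unlabeled and compared by exact match.
    Returns a value in [0, 1]; papers usually report it multiplied by 100.
    """
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        # Also covers the case where either span set is empty.
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans for a three-token sentence.
predicted_spans = [(0, 3), (0, 2)]
gold_spans = [(0, 3), (1, 3)]
score = span_f1(predicted_spans, gold_spans)
```

Here one of the two predicted spans matches the gold tree, so precision and recall are both 0.5.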

Abstract

Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective that trains parsers by maximizing SemInfo, the semantic information encoded in constituent structures. We introduce a bag-of-substrings model to represent the semantics and estimate the SemInfo value using the probability-weighted information metric. We apply the SemInfo maximization objective to training Probabilistic Context-Free Grammar (PCFG) parsers and develop a Tree Conditional Random Field (TreeCRF)-based…
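
The abstract names a bag-of-substrings model but does not spell it out here. As a rough illustration of the general idea only, the sketch below counts every contiguous token span of a sentence. The function name `bag_of_substrings`, the whitespace tokenization, and the optional `max_len` cap are assumptions for this example, and the paper's actual model and its probability-weighted information estimate may differ.

```python
from collections import Counter


def bag_of_substrings(tokens, max_len=None):
    """Count every contiguous token span (substring) of a sentence.

    Assumption: a "substring" is a contiguous token sequence; max_len
    optionally caps the span length (hypothetical parameter).
    """
    n = len(tokens)
    limit = n if max_len is None else min(max_len, n)
    bag = Counter()
    for start in range(n):
        # end is exclusive, so spans have length 1..limit.
        for end in range(start + 1, min(start + limit, n) + 1):
            bag[tuple(tokens[start:end])] += 1
    return bag


bag = bag_of_substrings("the cat sat".split())
unigrams = bag_of_substrings("the cat sat".split(), max_len=1)
```

A sentence of n distinct tokens yields n(n+1)/2 substrings, so the three-token example produces six entries, each a candidate constituent.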

Code & Models

Repositories

junjiechen-chris/improving-unsupervised-constituency-parsing-via-maximizing-semantic-information (PyTorch, official)

Videos

Improving Unsupervised Constituency Parsing via Maximizing Semantic Information · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies