Ensemble Distillation for Unsupervised Constituency Parsing

Behzad Shayegh; Yanshuai Cao; Xiaodan Zhu; Jackie C.K. Cheung; Lili Mou

arXiv:2310.01717·cs.CL·April 29, 2024

TL;DR

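This paper introduces an ensemble-then-distill approach for unsupervised constituency parsing: it combines the outputs of multiple existing unsupervised parsers, which capture differing aspects of parsing structure, and distills the result into a single student model, all without linguistically annotated data.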

Contribution

It proposes a notion of tree averaging, a novel ensemble method built on it, and a distillation step that transfers the ensemble's knowledge into a student model to improve inference efficiency.

Findings

01 Outperforms previous methods in accuracy
02 Demonstrates robustness across different runs and domains
03 Mitigates over-smoothing in distillation

Abstract

We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture differing aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance. To this end, we propose a notion of "tree averaging," based on which we further propose a novel ensemble method for unsupervised parsing. To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigate the over-smoothing problem existing in common multi-teacher distilling methods. Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different…
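The over-smoothing problem mentioned in the abstract can be pictured with a small numeric sketch. It assumes that over-smoothing refers to a student being trained toward the plain average of disagreeing teacher distributions; the three-way candidate set, the probabilities, and the helper names `entropy` and `mean_distribution` are all made up for illustration and are not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mean_distribution(dists):
    """Element-wise average of several distributions over the same candidates."""
    k = len(dists)
    return [sum(d[i] for d in dists) / k for i in range(len(dists[0]))]


# Two hypothetical teachers, each confident about a different candidate structure.
teacher_a = [0.90, 0.05, 0.05]
teacher_b = [0.05, 0.90, 0.05]

# Averaging them gives a target that is much flatter than either teacher.
averaged = mean_distribution([teacher_a, teacher_b])

print("teacher entropy :", round(entropy(teacher_a), 3))
print("averaged entropy:", round(entropy(averaged), 3))
```

Under this reading, an ensemble-then-distill pipeline would first commit to a single ensemble output and only then train the student on it, so the student would not be fit to the flattened average.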

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 8 · accept, good paper · Confidence 2

Strengths

- The proposed method is a very simple way to combine multiple outputs from unsupervised parsing, and it might have an impact on other system combination methods, e.g., NER with CRF. The ensemble by hit-count is sound, and its merit is shown in the experiments, especially in comparison with MBR, which can consider only spans in the multiple system outputs.
- Experiments are well designed and the effect of the proposed method is proved empirically. This work also presents knowledge distilla

Weaknesses

- The comparison covers only English; evaluating the model on other languages, e.g., Chinese, would further strengthen this submission.

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 5

Strengths

The major contributions include:
1. A new notion of tree averaging and the corresponding search algorithm: CYK variant
2. Ensemble-then-distill approach that trains a student parser from an ensemble of teachers.
3. The inference time of the student model is 18x faster than the ensemble method.
4. A hypothesis that different unsupervised parsers capture different aspects of the language structures and the verification with experiments.
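The tree-averaging search named in the first contribution can be sketched as a CYK-style dynamic program. This is a minimal sketch under stated assumptions, not the authors' implementation: trees are unlabeled and binary, each tree is given as a set of `(start, end)` spans with an exclusive end, and `average_tree` is a hypothetical name. It scores a candidate by hit count, i.e. how many teachers contain each of its spans. For binary trees over the same sentence, every tree has the same number of spans, so maximizing the total hit count is equivalent to maximizing the total F1 against the teachers.

```python
from functools import lru_cache


def average_tree(teacher_spans, n):
    """Return (hit_count, spans) of a binary tree over n words that maximizes
    the number of span matches with the teacher trees.

    teacher_spans: list of sets of (i, j) spans, 0 <= i < j <= n, end exclusive.
    Single-word spans are trivial and are not scored.
    """

    def hits(i, j):
        # Number of teachers whose tree contains the span (i, j).
        return sum((i, j) in spans for spans in teacher_spans)

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:
            return 0, frozenset()
        top_score, top_spans = -1, frozenset()
        for k in range(i + 1, j):  # try every split point
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            if left_score + right_score > top_score:
                top_score = left_score + right_score
                top_spans = left_spans | right_spans
        return top_score + hits(i, j), top_spans | {(i, j)}

    return best(0, n)


# Three hypothetical teacher parses of a four-word sentence.
teachers = [
    {(0, 4), (0, 2), (2, 4)},  # ((w0 w1) (w2 w3))
    {(0, 4), (0, 3), (0, 2)},  # (((w0 w1) w2) w3)
    {(0, 4), (0, 3), (1, 3)},  # ((w0 (w1 w2)) w3)
]
score, spans = average_tree(teachers, 4)
print(score, sorted(spans))
```

Because the search ranges over all binary trees rather than only the teachers' outputs, the returned tree need not coincide with any single teacher, although in this small example it does.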

Weaknesses

1. Lack of clarification regarding the methodology design.
   - The averaging tree is derived with the highest total F1 score compared with different teachers. Have the authors tried other methods of calculating the similarities between trees? Perhaps a fair comparison is needed to further indicate the effectiveness of the proposed tree averaging method.
   - The authors did not provide a detailed explanation for choosing the seven unsupervised parsers introduced in Section 3.2 as teacher models. For
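The similarity measure the reviewer refers to, total F1 against the teachers, can be illustrated as follows. This is a sketch under assumptions: trees are unlabeled and represented as sets of `(start, end)` spans with single-word spans omitted, and the names `span_f1` and `total_f1` are hypothetical rather than taken from the paper or its repository.

```python
def span_f1(pred, ref):
    """F1 between two unlabeled trees given as sets of (start, end) spans."""
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def total_f1(candidate, teachers):
    """Sum of the candidate's F1 against every teacher tree."""
    return sum(span_f1(candidate, teacher) for teacher in teachers)


# Three hypothetical teacher parses of a four-word sentence.
teachers = [
    {(0, 4), (0, 2), (2, 4)},
    {(0, 4), (0, 3), (0, 2)},
    {(0, 4), (0, 3), (1, 3)},
]
candidate = {(0, 4), (0, 3), (0, 2)}
print(total_f1(candidate, teachers))
```

Swapping `span_f1` for another tree similarity would be one way to run the comparison the reviewer asks about, although a different measure may no longer decompose over spans in a way that suits a CYK-style search.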

Reviewer 03 · Rating 8 · accept, good paper · Confidence 4

Strengths

- The insight that different models are weakly correlated despite similar F1 is interesting and well motivates the approach
- The proposed dynamic program is intuitive and well explained
- Experiments are strong and thorough
- The analysis of gains from denoising vs. difference in expertise is well conducted

Weaknesses

- The fact that the F1 gains from distillation do not carry over to the out of domain setting is a drawback and somewhat underexplored
- There is a lack of qualitative analysis of the types of behaviors that different model types exhibit, and how ensembling actually combines those. Some of this is done in the Appendix but it would be nice to see specific examples in the main paper, especially since that analysis is wrt constituency labels which the model isn’t actually being evaluated on.

Code & Models

Repositories

manga-uofa/ed4ucp · Official

Videos

Ensemble Distillation for Unsupervised Constituency Parsing· slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications