Ensemble Distillation for Unsupervised Constituency Parsing
Behzad Shayegh, Yanshuai Cao, Xiaodan Zhu, Jackie C.K. Cheung, Lili Mou

TL;DR
This paper introduces an ensemble distillation approach for unsupervised constituency parsing, leveraging multiple models' diverse insights to improve accuracy and robustness without labeled data.
Contribution
It proposes a novel ensemble method based on tree averaging and a distillation process to enhance unsupervised parsing performance and efficiency.
Findings
Outperforms previous methods in accuracy
Demonstrates robustness across different runs and domains
Mitigates over-smoothing in distillation
Abstract
We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture differing aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance. To this end, we propose a notion of "tree averaging," based on which we further propose a novel ensemble method for unsupervised parsing. To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigate the over-smoothing problem existing in common multi-teacher distilling methods. Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different…
Peer Reviews
Decision: ICLR 2024 poster
- The proposed method is a very simple way to combine multiple outputs from unsupervised parsers, and it could influence other system-combination methods, e.g., NER with CRF. The hit-count ensemble is sound, and the experiments show its merit, especially in comparison with MBR, which can consider only spans in the multiple system outputs.
- Experiments are well designed, and the effect of the proposed method is demonstrated empirically. This work also presents knowledge distilla
- The evaluation covers only English; comparing the model on other languages, e.g., Chinese, would further strengthen this submission.
The major contributions include:
1. A new notion of tree averaging and the corresponding search algorithm, a CYK variant.
2. An ensemble-then-distill approach that trains a student parser from an ensemble of teachers.
3. A student model whose inference is 18x faster than that of the ensemble method.
4. A hypothesis that different unsupervised parsers capture different aspects of language structure, verified with experiments.
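The tree-averaging search in item 1 can be illustrated with a minimal sketch. It assumes every teacher outputs a binary tree, encoded here as a set of half-open (start, end) spans; under that assumption all trees over the same sentence have the same number of spans, so maximizing total F1 against the teachers reduces to maximizing the summed hit count of the chosen spans. The function name `average_tree`, the span encoding, and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
from collections import Counter
from functools import lru_cache


def average_tree(teacher_spans, n):
    """Return (score, tree) for the binary tree with the highest total hit count.

    teacher_spans: list of sets of half-open (start, end) spans, one set per teacher
                   (hypothetical encoding chosen for this sketch).
    n:             number of words in the sentence.
    The tree is a nested tuple whose leaves are word indices.
    """
    # hits[(i, j)] = number of teachers whose tree contains span [i, j)
    hits = Counter(span for spans in teacher_spans for span in spans)

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:
            return 0, i  # single word: leaf, contributes no score
        candidates = []
        for k in range(i + 1, j):  # try every split point, CYK style
            left_score, left_tree = best(i, k)
            right_score, right_tree = best(k, j)
            candidates.append((left_score + right_score, k, left_tree, right_tree))
        # Ties are broken toward the larger split point (arbitrary choice).
        score, _, left, right = max(candidates, key=lambda c: (c[0], c[1]))
        return hits[(i, j)] + score, (left, right)

    return best(0, n)


# Three hypothetical teachers parsing a 4-word sentence.
teachers = [
    {(0, 2), (2, 4), (0, 4)},  # ((w0 w1) (w2 w3))
    {(0, 2), (2, 4), (0, 4)},  # ((w0 w1) (w2 w3))
    {(2, 4), (1, 4), (0, 4)},  # (w0 (w1 (w2 w3)))
]
score, tree = average_tree(teachers, 4)
```

Because the dynamic program ranges over all binary trees of the sentence, the result is not restricted to trees that some teacher actually produced.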
1. Lack of clarification regarding the methodology design.
- The averaging tree is derived with the highest total F1 score compared with different teachers. Have the authors tried other methods of calculating the similarities between trees? Perhaps a fair comparison is needed to further indicate the effectiveness of the proposed tree averaging method.
- The authors did not provide a detailed explanation for choosing the seven unsupervised parsers introduced in Section 3.2 as teacher models. For
- The insight that different models are weakly correlated despite similar F1 is interesting and well motivates the approach
- The proposed dynamic program is intuitive and well explained
- Experiments are strong and thorough
- The analysis of gains from denoising vs. difference in expertise is well conducted
- The fact that the F1 gains from distillation do not carry over to the out-of-domain setting is a drawback and somewhat underexplored
- There is a lack of qualitative analysis of the types of behaviors that different model types exhibit, and of how ensembling actually combines them. Some of this is done in the Appendix, but it would be nice to see specific examples in the main paper, especially since that analysis is with respect to constituency labels, which the model isn't actually being evaluated on.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
