Generalizing Case Frames Using a Thesaurus and the MDL Principle
Hang Li, Naoki Abe (C&C Res. Labs., NEC)

TL;DR
This paper presents a novel method for automatically generalizing case-frame patterns from large corpora by leveraging a thesaurus and the MDL principle, leading to improved disambiguation performance.
Contribution
It introduces a new MDL-based approach that uses thesaurus tree cuts to efficiently generalize case-frame patterns from corpus data.
Findings
The method matches or improves on existing approaches to case-frame pattern acquisition.
The algorithm efficiently finds the tree cut model that is provably optimal, in the MDL sense, for the given frequency data.
Applying the acquired patterns to pp-attachment disambiguation shows the method's practical effectiveness.
Abstract
We address the problem of automatically acquiring case-frame patterns from large corpus data. In particular, we view this problem as that of estimating a (conditional) distribution over a partition of words, and propose a new generalization method based on the MDL (Minimum Description Length) principle. For efficiency, our method makes use of an existing thesaurus and restricts its attention to those partitions that are present as 'cuts' in the thesaurus tree, thus reducing the generalization problem to that of estimating the 'tree cut models' of the thesaurus. We then give an efficient algorithm which provably obtains the optimal tree cut model for the given frequency data, in the sense of MDL. We have used the case-frame patterns obtained using our method to resolve pp-attachment ambiguity. Our experimental results indicate that our method improves upon…
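The tree cut search described in the abstract can be sketched as a bottom-up recursion: at each thesaurus node, compare the description length of generalizing to that node against the description length of the best cuts found for its children, and keep whichever is shorter. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The names `Node`, `description_length`, and `find_mdl` are hypothetical, the thesaurus and counts are toy data made up for the example, the probability of a class is assumed to be shared uniformly among its words, and the parameter cost is taken as half of log2 of the sample size per class in the cut.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A thesaurus node; leaves are words, internal nodes are word classes."""
    name: str
    children: list = field(default_factory=list)
    count: int = 0  # observed frequency; only meaningful on leaves


def leaf_count(node):
    """Number of words (leaves) covered by this class."""
    return 1 if not node.children else sum(leaf_count(c) for c in node.children)


def freq(node):
    """Total observed frequency of the words covered by this class."""
    return node.count if not node.children else sum(freq(c) for c in node.children)


def description_length(cut, total):
    """Bits to describe the model parameters plus the data under `cut`.

    Assumption: each class spreads its probability uniformly over its
    words, so a word in class C has probability f(C) / (total * |C|).
    """
    param_bits = 0.5 * len(cut) * math.log2(total)
    data_bits = 0.0
    for cls in cut:
        f = freq(cls)
        if f > 0:  # classes with zero frequency contribute no data cost
            data_bits -= f * math.log2(f / (total * leaf_count(cls)))
    return param_bits + data_bits


def find_mdl(node, total):
    """Return the cut of the subtree at `node` with minimum description length."""
    if not node.children:
        return [node]
    child_cut = [n for c in node.children for n in find_mdl(c, total)]
    if description_length([node], total) <= description_length(child_cut, total):
        return [node]  # generalizing to this class is cheaper
    return child_cut


# Toy thesaurus with made-up counts (illustrative only).
tree = Node("ANIMAL", [
    Node("BIRD", [Node("swallow", count=0), Node("crow", count=2),
                  Node("eagle", count=2), Node("bird", count=4)]),
    Node("INSECT", [Node("bug", count=0), Node("bee", count=2),
                    Node("insect", count=0)]),
])
cut = find_mdl(tree, freq(tree))
print([n.name for n in cut])  # ['BIRD', 'INSECT']
```

On this toy data the recursion stops at the two mid-level classes: splitting BIRD or INSECT into individual words costs more in parameters than it saves in data bits, while merging both into ANIMAL loses slightly more fit than it saves. Because each comparison is local to a subtree, every node is visited once, which is what keeps the search efficient.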
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
