Compacting the Penn Treebank Grammar
Alexander Krotov, Mark Hepple, Robert Gaizauskas, Yorick Wilks (Department of Computer Science, University of Sheffield, UK)

TL;DR
This paper investigates the size and completeness of the grammar read off the Penn Treebank and proposes a rule compaction method that eliminates redundant rules, significantly reducing grammar size while largely maintaining parsing performance.
Contribution
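The paper introduces a rule compaction algorithm for the PTB grammar that eliminates rules whose right-hand sides can be parsed by other rules; the size of the resulting compacted grammar approaches a limit. An enhanced probabilistic version of the algorithm improves the linguistic plausibility of the result.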
Findings
The size of the compacted grammar approaches a limit once redundant rules are eliminated.
Compaction based on rule probabilities maintains parsing performance.
Grammar size is reduced significantly (by up to 69%) with minimal loss in performance.
Abstract
Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules -- rules whose right hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly less than that of the full treebank grammar, is shown to approach a limit. However, such a…
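The redundancy test described in the abstract (a rule is redundant if its right-hand side can be parsed by other rules) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy rule set, the function names `parses` and `compact`, and the longest-rule-first processing order are all assumptions made for illustration, and the sketch assumes a grammar without empty right-hand sides.

```python
# Illustrative sketch only; names and processing order are assumptions.
# A rule is a pair (lhs, rhs) where rhs is a tuple of category symbols.

def _covers(rhs, i, j, chart, cats):
    """True if the symbols in `rhs` can be laid over positions i..j."""
    if len(rhs) == 1:
        # Unary rule: the child must already span the whole segment.
        return rhs[0] in cats

    def split(pos, start):
        if pos == len(rhs):
            return start == j
        remaining = len(rhs) - pos - 1  # symbols still needing >= 1 position
        for k in range(start + 1, j - remaining + 1):
            if rhs[pos] in chart[(start, k)] and split(pos + 1, k):
                return True
        return False

    return split(0, i)


def parses(seq, goal, rules):
    """True if `rules` can build a tree rooted in `goal` over `seq`."""
    n = len(seq)
    chart = {}  # chart[(i, j)] = categories spanning seq[i:j]
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cats = {seq[i]} if width == 1 else set()
            changed = True
            while changed:  # repeat so chains of unary rules are closed
                changed = False
                for lhs, rhs in rules:
                    if lhs not in cats and _covers(rhs, i, j, chart, cats):
                        cats.add(lhs)
                        changed = True
            chart[(i, j)] = cats
    return goal in chart[(0, n)]


def compact(rules):
    """Drop every rule whose right-hand side the remaining rules can parse."""
    kept = set(rules)
    # Assumption: try the longest right-hand sides first.
    for rule in sorted(rules, key=lambda r: (-len(r[1]), r)):
        lhs, rhs = rule
        if parses(rhs, lhs, kept - {rule}):
            kept.discard(rule)
    return kept


# Hypothetical toy grammar, not taken from the treebank.
TOY_RULES = {
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("NP", ("DT", "NN", "PP")),  # parsable as [NP [NP DT NN] PP]
    ("PP", ("IN", "NP")),
}

compacted = compact(TOY_RULES)
```

In this toy grammar the flat rule `NP -> DT NN PP` is removed because `NP -> DT NN` and `NP -> NP PP` together already cover its right-hand side, while the other three rules are kept. The outcome of such a procedure can depend on the order in which rules are examined, which is why the ordering is flagged as an assumption here.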
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
