Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction
Maksym Tarnavskyi, Artem Chernodub, Kostiantyn Omelianchuk

TL;DR
This paper improves grammatical error correction (GEC) in two ways: it ensembles Large Transformer-based sequence taggers, reaching state-of-the-art results without pre-training on synthetic data, and it distills the trained ensemble into new synthetic training datasets for single models.
Contribution
It introduces a novel ensembling approach for sequence taggers and demonstrates effective knowledge distillation for generating training data, improving GEC performance.
Findings
The best ensemble achieves a new SOTA $F_{0.5}$ score of 76.05 on BEA-2019 (test).
Knowledge distillation with ensemble-generated data improves single model performance.
The best single model achieves a near-SOTA $F_{0.5}$ score of 73.21; to the authors' knowledge, it is behind only a much heavier T5 model.
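For reference, the $F_{0.5}$ scores above are instances of the standard $F_\beta$ measure with $\beta = 0.5$, which weights precision more heavily than recall. The symbols $P$ (precision) and $R$ (recall) are introduced here for illustration and do not appear in the summary itself:

$$
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
$$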
Abstract
In this paper, we investigate improvements to the GEC sequence tagging architecture with a focus on ensembling of recent cutting-edge Transformer-based encoders in Large configurations. We encourage ensembling models by majority votes on span-level edits because this approach is tolerant to the model architecture and vocabulary size. Our best ensemble achieves a new SOTA result with an $F_{0.5}$ score of 76.05 on BEA-2019 (test), even without pre-training on synthetic datasets. In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW". Our best single sequence tagging model that is pretrained on the generated Troy-datasets in combination with the publicly available synthetic PIE dataset achieves a near-SOTA (To the best of our knowledge, our best single model gives way only to much heavier T5 model…
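The majority vote on span-level edits mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `(start, end, replacement)` edit representation, the `min_votes` threshold, and both function names are assumptions made for illustration, and the sketch assumes the kept edits do not overlap.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical edit representation (an assumption, not the paper's format):
# replace tokens[start:end] with the whitespace-split replacement string.
Edit = Tuple[int, int, str]


def majority_vote_edits(model_edits: List[List[Edit]], min_votes: int) -> List[Edit]:
    """Keep every edit that at least `min_votes` models proposed."""
    counts: Counter = Counter()
    for edits in model_edits:
        # set() ensures a single model cannot vote twice for the same edit.
        counts.update(set(edits))
    return sorted(edit for edit, votes in counts.items() if votes >= min_votes)


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping edits right to left so earlier indices stay valid."""
    out = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        out[start:end] = replacement.split()
    return out


# Toy example: three hypothetical models propose edits for one sentence.
tokens = "She go to school yesterday".split()
model_a = [(1, 2, "went")]
model_b = [(1, 2, "went"), (3, 4, "the school")]
model_c = [(1, 2, "goes")]

kept = majority_vote_edits([model_a, model_b, model_c], min_votes=2)
print(" ".join(apply_edits(tokens, kept)))  # prints "She went to school yesterday"
```

Because the vote operates on edits to the text rather than on per-token tag distributions, the models being combined do not need to share an architecture or a vocabulary, which matches the tolerance the abstract describes.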
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Adafactor · SentencePiece · Gated Linear Unit · Dropout
