Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings

Rob van der Goot; Miryam de Lhoneux

arXiv:2112.03625·cs.CL·December 8, 2021·1 cites

TL;DR

This paper investigates dataset embeddings in transformer-based multilingual dependency parsers. It finds that they still help, especially for small or low-performing datasets, and compares two strategies for embedding the dataset.

Contribution

The paper compares two methods of embedding datasets in a transformer-based multilingual dependency parser and evaluates them extensively, showing that dataset embeddings remain beneficial with contextualized transformer-based models.

Findings

1. Embedding dataset information improves parser performance.

2. Embedding the dataset at the encoder level yields the highest performance gains.

3. Training on the combination of all datasets performs comparably to training on language-based clusters.

Abstract

With an increase of dataset availability, the potential for learning from a variety of data sources has increased. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as distinguishing features between datasets. However, these dataset embeddings have mostly been used before contextualized transformer-based embeddings were introduced in the field of Natural Language Processing. In this work, we compare two methods to embed datasets in a transformer-based multilingual dependency parser, and perform an extensive evaluation. We show that: 1) embedding the dataset is still beneficial with these models; 2) performance increases are highest when embedding the dataset at the encoder level; 3) unsurprisingly, we confirm that performance increases are highest for small…
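The abstract contrasts embedding the dataset at the encoder level with an alternative placement, but it does not say how the dataset vector is combined with the token representations. The sketch below is one possible reading, not the authors' implementation. It assumes that encoder-level embedding sums a per-dataset vector with every token vector before the encoder, and that the alternative concatenates it to every encoder output before the decoder. All names (`make_table`, `embed_at_encoder`, `embed_at_decoder`, `treebank_a`, `treebank_b`) are hypothetical. Plain Python lists are used so the example has no dependencies; a real parser would use framework tensors and learned parameters.

```python
import random


def make_table(num_rows, dim, seed):
    """Random lookup table; a stand-in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_at_encoder(token_vectors, dataset_vector):
    """Assumed encoder-level variant: add the dataset vector to every
    token vector before the sequence enters the encoder."""
    return [[t + d for t, d in zip(tok, dataset_vector)] for tok in token_vectors]


def embed_at_decoder(encoder_outputs, dataset_vector):
    """Assumed decoder-level variant: append the dataset vector to every
    encoder output before it is passed to the parsing decoder."""
    return [list(out) + list(dataset_vector) for out in encoder_outputs]


DIM = 4
DATASET_IDS = {"treebank_a": 0, "treebank_b": 1}  # hypothetical dataset names

token_table = make_table(10, DIM, seed=0)
dataset_table = make_table(len(DATASET_IDS), DIM, seed=1)

sentence = [3, 7, 1]  # hypothetical token ids
tokens = [token_table[i] for i in sentence]
ds_vec = dataset_table[DATASET_IDS["treebank_b"]]

# Encoder level: dimensionality is unchanged, every token is shifted by ds_vec.
encoder_input = embed_at_encoder(tokens, ds_vec)

# Decoder level: the token vectors stand in for encoder outputs here;
# each one grows by the size of the dataset vector.
decoder_input = embed_at_decoder(tokens, ds_vec)

print(len(encoder_input[0]), len(decoder_input[0]))
```

Under these assumptions the encoder-level variant lets the dataset signal influence every contextualized representation, while the decoder-level variant only exposes it to the final prediction layers.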

Code & Models

Repositories

https://bitbucket.org/robvanderg/dataembs2
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis