On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions
Rob van der Goot, Ahmet Üstün, Barbara Plank

TL;DR
This paper investigates how dataset embeddings influence model performance across monolingual, multilingual, and zero-shot scenarios, revealing they are most effective when training and test data share the same language and distribution.
Contribution
It provides a comprehensive comparison of dataset embedding effectiveness across different language and data distribution settings, highlighting their limitations in zero-shot conditions.
Findings
Performance gains are highest with same-language datasets.
Effectiveness diminishes when test data is from unseen distributions.
Dataset embeddings are less beneficial in zero-shot scenarios.
Abstract
Recent complementary strands of research have shown that leveraging information on the data source through encoding their properties into embeddings can lead to performance increase when training a single model on heterogeneous data sources. However, it remains unclear in which situations these dataset embeddings are most effective, because they are used in a large variety of settings, languages and tasks. Furthermore, it is usually assumed that gold information on the data source is available, and that the test data is from a distribution seen during training. In this work, we compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source label in a zero-shot setting. We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different…
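As a rough illustration of the idea described in the abstract, the sketch below appends a per-dataset vector to every token vector so that a single model can tell its data sources apart. It is a minimal toy under stated assumptions, not the authors' implementation: the class `DatasetEmbedder`, the dataset names `treebank_a` and `treebank_b`, the dimensions, and the random initialisation are all hypothetical, and a real model would learn both embedding tables during training.

```python
import random


class DatasetEmbedder:
    """Toy encoder: concatenates a dataset vector to each token vector.

    Hypothetical illustration only; vectors are random, not learned.
    """

    def __init__(self, vocab, dataset_ids, word_dim=4, dataset_dim=2, seed=0):
        rng = random.Random(seed)
        # One vector per known word.
        self.word_vecs = {
            w: [rng.uniform(-1, 1) for _ in range(word_dim)] for w in vocab
        }
        # One vector per data source seen during training.
        self.dataset_vecs = {
            d: [rng.uniform(-1, 1) for _ in range(dataset_dim)] for d in dataset_ids
        }
        # Unknown words map to a zero vector.
        self.unk = [0.0] * word_dim

    def encode(self, tokens, dataset_id):
        """Return one vector per token: word vector followed by dataset vector."""
        ds_vec = self.dataset_vecs[dataset_id]
        return [self.word_vecs.get(t, self.unk) + ds_vec for t in tokens]


embedder = DatasetEmbedder(
    vocab=["the", "cat", "sleeps"],
    dataset_ids=["treebank_a", "treebank_b"],
)
encoded = embedder.encode(["the", "cat"], dataset_id="treebank_a")
```

With gold source information, `dataset_id` is simply the dataset a sentence came from. In the zero-shot setting the abstract mentions, no such label exists for the test data, so the identifier passed to `encode` would have to be a predicted one chosen among the datasets seen in training.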
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
