Investigating Multi-source Active Learning for Natural Language Inference

Ard Snijders; Douwe Kiela; Katerina Margatina

arXiv:2302.06976 · cs.CL · February 15, 2023

TL;DR

This paper examines multi-source active learning for natural language inference. It finds that common strategies struggle with outliers and source variability, but can improve with outlier removal and difficulty-aware testing.

Contribution

It demonstrates the limitations of existing active learning methods in multi-source settings and proposes analysis techniques to understand and improve their performance.

Findings

01 Uncertainty-based strategies perform poorly on multi-source data because they acquire collective outliers, i.e., hard-to-learn instances.

02 Once outliers are removed, the strategies recover and outperform random selection.

03 Different sources produce different types of outliers and learnability challenges.

Abstract

In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools comprised of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not…
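To make the setting in the abstract concrete, here is a minimal sketch of one acquisition step over a pool drawn from several sources. It is not the authors' implementation: the excerpt does not name the four strategies, so maximum predictive entropy is assumed here as a representative uncertainty-based strategy, and the pool, the source names (`source_a`, `source_b`, `source_c`), and the predicted probabilities are all hypothetical.

```python
import math
import random
from collections import Counter


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acquire_max_entropy(pool_probs, k):
    """Return indices of the k pool examples with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


def acquire_random(pool_size, k, seed=0):
    """Random baseline: k indices sampled uniformly without replacement."""
    return random.Random(seed).sample(range(pool_size), k)


def source_counts(indices, sources):
    """Count how many acquired examples come from each source."""
    return Counter(sources[i] for i in indices)


# Hypothetical unlabelled pool: (source, model probabilities over the
# three NLI labels: entailment, neutral, contradiction).
pool = [
    ("source_a", [0.90, 0.05, 0.05]),
    ("source_a", [0.80, 0.10, 0.10]),
    ("source_b", [0.34, 0.33, 0.33]),
    ("source_b", [0.40, 0.35, 0.25]),
    ("source_c", [0.50, 0.45, 0.05]),
    ("source_c", [0.98, 0.01, 0.01]),
]
sources = [source for source, _ in pool]
pool_probs = [probs for _, probs in pool]

picked = acquire_max_entropy(pool_probs, k=2)
baseline = acquire_random(len(pool), k=2)
print("max-entropy:", picked, dict(source_counts(picked, sources)))
print("random:     ", baseline, dict(source_counts(baseline, sources)))
```

In this toy pool both max-entropy picks come from `source_b`, which illustrates how an uncertainty-based strategy can concentrate on one source's hardest examples. Whether those examples are useful or are collective outliers is the question the paper investigates.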

Code & Models

Repositories

asnijders/multi_source_al (PyTorch, official)

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Topic Modeling

Methods: fail · Test