Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems

Kamil Kanclerz; Julita Bielaniewicz; Marcin Gruza; Jan Kocoń; Stanisław Woźniak; Przemysław Kazienko

arXiv:2312.08198·cs.CL·December 14, 2023·1 cites

TL;DR

This paper introduces a model-based data acquisition method for subjective multi-task NLP problems, reducing annotation costs by up to 40% while maintaining knowledge quality, and highlights the importance of diverse data collection.

Contribution

It proposes a novel task-specific annotation selection approach for multi-task NLP, improving efficiency and exploring the impact of data diversity and task relations.

Findings

01. Up to 40% reduction in annotations with negligible knowledge loss
02. Training on model-predicted labels enhances task selection efficiency
03. Diversity in data collection is crucial for effective model training

Abstract

Human-annotated data is a source of knowledge: it describes the peculiarities of the problem and thereby fuels the decision process of the trained model. Unfortunately, the annotation process for subjective natural language processing (NLP) problems such as offensiveness or emotion detection is often very expensive and time-consuming. One inevitable risk is spending some of the funds and annotator effort on annotations that provide no additional knowledge about the specific task. To minimize these costs, we propose a new model-based approach that selects, individually for each text, which tasks are annotated in a multi-task scenario. Experiments on three datasets, dozens of NLP tasks, and thousands of annotations show that our method allows up to a 40% reduction in the number of annotations with negligible loss of knowledge. The results also…

Code & Models

Repositories

clarin-pl/model-based-data-acquisition (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning