Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems
Kamil Kanclerz, Julita Bielaniewicz, Marcin Gruza, Jan Kocoń, Stanisław Woźniak, Przemysław Kazienko

TL;DR
This paper introduces a model-based data acquisition method for subjective multi-task NLP problems that reduces the number of annotations by up to 40% with negligible loss of knowledge, and it highlights the importance of diverse data collection.
Contribution
It proposes a novel task-specific annotation selection approach for multi-task NLP, improving efficiency and exploring the impact of data diversity and task relations.
Findings
Up to 40% reduction in annotations with negligible knowledge loss
Training on model-predicted labels enhances task selection efficiency
Diversity in data collection is crucial for effective model training
Abstract
Data annotated by humans is a source of knowledge by describing the peculiarities of the problem and therefore fueling the decision process of the trained model. Unfortunately, the annotation process for subjective natural language processing (NLP) problems like offensiveness or emotion detection is often very expensive and time-consuming. One of the inevitable risks is to spend some of the funds and annotator effort on annotations that do not provide any additional knowledge about the specific task. To minimize these costs, we propose a new model-based approach that allows the selection of tasks annotated individually for each text in a multi-task scenario. The experiments carried out on three datasets, dozens of NLP tasks, and thousands of annotations show that our method allows up to 40% reduction in the number of annotations with negligible loss of knowledge. The results also…
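To make the idea of selecting tasks individually for each text more concrete, here is a minimal sketch. It is not the authors' method: the abstract does not state the selection criterion, so the sketch assumes a simple uncertainty rule (request human annotation only for tasks where a model's prediction is close to 0.5). The function name, the `confidence_margin` parameter, the task names, and the probabilities are all hypothetical.

```python
from typing import Dict, List


def select_tasks_to_annotate(
    predicted_probs: Dict[str, float],
    confidence_margin: float = 0.3,
) -> List[str]:
    """Return the tasks for which human annotation is requested for one text.

    ASSUMPTION (not from the paper): a task is selected when the model's
    predicted probability of the positive label lies within
    `confidence_margin` of 0.5, i.e. when the model is uncertain.
    """
    return [
        task
        for task, prob in predicted_probs.items()
        if abs(prob - 0.5) < confidence_margin
    ]


# Hypothetical model predictions for two texts and three subjective tasks.
predictions = {
    "text_1": {"offensive": 0.95, "funny": 0.55, "sad": 0.10},
    "text_2": {"offensive": 0.40, "funny": 0.02, "sad": 0.65},
}

# Per-text selection: each text gets its own subset of tasks to annotate.
selected = {
    text_id: select_tasks_to_annotate(probs)
    for text_id, probs in predictions.items()
}

# Fraction of (text, task) annotations that would be skipped.
total = sum(len(probs) for probs in predictions.values())
requested = sum(len(tasks) for tasks in selected.values())
reduction = 1 - requested / total
```

In this toy setup the confident predictions are skipped, and `reduction` gives the share of annotations saved. The actual criterion, model, and savings in the paper may differ.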
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
