Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection

Sophia Althammer; Guido Zuccon; Sebastian Hofstätter; Suzan Verberne; Allan Hanbury

arXiv:2309.06131·cs.IR·September 13, 2023

TL;DR

This paper evaluates active learning strategies for fine-tuning pretrained language model rankers, finding that current strategies do not outperform random selection and often incur higher costs, highlighting the need for better data selection methods.

Contribution

The study systematically compares active learning strategies with random selection for fine-tuning PLM-based rankers, revealing their limitations and the existence of more effective data subsets.

Findings

01. Active learning (AL) strategies do not significantly outperform random selection.

02. AL strategies often require more annotation effort and cost.

03. Effective data subsets exist but are not identified by current AL methods.

Abstract

Search methods based on Pretrained Language Models (PLM) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain-specific tasks. In this paper, we investigate fine-tuning PLM-based rankers under limited training data and budget. We investigate two scenarios: fine-tuning a ranker from scratch, and domain adaptation starting with a ranker already fine-tuned on general data, and continuing fine-tuning on a target dataset. We observe a great variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that it is possible to achieve effectiveness gains by actively selecting a subset of the training data that has…
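The contrast the abstract draws, between picking training queries at random and picking them actively under a fixed annotation budget, can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the function names, the toy ranker scores, and the margin-based uncertainty criterion are hypothetical and are not taken from the paper or from the sophiaalthammer/alforrankers repository.

```python
import random


def select_random(pool, budget, seed=0):
    """Baseline: draw `budget` query ids uniformly at random from the pool."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.sample(pool, budget)


def select_by_uncertainty(pool, budget, scores):
    """Illustrative AL strategy (an assumption, not the paper's method):
    prefer queries whose two best-scored passages are closest together,
    i.e. where the current ranker is least decided."""

    def margin(query_id):
        top = sorted(scores[query_id], reverse=True)
        return top[0] - top[1]

    # Smallest margin first, then cut at the annotation budget.
    return sorted(pool, key=margin)[:budget]


# Toy unlabeled pool with hypothetical ranker scores per candidate passage.
pool = ["q1", "q2", "q3", "q4"]
scores = {
    "q1": [0.90, 0.10, 0.05],
    "q2": [0.55, 0.50, 0.10],
    "q3": [0.70, 0.40, 0.20],
    "q4": [0.60, 0.58, 0.30],
}

random_subset = select_random(pool, budget=2)
uncertain_subset = select_by_uncertainty(pool, budget=2, scores=scores)
```

Both functions return the same number of queries to annotate, so any comparison between them holds the annotation budget fixed. Repeating `select_random` with different seeds gives different subsets, which is the source of the variability the abstract reports.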

Code & Models

Repositories

sophiaalthammer/alforrankers (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms