A Novel Two-Step Fine-Tuning Pipeline for Cold-Start Active Learning in Text Classification Tasks

Fabiano Belém; Washington Cunha; Celso França; Claudio Andrade; Leonardo Rocha; Marcos André Gonçalves

arXiv:2407.17284 · cs.LG · July 25, 2024

TL;DR

This paper introduces DoTCAL, a two-step fine-tuning pipeline that combines domain adaptation on unlabeled data with active learning to improve BERT-based text classification in cold-start scenarios, with reported Macro-F1 gains of up to 33%.

Contribution

The paper proposes DoTCAL, a novel two-step fine-tuning pipeline that reduces reliance on labeled data and makes active learning more effective for BERT in cold-start text classification.

Findings

01

DoTCAL outperforms traditional methods with up to 33% higher Macro-F1.

02

BoW and LSI sometimes outperform BERT, especially in low-resource tasks.

03

Using unlabeled data via domain adaptation improves model performance.
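
The first finding is stated in Macro-F1, the unweighted mean of per-class F1 scores, which gives rare classes the same weight as frequent ones. The sketch below is a minimal pure-Python illustration of that metric, not the paper's evaluation code; the function name `macro_f1` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example (hypothetical data): always predicting the majority class
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8, while Macro-F1 is only about 0.44 because class `b` is never predicted, which is why the metric suits benchmarks with skewed class distributions.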

Abstract

This is the first work to investigate the effectiveness of BERT-based contextual embeddings in active learning (AL) tasks on cold-start scenarios, where traditional fine-tuning is infeasible due to the absence of labeled data. Our primary contribution is the proposal of a more robust fine-tuning pipeline - DoTCAL - that diminishes the reliance on labeled data in AL using two steps: (1) fully leveraging unlabeled data through domain adaptation of the embeddings via masked language modeling and (2) further adjusting model weights using labeled data selected by AL. Our evaluation contrasts BERT-based embeddings with other prevalent text representation paradigms, including Bag of Words (BoW), Latent Semantic Indexing (LSI), and FastText, at two critical stages of the AL process: instance selection and classification. Experiments conducted on eight ATC benchmarks with varying AL budgets…
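
The order of the two steps described in the abstract can be sketched as plain control flow. This is only an illustration under stated assumptions, not the authors' implementation: `two_step_pipeline` and every `toy_*` function are hypothetical stand-ins, the selection is done in a single batch rather than iteratively, and a real system would replace `adapt` with masked language modeling on a BERT model and `fine_tune` with supervised training.

```python
from typing import Callable, List, Sequence, Tuple


def two_step_pipeline(
    unlabeled: Sequence[str],
    budget: int,
    adapt: Callable,      # hypothetical: unlabeled texts -> adapted encoder
    select: Callable,     # hypothetical: (encoder, texts, k) -> indices to label
    oracle: Callable,     # hypothetical: text -> label (the human annotator)
    fine_tune: Callable,  # hypothetical: (encoder, labeled pairs) -> classifier
):
    # Step 1: adapt the representation using ALL unlabeled text (no labels needed).
    encoder = adapt(unlabeled)
    # Active learning: choose at most `budget` instances to send for labeling.
    chosen: List[int] = select(encoder, unlabeled, budget)[:budget]
    labeled: List[Tuple[str, str]] = [(unlabeled[i], oracle(unlabeled[i])) for i in chosen]
    # Step 2: adjust the adapted model with the small labeled set.
    return fine_tune(encoder, labeled), labeled


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---
def toy_adapt(texts):
    # Placeholder for domain adaptation: just records the corpus vocabulary.
    return {"vocab": {w for t in texts for w in t.lower().split()}}


def toy_select(encoder, texts, k):
    # Placeholder strategy: prefer documents with the most distinct words.
    order = sorted(range(len(texts)), key=lambda i: -len(set(texts[i].lower().split())))
    return order[:k]


def toy_oracle(text):
    return "sports" if "match" in text.lower() else "other"


def toy_fine_tune(encoder, labeled):
    return {"encoder": encoder, "train_size": len(labeled)}


docs = ["The match ended in a draw", "Rates rose", "A tense match decided late in extra time"]
model, labeled = two_step_pipeline(docs, 2, toy_adapt, toy_select, toy_oracle, toy_fine_tune)
```

Passing the four stages as callables keeps the sketch independent of any particular text representation, which mirrors the abstract's point that different representations can be used for instance selection and for classification.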

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Multi-Head Attention · Residual Connection · Dropout · WordPiece