Taking into Account the Differences between Actively and Passively Acquired Data: The Case of Active Learning with Support Vector Machines for Imbalanced Datasets

Michael Bloodgood; K. Vijay-Shanker

arXiv:1409.4835 · cs.LG · September 18, 2014 · 2 cites

TL;DR

This paper explores how active learning (AL) with cost-weighted SVMs can better handle imbalanced datasets by basing cost models on an estimate of overall corpus imbalance, computed from a small unbiased sample, rather than on the imbalance in the labeled training data, which is the leading approach in passive learning (PL).

Contribution

The paper introduces InitPA, a method that sets cost models during active learning from an estimate of overall corpus imbalance rather than from the imbalance observed in the labeled training data, which is the leading approach in passive learning.

Findings

1. InitPA improves imbalance handling in active learning.

2. Active learning with InitPA outperforms passive learning in imbalanced scenarios.

3. The method is effective for high-imbalance datasets in HLT tasks.

Abstract

Actively sampled data can have very different characteristics than passively sampled data. Therefore, it's promising to investigate using different inference procedures during active learning (AL) than are used during passive learning (PL). This general idea is explored in detail for the focused case of AL with cost-weighted SVMs for imbalanced data, a situation that arises for many human language technology (HLT) tasks. The key idea behind the proposed InitPA method for addressing imbalance is to base cost models during AL on an estimate of overall corpus imbalance computed via a small unbiased sample, rather than on the imbalance in the labeled training data, which is the leading method used during PL.
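
The contrast described in the abstract can be illustrated with a minimal sketch. It assumes the positive-class cost is set to the ratio of negatives to positives, a common heuristic for cost-weighted SVMs; the abstract does not give the paper's exact formula. The function name `positive_class_cost` and the label counts are hypothetical and chosen only for illustration.

```python
def positive_class_cost(labels):
    """Cost weight for the positive (minority) class.

    Assumption: weight = #negatives / #positives. This is a common
    heuristic, not necessarily the formula used in the paper.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("need at least one positive label to estimate imbalance")
    return negatives / positives


# Hypothetical small unbiased (random) sample drawn before active learning:
# 10 positives and 190 negatives, mirroring a corpus with about 5% positives.
initial_sample = [1] * 10 + [0] * 190

# Hypothetical labels acquired afterwards by active selection. Active
# selection is not an unbiased sample, so its class mix can differ sharply
# from the corpus; here it is 40 positives and 60 negatives.
actively_selected = [1] * 40 + [0] * 60
labeled_so_far = initial_sample + actively_selected

# Cost from the unbiased sample only (the idea the abstract attributes to InitPA).
init_cost = positive_class_cost(initial_sample)

# Cost from all labeled training data (the approach the abstract attributes to PL).
labeled_cost = positive_class_cost(labeled_so_far)

print(f"cost from unbiased initial sample: {init_cost}")
print(f"cost from all labeled data:        {labeled_cost}")
```

With these made-up counts, the two estimates differ by almost a factor of four, which shows why the choice of estimate matters once the labeled set is no longer an unbiased sample. Either value could then be supplied as the positive-class weight of a cost-weighted SVM, for example through the `class_weight` parameter of scikit-learn's `SVC`; that library choice is an assumption here, not something the source specifies.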

Taxonomy

Topics: Machine Learning and Algorithms · Text and Document Classification Technologies · Imbalanced Data Classification Techniques