Enhancing Cost Efficiency in Active Learning with Candidate Set Query

Yeho Gwon; Sehyun Hwang; Hoyoung Kim; Jungseul Ok; Suha Kwak

arXiv:2502.06209·cs.LG·August 20, 2025

Open Access·3 Reviews

TL;DR

This paper proposes a cost-efficient active learning framework that combines candidate set queries with conformal prediction. Evaluated on CIFAR-10, CIFAR-100, and ImageNet64x64, it reduces labeling cost by 48% on ImageNet64x64.

Contribution

Introduces a novel candidate set query method combined with conformal prediction for adaptive, low-cost active learning in image classification.

Findings

01. Reduces labeling cost by 48% on ImageNet64x64

02. Effective and scalable across multiple datasets

03. Improves efficiency of active learning process

Abstract

This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at…
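The way conformal prediction turns model confidences into a candidate set can be illustrated with a minimal sketch. Everything in it is an assumption rather than the authors' implementation: the function names are hypothetical, the nonconformity score `1 - p(true class)` is one common split-conformal choice and may differ from the score used in the paper, and the cost model in `assumed_query_cost` (bits needed to choose among the candidates plus a "none of these" option, followed by a second choice among the remaining classes on a miss) is a simple stand-in for illustration only.

```python
import math

import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold on the score 1 - p(true class).

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or 1.0 (every class admitted) when that rank exceeds n.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    rank = max(1, math.ceil((n + 1) * (1.0 - alpha)))
    return 1.0 if rank > n else float(scores[rank - 1])


def candidate_sets(probs, qhat):
    """Boolean mask of shape (samples, classes): True marks a candidate class."""
    probs = np.asarray(probs, dtype=float)
    return (1.0 - probs) <= qhat


def assumed_query_cost(mask, true_labels, num_classes):
    """Per-sample cost in bits under an ASSUMED cost model (not from the paper).

    Hit:  log2(k + 1), one choice among k candidates plus "none of these".
    Miss: log2(k + 1) + log2(num_classes - k), a second choice among the rest.
    """
    mask = np.asarray(mask, dtype=bool)
    true_labels = np.asarray(true_labels, dtype=int)
    k = mask.sum(axis=1)
    hit = mask[np.arange(len(true_labels)), true_labels]
    first = np.log2(k + 1)
    # Clamp to 1 so the unused miss branch never evaluates log2(0).
    second = np.log2(np.maximum(num_classes - k, 1))
    return np.where(hit, first, first + second)
```

Under these assumptions, a more accurate model yields smaller calibration scores, so the threshold and the candidate sets shrink over successive rounds, which matches the adaptive behavior the abstract describes. For comparison, a conventional query over all classes would cost `log2(num_classes)` bits per sample in the same assumed model.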

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 4

Strengths

1. This paper introduces a novel approach called Candidate Set Query (CSQ), which effectively reduces labeling costs by narrowing down the candidate classes presented to annotators, thereby minimizing annotation time.
2. The proposed method leverages conformal prediction to dynamically produce accurate candidate labels based on a cost-efficient data acquisition function. This function prioritizes samples with high information gain, leading to greater efficiency and reduced labeling costs.
3. The

Weaknesses

1. The rationale behind the cost-efficient acquisition function in Eq. (8) needs to be further explained. Additional motivation and explanation for this function are recommended.
2. As shown in Fig. 9a, the performance is sensitive to the hyperparameter d. Providing guidelines for setting this parameter to an appropriate range on different datasets would be beneficial.
3. In realistic scenarios, the samples with high uncertainty waiting to be annotated can be divided into two groups based on the

Reviewer 02·Rating 5·Confidence 4

Strengths

1. The content of the paper is well presented.
2. The paper studies the cost of AL queries in a more realistic way and proposes a solution for reducing that cost through candidate set queries.
3. The candidate set is formed by conformal prediction, and the candidate labels are related to the expected information gain with cost considerations.

Weaknesses

The proposed method still depends on conformal prediction and the calibration set to determine the confidence level. It is a realistic solution; however, it is not guaranteed to be theoretically sound. Convergence cannot be obtained in a proper label complexity analysis. Similarly, the labeling cost assumption in Theorem 3.1 is only a rough approximation.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The motivation for this paper is clear, and the paper proposes a novel framework of high significance.
2. The paper presents a solid theoretical framework that is thoroughly explained and mostly straightforward to follow.
3. The framework is benchmarked across three well-known datasets, empirically demonstrating the effectiveness of the method.
4. Thorough ablation studies were conducted to highlight the significance of each component of the framework.

Weaknesses

1. The benchmarks are conducted on very similar datasets (CIFAR-10, CIFAR-100, and ImageNet64x64 are all image classification datasets), and the method is compared against only a small number of baseline AL methods. It is unclear whether the results will generalize across different datasets and domains, or hold when more advanced underlying AL acquisition methods are used.
2. The paper does not consider the implications of real-world datasets, such as how those containing label noise, class imbalance, etc., might imp

Taxonomy

Topics: Machine Learning and Algorithms · Educational Technology and Assessment · Intelligent Tutoring Systems and Adaptive Learning