ConBatch-BAL: Batch Bayesian Active Learning under Budget Constraints

Pablo G. Morato; Charalampos P. Andriotis; Seyran Khademi

arXiv:2507.04929·cs.LG·July 8, 2025

TL;DR

This paper introduces two Bayesian active learning strategies designed for batch selection under budget constraints, utilizing uncertainty metrics from Bayesian neural networks to optimize data annotation costs in real-world geospatial applications.

Contribution

It proposes novel ConBatch-BAL strategies with dynamic thresholding and greedy approaches, tailored for constrained batch active learning in practical scenarios.

Findings

1. ConBatch-BAL reduces active learning iterations and costs.
2. Strategies outperform random baseline in real-world datasets.
3. Effective under various budget and cost scenarios.

Abstract

Varying annotation costs among data points and budget constraints can hinder the adoption of active learning strategies in real-world applications. This work introduces two Bayesian active learning strategies for batch acquisition under constraints (ConBatch-BAL), one based on dynamic thresholding and one following greedy acquisition. Both select samples using uncertainty metrics computed via Bayesian neural networks. The dynamic thresholding strategy redistributes the budget across the batch, while the greedy one selects the top-ranked sample at each step, limited by the remaining budget. Focusing on scenarios with costly data annotation and geospatial constraints, we also release two new real-world datasets containing geolocated aerial images of buildings, annotated with energy efficiency or typology classes. The ConBatch-BAL strategies are benchmarked against a random acquisition…
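The greedy strategy described in the abstract can be illustrated with a minimal sketch. It assumes a fixed, known cost per candidate and precomputed uncertainty scores; the function name `greedy_budgeted_batch` and its parameters are hypothetical, and the paper's actual procedure (for example, costs that depend on previously selected locations, or scores recomputed within the batch) may differ.

```python
from typing import List, Sequence


def greedy_budgeted_batch(
    scores: Sequence[float],
    costs: Sequence[float],
    budget: float,
    max_batch_size: int,
) -> List[int]:
    """Take candidates in descending score order, skipping unaffordable ones.

    Assumption: costs are fixed per candidate and scores are not
    recomputed as the batch grows.
    """
    # Rank candidate indices from most to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected: List[int] = []
    remaining = budget
    for i in ranked:
        if len(selected) >= max_batch_size:
            break
        # Only accept a candidate whose cost fits the remaining budget.
        if costs[i] <= remaining:
            selected.append(i)
            remaining -= costs[i]
    return selected


scores = [0.9, 0.8, 0.5, 0.4]   # hypothetical uncertainty scores
costs = [6.0, 5.0, 3.0, 1.0]    # hypothetical annotation costs
batch = greedy_budgeted_batch(scores, costs, budget=10.0, max_batch_size=3)
print(batch)  # [0, 2, 3]
```

In this toy run the second-ranked candidate is skipped because its cost exceeds what is left after the first pick, so the remaining budget goes to cheaper, lower-ranked samples.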

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The paper proposes an algorithm for budget-constrained dynamic querying that can be applied to the examined datasets. The analytic results show the trade-off between budget and informativeness (measured by mutual information). An appropriate strategy for reducing labeling costs while gaining substantial information under budget constraints is established and shows potential for application to real-world problems.
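For context, the mutual information mentioned in this review is commonly estimated from Monte Carlo samples of a Bayesian neural network's predictive distribution, as the entropy of the averaged prediction minus the average per-sample entropy. The sketch below shows that standard estimator; whether the paper uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(mc_probs: Sequence[Sequence[float]]) -> float:
    """Estimate mutual information between the label and the model weights.

    mc_probs[t][k] is the class-k probability from the t-th posterior
    sample (e.g. one stochastic forward pass). This is an assumed input
    layout for illustration.
    """
    n_samples = len(mc_probs)
    n_classes = len(mc_probs[0])
    # Average the class probabilities over posterior samples.
    mean_probs = [
        sum(sample[k] for sample in mc_probs) / n_samples
        for k in range(n_classes)
    ]
    predictive_entropy = entropy(mean_probs)
    expected_entropy = sum(entropy(sample) for sample in mc_probs) / n_samples
    return predictive_entropy - expected_entropy


# Posterior samples that disagree confidently: high mutual information.
disagreeing = [[1.0, 0.0], [0.0, 1.0]]
# Posterior samples that agree on an uncertain prediction: zero mutual information.
agreeing = [[0.5, 0.5], [0.5, 0.5]]
print(mutual_information(disagreeing))
print(mutual_information(agreeing))
```

The two toy inputs have the same averaged prediction, but only the first reflects disagreement between posterior samples, which is what this score rewards.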

Weaknesses

$\bullet$ The main issue concerns $c(\mathbf{x})$. There seems to be no general $c(\mathbf{x})$ that applies across dataset types such as tabular datasets, conventional image datasets, and so on. In the paper, $c(\mathbf{x})$ is tied only to geospatial constraints, which weakens the contribution. For image datasets, could the variance of each pixel serve as $c(\mathbf{x})$? At the least, more general criteria for $c(\mathbf{x})$ are needed. $\bullet$ Second, the range

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Diverse datasets and interesting applications. The paper's innovation lies in introducing dynamic thresholding to address varying annotation costs in AL. The problem definition is clear and worth studying. The experimental datasets are diverse: they are not limited to typical vision datasets (CIFAR10, CIFAR100, Tiny ImageNet) but also include aerial images.

Weaknesses

1. The authors should provide more hyperparameter tuning or an ablation study for Section 4.1 on how to choose the dynamic threshold. The initial threshold, c_th,1 = c_max / n_max, is the total budget divided by the maximum allowed number of batch steps. This choice is intuitive, but it may not adapt well to scenarios with highly varying annotation costs or where uncertainty scores vary significantly from step to step. Can you conduct an ablation study on the impact of different n_max values could
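To make the quoted initial threshold c_max / n_max concrete, here is a rough sketch of one way a per-step cost threshold could work. Only the initial value comes from the text above; the update rule (remaining budget divided by remaining steps), the fixed per-candidate costs, and the name `dynamic_threshold_batch` are assumptions for illustration, not the authors' stated algorithm.

```python
from typing import List, Sequence


def dynamic_threshold_batch(
    scores: Sequence[float],
    costs: Sequence[float],
    c_max: float,
    n_max: int,
) -> List[int]:
    """Select up to n_max candidates, capping each step's cost by a threshold."""
    selected: List[int] = []
    remaining = c_max
    for step in range(n_max):
        # Assumed update rule: spread the remaining budget evenly over
        # the steps still to come. At step 0 this equals c_max / n_max.
        threshold = remaining / (n_max - step)
        affordable = [
            i for i in range(len(scores))
            if i not in selected and costs[i] <= threshold
        ]
        if not affordable:
            continue  # nothing fits this step; its allowance rolls over
        # Among affordable candidates, take the most uncertain one.
        best = max(affordable, key=lambda i: scores[i])
        selected.append(best)
        remaining -= costs[best]
    return selected


scores = [0.9, 0.8, 0.5, 0.4]   # hypothetical uncertainty scores
costs = [6.0, 5.0, 3.0, 1.0]    # hypothetical annotation costs
print(dynamic_threshold_batch(scores, costs, c_max=12.0, n_max=3))  # [2, 3, 0]
```

In this toy run the threshold starts at 4.0, rises to 4.5, and then to 8.0 as cheap picks leave budget unspent, so the most uncertain but expensive candidate only becomes eligible at the last step. This is the kind of sensitivity to n_max that the reviewer's requested ablation would probe.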

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper addresses the practical challenge of batch Bayesian active learning under constraints on both batch size and cost. The two real-world datasets introduced offer valuable resources for evaluating active learning solutions with geolocated points.

Weaknesses

The paper would benefit from a clearer and more detailed description of the method and from stronger technical contributions. 1. While the objective is to achieve high accuracy on the test set, the test set itself is not used in the solution. If the goal is to improve test set accuracy, why not consider incorporating mutual information with the classification of points in the test set, similar to the approach in the well-known work by Andreas Krause, Near-Optimal Sensor P
