Exploring Imbalanced Annotations for Effective In-Context Learning

Hongfu Gao; Feipeng Zhang; Hao Zeng; Deyu Meng; Bingyi Jing; Hongxin Wei

arXiv:2502.04037·cs.CL·June 2, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how class imbalance in datasets affects in-context learning with large language models and proposes a reweighting method called RCB to improve performance by addressing class and conditional biases.

Contribution

The paper introduces RCB, a novel reweighting approach that accounts for class and conditional biases, significantly enhancing ICL performance under imbalanced data conditions.

Findings

01

RCB improves average accuracy by up to 5.42%

02

Class imbalance degrades ICL performance across tasks

03

Class rebalancing alone is insufficient without considering conditional bias

Abstract

Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which relies heavily on the demonstrations selected from annotated datasets. However, these datasets often exhibit long-tailed class distributions in real-world scenarios, leading to biased demonstration selection. In this work, we show that such class imbalances significantly degrade ICL performance across various tasks, regardless of the selection method. Moreover, classical rebalancing methods, which focus solely on class weights, yield poor performance because they neglect conditional bias, i.e., skewed feature distributions within classes. To address this, we propose Reweighting with Conditional Bias (dubbed RCB), a simple and complementary approach to enhance ICL performance under class imbalance. In particular, RCB estimates conditional bias using a balanced subset and…
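
To make the class-weight idea in the abstract concrete, here is a minimal Python sketch of reweighting demonstration-selection scores so that tail-class examples are not crowded out of the prompt. It is not the authors' RCB procedure, whose details are truncated above. The function names `class_weights` and `select_demonstrations`, the inverse-frequency weighting, and the optional per-class `bias` factor are all assumptions made for illustration.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weight per class, normalised so the weights average to 1.

    Assumption: plain inverse-frequency weighting stands in for the
    'classical rebalancing' the abstract mentions.
    """
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


def select_demonstrations(scores, labels, k, bias=None):
    """Return the indices of the k pool examples with the highest adjusted score.

    scores: relevance of each pool example to the test query (any selector).
    labels: class label of each pool example.
    bias:   hypothetical per-class correction factor (defaults to 1.0);
            a placeholder for a conditional-bias term, not the paper's estimator.
    """
    weights = class_weights(labels)
    bias = bias or {}
    adjusted = [s * weights[y] * bias.get(y, 1.0) for s, y in zip(scores, labels)]
    order = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy pool: 8 head-class examples, 2 tail-class examples.
    labels = ["head"] * 8 + ["tail"] * 2
    scores = [0.9] * 8 + [0.5] * 2
    print(select_demonstrations(scores, labels, k=2))
```

In this toy pool, raw scores alone would select only head-class examples, while the reweighted scores let the two tail-class examples through. Whether such weighting helps in practice depends on the selector and the data, which is the question the paper studies.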

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clear empirical evidence that pool-level class imbalance hurts ICL across selection methods and that common calibration schemes (CC, DC, Var-IC) fail or even neutralize advanced selectors; figures and tables isolate the effect across models and imbalance ratios.
- Simple, method-agnostic reweighting that plugs into existing selectors without model fine-tuning and only requires model outputs; computational overhead is modest and often lower than calibration baselines in reported settings.
- S

Weaknesses

- Dependence on building a balanced subset $\mathcal{D}_b$ from tail classes can be impractical when tails are extremely scarce, which the paper acknowledges but does not resolve with alternatives (e.g., generative augmentation, semi-supervised density estimation).
- Importance-factor estimation uses Bayesian optimization on $\mathcal{D}_b$ with heuristic initialization; sensitivity to surrogate choices, search budget, noise in evaluation, and cross-dataset transferability of learned weights is
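
The scarcity concern about $\mathcal{D}_b$ can be illustrated with a small Python sketch. It assumes the balanced subset is built by drawing the same number of examples from every class, capped by the rarest class. That construction, the function name `balanced_subset`, and its parameters are assumptions for illustration; the paper's actual procedure is not described on this page.

```python
import random
from collections import defaultdict


def balanced_subset(examples, per_class=None, seed=0):
    """Draw the same number of examples from every class.

    examples:  list of (text, label) pairs from the annotated pool.
    per_class: requested examples per class; capped by the rarest class.
    Assumption: uniform sampling without replacement within each class.
    """
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append((text, label))

    smallest = min(len(items) for items in by_class.values())
    n = smallest if per_class is None else min(per_class, smallest)

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_class):
        subset.extend(rng.sample(by_class[label], n))
    return subset


if __name__ == "__main__":
    # Long-tailed toy pool: 100 head-class examples, only 2 tail-class examples.
    pool = [(f"head {i}", "head") for i in range(100)]
    pool += [(f"tail {i}", "tail") for i in range(2)]
    subset = balanced_subset(pool, per_class=20)
    print(len(subset))  # the rarest class limits the subset to 2 per class
```

With only two tail examples, the balanced subset shrinks to four items regardless of how many were requested, so any quantity estimated on it rests on very little data.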

Reviewer 02 · Rating 2 · Confidence 5

Strengths

S1: The paper is well-executed experimentally, with comprehensive evaluations across multiple models (OPT, LLaMA, ChatGPT, Gemini), datasets, and imbalance ratios. The experimental protocol is solid and well explained. The explanation and figures are straightforward to understand.
S2: The finding that a larger dataset doesn't automatically address the imbalance issue is useful.

Weaknesses

W1: This is my main concern. The problem and solution seem expected and unsurprising. It isn't surprising that class imbalance harms ICL performance. The solution of using a balanced subset to calculate a weighting factor is also hardly novel or particularly interesting. It is unclear why the balanced set is not simply used for ICL directly.
W2: The novelty seems limited. Importance sampling and reweighting are well-known techniques for addressing class imbalance.
W3: The improvements are incrementa

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* Clarity and Presentation: The paper is well-written and clearly structured. The motivation, methodology, and results are presented in an organized and easy-to-follow manner.
* Comprehensive Evaluation: The experiments are extensive, covering multiple datasets, imbalance ratios, and both open-weight and API-based models.
* Simplicity of the Proposed Method: RIF is straightforward to implement and integrates easily with existing demonstration selection methods.

Weaknesses

* Lack of Novelty: The core idea, reweighting samples based on estimated importance factors, is not fundamentally new and aligns closely with well-known class balancing and importance sampling techniques. The contribution seems more incremental than conceptual.
* Limited Theoretical Depth: The theoretical justification is fairly standard, and the paper mostly reiterates known results in imbalanced learning, adapted to the ICL setting.
* Modest Empirical Gains: While improvements are consistent, the

Taxonomy

Topics: Imbalanced Data Classification Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Focus