No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success
Lukas Rauch, Moritz Wirth, Denis Huseljic, Marek Herde, Bernhard Sick, Matthias Aßenmacher

TL;DR
This paper demonstrates that the success of active learning query strategies depends heavily on the quality of LLM embeddings, so strategy selection must account for both embedding quality and task context.
Contribution
It systematically evaluates how LLM embedding quality influences active learning strategies across multiple tasks, providing new insights into strategy robustness and effectiveness.
Findings
High-quality embeddings improve early active learning performance.
Strategy effectiveness varies with embedding quality and task.
The BADGE strategy is robust across diverse tasks.
Abstract
The advent of large language models (LLMs) capable of producing general-purpose representations lets us revisit the practicality of deep active learning (AL): By leveraging frozen LLM embeddings, we can mitigate the computational costs of iteratively fine-tuning large backbones. This study establishes a benchmark and systematically investigates the influence of LLM embedding quality on query strategies in deep AL. We employ five top-performing models from the massive text embedding benchmark (MTEB) leaderboard and two baselines for ten diverse text classification tasks. Our findings reveal key insights: First, initializing the labeled pool using diversity-based sampling synergizes with high-quality embeddings, boosting performance in early AL iterations. Second, the choice of the optimal query strategy is sensitive to embedding quality. While the computationally inexpensive Margin…
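The margin-based query named in the abstract can be sketched as follows. This is a minimal illustration of standard margin sampling, not the authors' implementation. It assumes a lightweight classifier has already been trained on the frozen embeddings and has produced class probabilities for the unlabeled pool; `margin_query` and `pool_probs` are hypothetical names introduced here.

```python
import numpy as np


def margin_query(pool_probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Pick the pool samples whose top-2 class probabilities are closest.

    pool_probs: array of shape (n_pool, n_classes), n_classes >= 2,
                holding predicted class probabilities (assumed input).
    Returns the indices of the `batch_size` most ambiguous samples.
    """
    # Move the two largest probabilities of each row into the last two columns.
    top2 = np.partition(pool_probs, -2, axis=1)[:, -2:]
    # Margin = best probability minus second-best probability.
    margins = top2[:, 1] - top2[:, 0]
    # A small margin means the classifier is undecided between two classes.
    return np.argsort(margins, kind="stable")[:batch_size]


# Toy pool of four samples and three classes (made-up numbers).
pool_probs = np.array([
    [0.50, 0.40, 0.10],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.60, 0.30, 0.10],
])
selected = margin_query(pool_probs, batch_size=2)
```

Because the embeddings stay frozen, each active learning round only needs to refit the small classifier and recompute these margins, which is what keeps this kind of strategy cheap.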
Peer Reviews
Decision·Submitted to ICLR 2026
1. This paper studies active learning in a setting where the success of query strategies is dictated by the quality of frozen LLM embeddings, which is a practical topic in deep active learning.
2. The paper is well presented and easy to follow.
3. The experimental section is detailed and supports a comprehensive empirical discussion.
1. One concern is about the BADGE setting. The experiments use training and test sets without a separate validation set. The BADGE sampling strategy, which is central to the proposed method, is performed on the test dataset. This may create a data leakage problem, as the BADGE selection mechanism may be indirectly optimizing on test data. However, the paper does not acknowledge or discuss this potential concern. If it does stand as an issue, authors are suggested to provide an
This work gives a fresh perspective by revisiting active learning in the context of modern LLM-based representations and asking whether long-standing assumptions still hold. The authors run a well-controlled set of experiments across tasks, embedding models, and query strategies. The writing is clear and easy to follow. This paper offers useful insights for researchers and practitioners working with data-efficient learning in the LLM era.
1. The evaluation focuses only on text classification. Prior work shows that AL behavior varies across tasks such as NER and QA, where uncertainty signals and data structure differ. Extending to at least one structured prediction or generative task is important.
2. While the paper convincingly shows that embedding quality affects the performance of AL strategies, the analysis remains insufficient. The paper should discuss which embedding properties (e.g., cluster tightness, inter-class margin structure) drive t
- The benchmark is technically solid and clearly described. Using frozen embeddings and a fixed classifier effectively isolates the effect of embedding quality on AL.
- The experimental design covers a reasonable space of embedders and strategies.
- The framework could be useful for future research on deep AL pipelines with LLM features.
- I find the paper lacking in conceptual novelty. The central takeaways, that better embeddings help AL and that diversity in initial sampling complements uncertainty-based querying, are intuitive and have been reported before.
- Too few IPS strategies are tested. The results hinge heavily on IPS, yet only three methods (Random, CoreSet, TypiClust) are tested. Since CoreSet performs worse than random and TypiClust is the only one that helps, the conclusions around IPS feel narrow.
- The benchmark use
Taxonomy
Topics: Machine Learning and Algorithms · Online Learning and Analytics
