Enabling Fine-Grained Operating Points for Black-Box LLMs

Ege Beyazit; KL Navaneet; Prashant Mathur; Roi Blanco; Vidit Bansal; Karim Bouyarmane

arXiv:2510.17727·cs.LG·October 22, 2025

TL;DR

This paper explores methods to enhance the operational control of black-box LLMs as classifiers, enabling finer decision thresholds without sacrificing performance, by analyzing their output biases and proposing efficient solutions.

Contribution

It introduces novel approaches to significantly increase the diversity of operating points for black-box LLM classifiers without performance loss.

Findings

01

Proposed methods achieve finer-grained control over LLM decision thresholds.

02

Approaches outperform benchmark methods across multiple datasets and models.

03

Analysis reveals biases towards rounded verbalized probabilities in LLM outputs.

Abstract

Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior. In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded…
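To make the cardinality problem concrete, the following minimal sketch counts the distinct (precision, recall) operating points reachable by thresholding a score list. The function name `operating_points` and the toy scores and labels are illustrative assumptions, not data or code from the paper. It shows only the limitation the abstract describes, not the authors' proposed methods.

```python
from typing import List, Tuple


def operating_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return the distinct (precision, recall) pairs reachable by thresholding scores."""
    points = set()
    # Only thresholds at observed score values can change the predictions.
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.add((round(precision, 6), round(recall, 6)))
    return sorted(points)


# Hypothetical toy data: the same ranking, once rounded and once fine-grained.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
coarse = [0.9, 0.9, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2]  # rounded verbalized probabilities
fine = [0.93, 0.88, 0.84, 0.81, 0.77, 0.24, 0.19, 0.15]  # one distinct score per example

print(len(operating_points(coarse, labels)))  # 3
print(len(operating_points(fine, labels)))    # 8
```

With only three distinct score values, at most three operating points exist, so a target such as precision of at least 95% can only be met by jumping to whichever coarse threshold happens to satisfy it.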

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- Paper demonstrates empathy with practitioners who wish to construct PR/ROC charts from non-smooth data.

Weaknesses

- It’s well known in the community that LLMs’ verbalized confidence estimates emphasize round numbers, mimicking everyday speech by humans. In my opinion, this is neither a mystery nor a surprise.
- LLMs under evaluation are somewhat dated (Claude 3? I haven’t heard this name in a long time…).
- The proposed methodology injects randomness so that verbalized confidences present the appearance of being more granular without actually being more informative. This approach doesn’t address the fundame

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper addresses a novel and underexplored limitation of black-box LLMs — the low cardinality of their verbalized probability outputs — and provides a systematic characterization of this phenomenon across multiple datasets and models.
2. The experimental setup is comprehensive: 11 binary classification datasets and 3 commercial LLMs (Claude, Nova, Qwen), with extensive comparisons against sampling-based uncertainty estimation and confidence elicitation baselines.
3. The results demonstrate

Weaknesses

1. The empirical gains, especially for the upper plot of Figure 4, remain modest, with noticeable variance across splits. The improvement is clearer when aggregating data, but still with high variance.
2. The MLP-based correction module, though simple and effective, operates solely on the verbalized probabilities without leveraging any input-conditional information. A deeper integration with semantic or contextual features might strengthen the generalization argument.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper provides a clear and formal definition of Operational Granularity. Under the assumption that $\hat{y}_{i}^{\text{vrb}}$ approximates $p(y_i = 1 \mid x_i)$, the authors present intuitive formulations and well-structured objectives for three proposed methods.

Weaknesses

1. The motivation of the paper is not fully convincing. Since the work focuses on improving *operational granularity* rather than predictive performance, the authors should provide concrete examples or application scenarios where finer operational granularity is crucial — for instance, situations where small changes in decision thresholds have significant real-world impact.
2. The experiments are insufficient to support the paper’s motivation. (a) It remains unclear whether $\hat{y}_{i}^{vrb}$

Reviewer 04 · Rating 4 · Confidence 3

Strengths

- Enabling fine-grained control of operating points for black-box LLMs is an important desideratum, and has meaningful ramifications in high-stakes decision-making scenarios such as medical treatment.
- The paper is mostly well-written, and the authors have gone to lengths in providing certain definitions (e.g. PR and ROC). The (extended) literature review is also well-done with many useful references.
- The logical flow of the paper makes sense (i.e. EDA / identifying problems -> presenting hypoth

Weaknesses

- The datasets studied in this paper are sourced from well-established benchmarks (e.g. SST-2, BoolQ) which may be contained in the tested models' training set. This may lead to qualitatively different analyses compared to the high-stakes decision-making scenarios that the authors are targeting (e.g. medical treatment). In the paper's experiments, models can be confident in their verbalized probabilities compared to the true zero-shot, black-box access scenario that the authors are targeting. Whi

Taxonomy

Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education