No One Size Fits All: QueryBandits for Hallucination Mitigation

Nicole Cho; William Watson; Alec Koppel; Sumitra Ganesh; Manuela Veloso

arXiv:2602.20332·cs.CL·February 25, 2026


Open Access 3 Reviews

TL;DR

This paper introduces QueryBandits, an adaptive, model-agnostic framework using contextual bandits to dynamically select query-rewrite strategies, significantly reducing hallucinations in large language models without retraining.

Contribution

We propose QueryBandits, a novel online learning approach that optimally chooses query-rewrite strategies for hallucination mitigation in closed-source LLMs, outperforming static policies.

Findings

01

The top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over the No-Rewrite baseline across 16 QA scenarios.

02

It outperforms the zero-shot static policies Paraphrase and Expand by 42.6% and 60.3%, respectively.

03

No single rewrite policy is optimal for all queries.

Abstract

Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all…
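To make the idea of a contextual bandit choosing a rewrite strategy concrete, the following is a minimal sketch of linear Thompson Sampling over rewrite arms. It is not the authors' implementation. The arm list (only No-Rewrite, Paraphrase, and Expand are named in the abstract), the two-dimensional context `[bias, flag]`, the `TRUE_THETA` reward model, and all function names are assumptions made for illustration; in particular, the synthetic reward stands in for the paper's calibrated reward function, which is not described on this page.

```python
import math
import random

# Arm names: No-Rewrite, Paraphrase and Expand appear in the abstract; the
# paper's full arm set is not listed here, so this trio is an assumption.
ARMS = ["no_rewrite", "paraphrase", "expand"]

# Synthetic stand-in for the reward (NOT the paper's calibrated reward).
# Context is [bias, flag], where flag is a hypothetical binary query feature.
TRUE_THETA = {
    "no_rewrite": [0.3, 0.0],
    "paraphrase": [0.8, -0.6],  # best arm when flag == 0
    "expand": [0.2, 0.6],       # best arm when flag == 1
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def forward_solve(low, b):
    """Solve L y = b for lower-triangular L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    return y


def backward_solve(low, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


class LinearThompsonArm:
    """Bayesian linear regression for one arm with posterior N(A^-1 b, A^-1)."""

    def __init__(self, dim):
        # Identity precision matrix encodes a standard normal prior on theta.
        self.A = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def posterior_mean(self):
        low = cholesky(self.A)
        return backward_solve(low, forward_solve(low, self.b))

    def sample_theta(self, rng):
        low = cholesky(self.A)
        mean = backward_solve(low, forward_solve(low, self.b))
        z = [rng.gauss(0.0, 1.0) for _ in self.b]
        noise = backward_solve(low, z)  # L^-T z has covariance A^-1
        return [m + e for m, e in zip(mean, noise)]

    def update(self, x, reward):
        # Rank-one update: A += x x^T, b += reward * x.
        for i, xi in enumerate(x):
            self.b[i] += reward * xi
            for j, xj in enumerate(x):
                self.A[i][j] += xi * xj


def run_simulation(rounds=1000, seed=0):
    rng = random.Random(seed)
    arms = {name: LinearThompsonArm(dim=2) for name in ARMS}
    for _ in range(rounds):
        x = [1.0, float(rng.randint(0, 1))]  # observe the query context
        # Thompson Sampling: draw one theta per arm, play the best draw.
        chosen = max(ARMS, key=lambda name: dot(arms[name].sample_theta(rng), x))
        reward = dot(TRUE_THETA[chosen], x) + rng.gauss(0.0, 0.1)
        arms[chosen].update(x, reward)  # only the played arm is updated
    return arms


def greedy_arm(arms, x):
    """Arm with the highest posterior-mean reward for context x."""
    return max(ARMS, key=lambda name: dot(arms[name].posterior_mean(), x))


if __name__ == "__main__":
    trained = run_simulation()
    for flag in (0.0, 1.0):
        print(flag, greedy_arm(trained, [1.0, flag]))
```

In this toy setup the best arm depends on the context flag, so no fixed rewrite is best for every query, which is the situation the "no one size fits all" framing describes. A context-free bandit would have to settle on a single arm, whereas the contextual learner can pick a different rewrite per query.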

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The proposed method can be deployed on black-box LLMs with high practicality. It is completely based on query rewriting and online selection and does not rely on access to model parameters or weights.
2. Experimental validation is extensive. The authors verify both the rationality of the modeling strategy and the effectiveness of the proposed method across 16 different scenarios.

Weaknesses

1. The theoretical basis for directly associating bandit learning with hallucination mitigation is insufficient. The paper models query rewrite selection as a contextual multi-armed bandit with composite rewards to drive online learning, but lacks proof of formal links between optimal policy existence, convergence, and the minimization of hallucination rates. The relevant arguments are mainly empirical reward separability and AUC tests, not a formal connection to LLM hallucinations.
2. Contribut

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Conceptual originality: The contextual bandit framing of query rewriting is novel and elegant.
2. Practical relevance: QueryBandits is a plug-and-play method requiring only black-box access, directly addressing a key challenge in closed-model hallucination mitigation.
3. Sound methodology: Reward calibration and validation against human labels show strong discriminative reliability.
4. Interpretability: Per-feature regression analysis (Fig. 5) provides rare interpretability, showing which lin

Weaknesses

1. Limited baseline coverage: While the paper’s claim of being model-agnostic is methodologically valid—since the proposed framework does not rely on gradient access or internal model parameters—it lacks empirical comparisons with other strong model-agnostic hallucination mitigation baselines such as Self-Refine (Madaan et al., 2023) and RAG-based rewriting approaches (e.g., Rewrite-Retrieve-Read, Ma et al., 2023). Including, or at least discussing, approximate results from these paradigms would

Reviewer 03 · Rating 4 · Confidence 3

Strengths

[S1] The problem of hallucination mitigation is very relevant.
[S2] The framing as contextual bandit seems original and appears to deliver gains.
[S3] The method is applicable to closed-weight models, which is an advantage.

Weaknesses

[W1] There is no analysis of the computational cost. How many forward passes/generated tokens are required to generate the vector of linguistic features for the query? The cost/effectiveness trade off should be discussed.
[W2] The comparison with open-weight baselines (Table 5) should use stronger and more recent models. What would be the results with Llama 3.1 405B Instruct? Deltas should be given relative to the no-rewrite condition with the same model, not relative to a much larger model.
[

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Topic Modeling