No One Size Fits All: QueryBandits for Hallucination Mitigation
Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso

TL;DR
This paper introduces QueryBandits, an adaptive, model-agnostic framework using contextual bandits to dynamically select query-rewrite strategies, significantly reducing hallucinations in large language models without retraining.
Contribution
We propose QueryBandits, a novel online learning approach that optimally chooses query-rewrite strategies for hallucination mitigation in closed-source LLMs, outperforming static policies.
Findings
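The top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline across 16 QA scenarios.
It outperforms the zero-shot static Paraphrase and Expand policies by 42.6% and 60.3%, respectively.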
No single rewrite policy is optimal for all queries.
Abstract
Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all…
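To make the core idea in the abstract concrete, the following is a minimal sketch of Thompson Sampling choosing among query-rewrite strategies per context. It is not the authors' implementation. It assumes a small set of discrete query types in place of the paper's feature-based context, a binary reward (1 = answer judged correct, 0 = hallucinated) in place of the paper's calibrated reward function, and invented success probabilities in a toy environment. The names `TabularThompsonSampler`, `simulate`, and the context labels are hypothetical; only the Paraphrase, Expand, and No-Rewrite strategies come from the abstract.

```python
import random
from collections import defaultdict

# Arm names are illustrative; Paraphrase, Expand and No-Rewrite appear in the abstract.
ARMS = ["no_rewrite", "paraphrase", "expand"]


class TabularThompsonSampler:
    """Beta-Bernoulli Thompson Sampling with one posterior per (context, arm)."""

    def __init__(self, arms, seed=0):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # Beta(1, 1) prior stored as [alpha, beta] for every (context, arm) pair.
        self.posterior = defaultdict(lambda: [1.0, 1.0])

    def select(self, context):
        # Sample one plausible success rate per arm and play the highest draw.
        draws = {
            arm: self.rng.betavariate(*self.posterior[(context, arm)])
            for arm in self.arms
        }
        return max(draws, key=draws.get)

    def update(self, context, arm, reward):
        # Assumption: reward is binary (1 = correct answer, 0 = hallucination).
        params = self.posterior[(context, arm)]
        params[0] += reward
        params[1] += 1 - reward


def simulate(rounds=3000, seed=0):
    # Toy environment with invented numbers: the best rewrite differs by query type.
    success_prob = {
        "ambiguous": {"no_rewrite": 0.3, "paraphrase": 0.5, "expand": 0.8},
        "well_formed": {"no_rewrite": 0.8, "paraphrase": 0.6, "expand": 0.4},
    }
    env_rng = random.Random(seed + 1)
    bandit = TabularThompsonSampler(ARMS, seed=seed)
    picks = defaultdict(lambda: defaultdict(int))
    for _ in range(rounds):
        context = env_rng.choice(list(success_prob))
        arm = bandit.select(context)
        reward = 1 if env_rng.random() < success_prob[context][arm] else 0
        bandit.update(context, arm, reward)
        picks[context][arm] += 1
    # Report the most frequently chosen arm for each context.
    return {ctx: max(counts, key=counts.get) for ctx, counts in picks.items()}


if __name__ == "__main__":
    print(simulate())
```

In this toy setting the sampler settles on a different arm for each query type, which illustrates the finding that no single rewrite policy is best for all queries. A static policy would be forced to use one arm everywhere.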
Peer Reviews
Decision · Submitted to ICLR 2026
1. The proposed method can be deployed on black-box LLMs with high practicality. It is completely based on query rewriting and online selection and does not rely on access to model parameters or weights. 2. Experimental validation is extensive. The authors verify both the rationality of the modeling strategy and the effectiveness of the proposed method across 16 different scenarios.
1. The theoretical basis for directly associating bandit learning with hallucination mitigation is insufficient. The paper models query rewrite selection as a contextual multi-armed bandit with composite rewards to drive online learning, but lacks proof of formal links between optimal policy existence, convergence, and the minimization of hallucination rates. The relevant arguments are mainly empirical reward separability and AUC tests, not a formal connection to LLM hallucinations. 2. Contribut
1. Conceptual originality: The contextual bandit framing of query rewriting is novel and elegant. 2. Practical relevance: QueryBandits is a plug-and-play method requiring only black-box access, directly addressing a key challenge in closed-model hallucination mitigation. 3. Sound methodology: Reward calibration and validation against human labels show strong discriminative reliability. 4. Interpretability: Per-feature regression analysis (Fig. 5) provides rare interpretability, showing which lin
1. Limited baseline coverage: While the paper’s claim of being model-agnostic is methodologically valid—since the proposed framework does not rely on gradient access or internal model parameters—it lacks empirical comparisons with other strong model-agnostic hallucination mitigation baselines such as Self-Refine (Madaan et al., 2023) and RAG-based rewriting approaches (e.g., Rewrite-Retrieve-Read, Ma et al., 2023). Including, or at least discussing, approximate results from these paradigms would
[S1] The problem of hallucination mitigation is very relevant. [S2] The framing as contextual bandit seems original and appears to deliver gains. [S3] The method is applicable to closed-weight models, which is an advantage.
[W1] There is no analysis of the computational cost. How many forward passes or generated tokens are required to produce the vector of linguistic features for the query? The cost-effectiveness trade-off should be discussed. [W2] The comparison with open-weight baselines (Table 5) should use stronger and more recent models. What would the results be with Llama 3.1 405B Instruct? Deltas should be given relative to the no-rewrite condition with the same model, not relative to a much larger model. [
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Topic Modeling
