When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning
Yann Berthelot, Philippe Preux, Riad Akrour

TL;DR
This paper compares query-time expert-guided reinforcement learning methods on a shared benchmark. It reveals three failure modes, proposes a decision rule based on pre-training observables, and introduces EDGE as a demonstration of exploitability.
Contribution
It provides a unified benchmark, taxonomy, and decision rule for query-time RL with imperfect experts, addressing gaps in prior isolated evaluations.
Findings
Three failure modes are identified: a critic blind spot (F1), residual saturation (F2), and warm-start buffer poisoning (F3).
No single method dominates across all regimes; performance varies with expert quality.
A testable decision rule based on pre-training observables is proposed.
Abstract
Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses…
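The three degradation axes named in the abstract (expert undertuning, action bias, observation noise) can be illustrated with a small wrapper around a queryable controller. This is a minimal sketch, not the paper's protocol: the PD controller `pd_expert`, its gains, the `[-1, 1]` action bounds, the `degrade_expert` helper, and all parameter names are hypothetical, and undertuning is modeled here simply as scaling the controller gain.

```python
import numpy as np


def pd_expert(obs, gain=1.0):
    """Hypothetical PD controller on a (position, velocity) observation."""
    pos, vel = obs
    return np.array([-gain * (2.0 * pos + 0.5 * vel)])


def degrade_expert(expert, gain_scale=1.0, action_bias=0.0,
                   obs_noise_std=0.0, rng=None):
    """Wrap an expert with three illustrative degradations.

    gain_scale    < 1 models undertuning (assumption: gain scaling).
    action_bias   is a constant offset added to the expert's action.
    obs_noise_std is the std of Gaussian noise added to what the expert sees.
    """
    rng = rng if rng is not None else np.random.default_rng(0)

    def degraded(obs):
        obs = np.asarray(obs, dtype=float)
        # The expert observes a noisy copy of the state.
        noisy_obs = obs + rng.normal(0.0, obs_noise_std, size=obs.shape)
        # Undertuned gain, then a constant bias on the output.
        action = expert(noisy_obs, gain=gain_scale) + action_bias
        # Assumed action bounds of [-1, 1].
        return np.clip(action, -1.0, 1.0)

    return degraded


# A hypothetical sweep over one axis at a time.
obs = [0.1, 0.0]
clean = degrade_expert(pd_expert)
undertuned = degrade_expert(pd_expert, gain_scale=0.5)
biased = degrade_expert(pd_expert, action_bias=0.3)
noisy = degrade_expert(pd_expert, obs_noise_std=0.05)
```

Each wrapped expert keeps the same call signature as the original, so any method that queries the expert at decision time can be evaluated against the degraded versions without further changes.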
