Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
Renfei Zhang, Manasa Kaniselvan, Niloofar Mireshghallah

TL;DR
Reinforcement learning improves how language models traverse hierarchical knowledge, raising recall without degrading factual knowledge, by refining the procedural skills used to navigate knowledge already stored in the model's parameters.
Contribution
This work demonstrates that RL improves hierarchical traversal abilities in LLMs, challenging the belief that it harms memorized knowledge, and shows that structured prompting can recover most of the performance gap between SFT and RL models.
Findings
RL models outperform base and SFT models on knowledge recall tasks.
Structured prompting recovers most of the performance gap in hierarchical traversal.
RL primarily improves procedural navigation rather than factual knowledge representations.
Abstract
Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves…
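The structured-prompting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' template: the function names, the prompt wording, and the level names ("chapter", "category", "subcategory") are hypothetical, chosen only to show what explicitly guiding a model through a hierarchy might look like. The second helper works through the arithmetic behind the reported gap reduction from 24pp to 7pp.

```python
from typing import List


def build_traversal_prompt(question: str, levels: List[str]) -> str:
    """Assemble a prompt asking the model to descend a hierarchy level by level.

    The wording and the level names are illustrative assumptions, not the
    prompt templates used in the paper.
    """
    steps = [
        f"{i}. Identify the {level} that applies."
        for i, level in enumerate(levels, start=1)
    ]
    return "\n".join(
        [
            "Answer by walking the hierarchy from the top down.",
            *steps,
            f"{len(levels) + 1}. State the final answer.",
            "",
            f"Question: {question}",
        ]
    )


def gap_recovered(gap_before_pp: float, gap_after_pp: float) -> float:
    """Fraction of an accuracy gap (in percentage points) that was closed."""
    return (gap_before_pp - gap_after_pp) / gap_before_pp


if __name__ == "__main__":
    # Hypothetical three-level hierarchy and a placeholder question.
    levels = ["chapter", "category", "subcategory"]
    print(build_traversal_prompt("What does the code <CODE> describe?", levels))
    # Gap figures from the abstract: 24pp before, 7pp after structured prompting.
    print(f"{gap_recovered(24, 7):.0%}")  # prints "71%"
```

Under these assumptions, closing the gap from 24pp to 7pp corresponds to recovering roughly 71% of it, which is consistent with the abstract's phrase "most of the performance gap".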
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Thorough literature contextualization of the various results and scoping of the problem is appreciated. 2. Creative writing and interpretation of the results. Well written and well motivated from a narrative point of view. 3. A large breadth of models is evaluated on two medical information-retrieval settings and one gradation-based multi-hop retrieval setting built on one of those two datasets.
1. Not enough novelty and depth of experimentation and insight for the level of ICLR. The framing and intro set expectations that the later sections of the paper, with the empirical results, don't live up to; these could be developed significantly further. For instance, hierarchical retrieval/navigation of these LLMs, a hypothesis of how they work, isn't mechanistically or phenomenologically studied or validated. 2. The experimental scope is also constrained with respect to the setting: info retrieval in t
1. The idea that RL helps LLMs improve factual recall by efficiently traversing hierarchical structure in the data to recall relevant information at inference time is interesting and novel. 2. The emphasis on hierarchical structures maps well to biomedicine/genomics taxonomies. This could be relevant in practical scenarios. 3. The paper is easy to follow, with detailed explanations of prompt templates and example outputs. 4. The paper raises another practical takeaway that prompting can substitut
1. Hierarchical recall is not unique to medicine. Many other benchmarks also require hierarchical traversal (e.g., product/category taxonomies, legal codes.) The paper should broaden coverage. 2. While the medical angle is well-motivated, the paper doesn’t clearly articulate how medical hierarchies differ from other hierarchical datasets in ways that specifically stress the hypothesized navigation skill. 3. A qualitative analysis of failure modes for the distill models is suggested since the pap
1. The reframing of RL's benefits from enhanced reasoning to improved knowledge navigation is conceptually novel. 2. Within the current scope, the experimental design is comprehensive across multiple dimensions: evaluation spans several model families (DeepSeek, Qwen, Mistral, Llama) at various scales (7B-235B), includes three distinct model enhancement paradigms (instruction-tuning, reasoning-enhancement, distillation), and tests multiple prompt templates across independent runs. 3. If the core
1. The paper's main claim is that RL training enhances knowledge navigation rather than logical reasoning capabilities. However, the main experiments only evaluate on tasks requiring hierarchical knowledge recall (medical codes, patent classifications). Testing exclusively on knowledge recall tasks cannot distinguish whether RL improves: (1) only knowledge navigation, (2) both knowledge navigation and reasoning, or (3) general capability that manifests in recall tasks. In order to support the cl
Clear empirical observation: prompting that explicitly instructs hierarchical traversal narrows large performance gaps otherwise credited to “reasoning” training, which yields strong, readily applicable prompt templates.
The central claim is quite expected and incremental relative to existing literature (e.g., “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training”). The contributions feel too narrow for ICLR. The work consolidates a known theme rather than advancing a new method or theory. Evidence is concentrated on two taxonomy-style benchmarks (MedConceptsQA, IPC). It is unclear whether findings hold on diverse, open-domain knowledge tasks, multi-hop QA with unstructured relat
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
