TL;DR
DAMR introduces an efficient, adaptive framework for knowledge graph question answering that combines LLM-guided MCTS with a lightweight scorer, improving accuracy and reducing computational costs over existing methods.
Contribution
The paper proposes DAMR, a novel MCTS-based reasoning framework with adaptive path evaluation and dynamic pseudo-path refinement for improved KGQA performance.
Findings
DAMR outperforms state-of-the-art methods on multiple benchmarks.
The lightweight Transformer scorer effectively captures semantic shifts during reasoning.
Adaptive path evaluation reduces search space and improves answer accuracy.
Abstract
Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Existing methods primarily follow either the retrieve-then-reason paradigm, which relies on Graph Neural Networks or heuristic rules to extract static candidate paths, or dynamic path generation strategies that employ LLMs with prompting to jointly perform retrieval and reasoning. However, the former lacks adaptability due to static path extraction and the absence of contextual refinement, while the latter suffers from high computational costs and limited evaluation accuracy because of its dependence on fixed scoring functions and repeated LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework…
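The division of labour described in the abstract (a proposer that expands candidate relations, a cheap scorer that evaluates partial paths, and MCTS that ties them together) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph `TOY_KG`, the `propose_relations` stub standing in for the LLM, the token-overlap `score_path` standing in for the Transformer scorer, and all parameter values are assumptions made for illustration only.

```python
import math

# Toy knowledge graph: entity -> list of (relation, neighbour entity).
# Every entity and relation name here is made up for illustration.
TOY_KG = {
    "Q_film": [("directed_by", "P_director"), ("released_in", "Y_2010")],
    "P_director": [("born_in", "C_city"), ("spouse", "P_spouse")],
    "Y_2010": [], "C_city": [], "P_spouse": [],
}

def propose_relations(entity, question):
    """Hypothetical stand-in for LLM-guided expansion: returns candidate edges."""
    return TOY_KG.get(entity, [])

def score_path(question, relations):
    """Hypothetical stand-in for the lightweight scorer: the fraction of
    relations sharing at least one name token with the question."""
    if not relations:
        return 0.0
    q_tokens = set(question.lower().replace("?", "").split())
    hits = sum(1 for r in relations if q_tokens & set(r.split("_")))
    return hits / len(relations)

class Node:
    def __init__(self, entity, relations=(), parent=None):
        self.entity = entity
        self.relations = tuple(relations)  # relation path from the start entity
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0                   # sum of backed-up rewards
        self.expanded = False

def uct(child, parent_visits, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value / child.visits + explore

def search(question, start, iterations=50, max_hops=2):
    root = Node(start)
    for _ in range(iterations):
        node = root
        # Selection: descend through expanded nodes by UCT.
        while node.expanded and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: ask the proposer for candidate relations.
        if not node.expanded and len(node.relations) < max_hops:
            for rel, nxt in propose_relations(node.entity, question):
                node.children.append(Node(nxt, node.relations + (rel,), node))
            node.expanded = True
            if node.children:
                node = node.children[0]
        # Evaluation: score the partial path instead of calling an LLM.
        reward = score_path(question, node.relations)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.relations, node.entity

relations, answer = search("where was the person who directed the film born", "Q_film")
print(relations, answer)
```

In this sketch the proposer is called once per expanded node and the scorer once per iteration, which mirrors the cost argument in the abstract: the expensive component is kept out of the inner evaluation loop.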
Peer Reviews
Decision·ICLR 2026 Poster
1. Dynamic Pseudo-Path Refinement: The method innovates by using high-confidence partial paths from MCTS rollouts to generate pseudo labels for continual fine-tuning, thus improving generalizability and adapting to the non-stationary search space. 2. Rigorous Empirical Validation: Extensive benchmarks across both standard datasets, with direct comparisons to at least 20 strong baselines, are provided in Table 1, consistently showing DAMR outperforming all competitors.
1. It is unclear under what distributional shifts the scorer avoids reinforcing suboptimal trajectories. Since the path scorer is continually adapted with self-generated pseudo-paths, there is a risk of feedback loops or bias accumulation, especially if the LLM suggestions are systematically biased early in training. 2. Scalability and practicality on large KGs are not addressed. All experiments are conducted on localized subgraphs derived from WebQSP and CWQ. The scalability of DAMR for web-scale or
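As a rough illustration of how search statistics could be turned into pseudo labels, the sketch below builds (preferred, rejected) path pairs from per-path visit counts and accumulated values. The function name `make_pseudo_pairs`, the `min_visits` and `margin` thresholds, and the example statistics are all hypothetical; the reviews do not specify the paper's actual selection rule, so this should be read as one plausible scheme rather than the authors' procedure.

```python
from itertools import combinations

def make_pseudo_pairs(stats, min_visits=3, margin=0.2):
    """Build (preferred, rejected) pairs of relation paths.

    stats maps a relation-path tuple to (visits, total_value), as an MCTS
    run might record them. Paths with too few visits are ignored, and a
    pair is emitted only when the mean values differ by at least `margin`.
    Both thresholds are assumptions chosen for illustration.
    """
    means = {
        path: total / visits
        for path, (visits, total) in stats.items()
        if visits >= min_visits
    }
    pairs = []
    for a, b in combinations(sorted(means), 2):
        gap = means[a] - means[b]
        if abs(gap) >= margin:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs

# Made-up search statistics for a single question.
stats = {
    ("directed_by",): (10, 9.0),
    ("released_in",): (4, 0.4),
    ("directed_by", "spouse"): (2, 1.0),   # dropped: fewer than min_visits
    ("directed_by", "born_in"): (8, 7.6),
}

for preferred, rejected in make_pseudo_pairs(stats):
    print(preferred, ">", rejected)
```

The visit and margin filters are simple guards against noisy early statistics; whether guards of this kind are enough to prevent the feedback loops raised in the reviews is an open question that this sketch does not answer.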
1. The overall approach is novel to me, though not entirely new due to extensive existing efforts in this field of research. The authors describe a clear step-wise procedure with sufficient details to show how the approach works. The approach presents as a sound solution to address the identified limitations of the existing approaches. 2. Effectively modularizes reasoning by limiting the LLM's role to an initial, high-leverage search guidance step, significantly reducing computational overhead.
1. The comparative discussion against related work could be strengthened to reveal more details about the rationale of designs in the proposed framework. 2. The selection of baseline methods in Table 2 should be discussed. More specifically, why are the three baseline methods selected in particular for the computational efficiency comparison? 3. A clear mapping between the technical components and the advantages/edge achieved by the proposed framework might be better explained, demanding stud
1. Uses the LLM only for expansion and a small Transformer for evaluation, cutting LLM calls by >50% and tokens by ~75% without hurting accuracy, lowering cost/latency and allowing easy module swaps under resource limits. 2. Employs a cross-attention scorer conditioned on the question to model relation sequences hop-by-hop, capturing semantic buildup and constraints for steadier rankings and better generalization than a general LLM scorer. 3. Converts promising/contrasting partial paths into pai
1. Reporting on 1,000 uniformly sampled test questions rather than full official splits inflates variance and hinders comparability; the absence of confidence intervals, significance tests, and an error taxonomy (e.g., compositional, comparative failures) weakens claims of robustness and external validity. 2. Dynamic pseudo-labeling lacks stability analysis: pair generation driven by search stats is not evaluated for convergence or early-noise sensitivity, and safeguards (confidence margins, tem
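To make the sampling concern concrete, the following sketch computes a normal-approximation 95% confidence interval for an accuracy measured on 1,000 questions. The accuracy value of 0.90 is an arbitrary placeholder, not a number reported in the paper, and the helper name `accuracy_ci` is hypothetical.

```python
import math

def accuracy_ci(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper, half_width). z=1.96 corresponds to roughly
    95% coverage; the approximation is reasonable when n is large and
    the accuracy is not too close to 0 or 1.
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width, half_width

# Placeholder accuracy on a 1,000-question sample (illustrative only).
lower, upper, half_width = accuracy_ci(0.90, 1000)
print(f"95% CI: [{lower:.3f}, {upper:.3f}] (half-width {half_width:.3f})")
```

Under these assumptions the half-width is roughly 1.9 percentage points, so differences between methods that are smaller than this would be difficult to separate from sampling noise without a significance test.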
