$\pi$-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering
Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger, Kilian Q. Weinberger

TL;DR
This paper introduces $\pi$-CoT, a prompting method that uses Prolog-based decomposition of multi-hop questions to improve reasoning accuracy in large language models, especially in retrieval-augmented settings.
Contribution
It proposes a novel Prolog-Initialized Chain-of-Thought prompting strategy that enhances multi-hop reasoning by decomposing questions into logical sub-queries for better model performance.
Findings
$\pi$-CoT outperforms standard RAG and in-context CoT on multiple benchmarks.
The method reduces reasoning errors caused by circular logic.
Sequential sub-query resolution improves multi-hop question-answering accuracy.
Abstract
Chain-of-Thought (CoT) prompting significantly enhances large language models' (LLMs) problem-solving capabilities, but still struggles with complex multi-hop questions, often falling into circular reasoning patterns or deviating from the logical path entirely. This limitation is particularly acute in retrieval-augmented generation (RAG) settings, where obtaining the right context is critical. We introduce Prolog-Initialized Chain-of-Thought ($\pi$-CoT), a novel prompting strategy that combines logic programming's structural rigor with language models' flexibility. $\pi$-CoT reformulates multi-hop questions into Prolog queries decomposed as single-hop sub-queries. These are resolved sequentially, producing intermediate artifacts, with which we initialize the subsequent CoT reasoning procedure. Extensive experiments demonstrate that $\pi$-CoT significantly outperforms standard RAG and…
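To make the decomposition step concrete, here is a minimal Python sketch of sequential sub-query resolution. It is an illustration under stated assumptions, not the authors' implementation: the fact table `FACTS` stands in for retrieval plus LLM answering, and the names `QUERY`, `resolve_sequentially`, and `build_cot_prompt` are hypothetical. The two-hop question is written as a Prolog-style conjunction, each goal is resolved in order, and the resulting notes seed a CoT prompt.

```python
from typing import Dict, List, Tuple

# Toy fact store (assumption): stands in for retrieval + LLM answering,
# which is how a real system would resolve each single-hop sub-query.
FACTS: Dict[Tuple[str, str], str] = {
    ("director", "Inception"): "Christopher Nolan",
    ("spouse", "Christopher Nolan"): "Emma Thomas",
}

# Two-hop question as a Prolog-style conjunction:
#   director("Inception", D), spouse(D, S).
# Each goal is (predicate, input term or variable, output variable).
SubQuery = Tuple[str, str, str]
QUERY: List[SubQuery] = [
    ("director", "Inception", "D"),
    ("spouse", "D", "S"),
]


def resolve_sequentially(
    query: List[SubQuery], facts: Dict[Tuple[str, str], str]
) -> Tuple[Dict[str, str], List[str]]:
    """Resolve goals left to right, binding each output variable in turn."""
    bindings: Dict[str, str] = {}
    notes: List[str] = []  # intermediate artifacts, one per resolved hop
    for predicate, arg, out_var in query:
        # Substitute an earlier binding if the argument is a bound variable.
        subject = bindings.get(arg, arg)
        answer = facts[(predicate, subject)]
        bindings[out_var] = answer
        notes.append(f"{predicate}({subject!r}) -> {answer!r}")
    return bindings, notes


def build_cot_prompt(question: str, notes: List[str]) -> str:
    """Initialize a CoT prompt with the intermediate artifacts (hypothetical format)."""
    lines = [f"Question: {question}", "Known intermediate facts:"]
    lines += [f"- {note}" for note in notes]
    lines.append("Let's think step by step.")
    return "\n".join(lines)


if __name__ == "__main__":
    bindings, notes = resolve_sequentially(QUERY, FACTS)
    print(build_cot_prompt("Who is the spouse of the director of Inception?", notes))
```

Because each hop is answered before the next is posed, the second goal sees a concrete entity rather than an unresolved description, which is the property the abstract credits for keeping the reasoning on its logical path.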
Peer Reviews
Decision·Submitted to ICLR 2026
* The paper is clearly written.
* The method is novel and is a useful way to combine the benefits of a symbolic system (Prolog) with the knowledge and natural language reasoning abilities of LLMs.
* Results are presented with variances and appear mostly significant.
* The method’s dependence on Prolog potentially limits it to a specific set of problems.
* It is not clear what Memento refers to in the tables. The bolding in the tables is also confusing. Together, these make the results very hard to understand.
* Cost/latency is not evaluated.
- The paper presents an interesting approach to addressing the limitations of combining CoT and RAG by incorporating neural-symbolic reasoning through the use of Prolog.
- The paper is clearly written, with well-organised explanations and helpful illustrations.
- The experimental setup appears somewhat arbitrary and selective. For example, in Table 3, the retrieval model differs from that used in Tables 1 and 2. Is there a specific reason for this? It seems that the baselines in Tables 1 and 2 could also be evaluated under the retrieval model of Table 3 for a fairer comparison. Similarly, in Section 5.2 / Table 4, why are the PW-S and PW-M datasets not used in the experiments of Tables 1–3?
- The paper lacks robust analysis regarding the Prolog compon
[S1] Interesting conceptual perspective. The perspective of reliable decomposition with formal methods is interesting. It offers a principled way to constrain reasoning trajectories while keeping the subtasks manageable for LLMs during multi-hop inference.
[S2] Writing quality is good. The paper is in general well-written and easy to follow. It also appropriately situates itself among prior works (e.g., IRCoT, Self-Ask, GraphRAG, HippoRAG 2).
[S3] Comprehensive evaluation. The evaluation is done o
[W1] Improvement can be inconsistent across datasets. On real-world datasets (e.g., HotpotQA, MuSiQue), $\pi$-CoT performs comparably to baselines, with statistical significance only on certain datasets such as 2WikiMultiHopQA. The claimed “significant outperforming” does not hold uniformly. It would also be good to make clear that only prompting-based methods are compared in the table, as the latest SOTA on these datasets is much higher. E.g., finetuned approaches on HotpotQA are ~10% highe
Taxonomy
Topics: Narrative Theory and Analysis
Methods: Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART
