$\pi$-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering

Chao Wan; Albert Gong; Mihir Mishra; Carl-Leander Henneking; Claas Beger; Kilian Q. Weinberger

arXiv:2506.20642·cs.CL·February 20, 2026

TL;DR

This paper introduces $\pi$-CoT, a prompting method that uses Prolog-based decomposition of multi-hop questions to improve reasoning accuracy in large language models, especially in retrieval-augmented settings.

Contribution

The paper proposes Prolog-Initialized Chain-of-Thought ($\pi$-CoT) prompting, a strategy that decomposes multi-hop questions into logical single-hop sub-queries and uses their answers to initialize the model's chain-of-thought reasoning.

Findings

01

$\pi$-CoT outperforms standard RAG and in-context CoT on multiple benchmarks.

02

The method reduces reasoning errors caused by circular logic.

03

Sequential sub-query resolution improves multi-hop question-answering accuracy.

Abstract

Chain-of-Thought (CoT) prompting significantly enhances large language models' (LLMs) problem-solving capabilities, but still struggles with complex multi-hop questions, often falling into circular reasoning patterns or deviating from the logical path entirely. This limitation is particularly acute in retrieval-augmented generation (RAG) settings, where obtaining the right context is critical. We introduce Prolog-Initialized Chain-of-Thought ($\pi$-CoT), a novel prompting strategy that combines logic programming's structural rigor with language models' flexibility. $\pi$-CoT reformulates multi-hop questions into Prolog queries decomposed as single-hop sub-queries. These are resolved sequentially, producing intermediate artifacts, with which we initialize the subsequent CoT reasoning procedure. Extensive experiments demonstrate that $\pi$-CoT significantly outperforms standard RAG and…
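The pipeline the abstract describes (decompose into single-hop sub-queries, resolve them in order, then seed the CoT prompt with the intermediate results) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy fact table stands in for retrieval plus single-hop answering, no Prolog engine or LLM is called, and the names `FACTS`, `SubQuery`, `resolve`, and `build_cot_prompt` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical toy fact table: (relation, subject) -> object.
# In a RAG setting each lookup would be a retrieval + single-hop QA step.
FACTS = {
    ("director", "Inception"): "Christopher Nolan",
    ("birthplace", "Christopher Nolan"): "London",
}


@dataclass(frozen=True)
class SubQuery:
    relation: str  # Prolog-style predicate name, e.g. director/2
    subject: str   # a constant, or a variable bound by an earlier hop
    target: str    # variable that receives this hop's answer


def resolve(sub_queries):
    """Resolve single-hop sub-queries in order, threading variable bindings."""
    bindings = {}
    notes = []  # intermediate artifacts used later to seed the prompt
    for q in sub_queries:
        # Substitute the subject if an earlier hop already bound it.
        subject = bindings.get(q.subject, q.subject)
        answer = FACTS.get((q.relation, subject))
        if answer is None:
            raise LookupError(f"no fact for {q.relation}({subject}, {q.target})")
        bindings[q.target] = answer
        notes.append(f"{q.relation}({subject}) = {answer}")
    return bindings, notes


def build_cot_prompt(question, notes):
    """Assemble a CoT prompt that starts from the resolved intermediate results."""
    lines = ["Intermediate results:"]
    lines += [f"- {note}" for note in notes]
    lines += [f"Question: {question}", "Let's think step by step."]
    return "\n".join(lines)


question = "Where was the director of Inception born?"
# Roughly the Prolog conjunction: director(inception, X), birthplace(X, Y).
plan = [
    SubQuery("director", "Inception", "X"),
    SubQuery("birthplace", "X", "Y"),
]
bindings, notes = resolve(plan)
prompt = build_cot_prompt(question, notes)
```

Because the second hop can only run once `X` is bound, the order of resolution is fixed by the decomposition itself, which is one plausible reading of how the structured plan keeps the later free-form reasoning from looping or drifting.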

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The paper is clearly written.
* The method is novel and is a useful way to combine the benefits of a symbolic system (Prolog) with the knowledge and natural language reasoning abilities of LLMs.
* Results are presented with variances and appear mostly significant.

Weaknesses

* The method’s dependence on Prolog potentially limits it to a specific set of problems.
* It is not clear what Memento refers to in the tables. Bolding in the tables is also confusing. This makes the results very hard to understand.
* Cost/latency is not evaluated.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper presents an interesting approach to addressing the limitations of combining CoT and RAG by incorporating neural-symbolic reasoning through the use of Prolog.
- The paper is clearly written, with well-organised explanations and helpful illustrations.

Weaknesses

- The experimental setup appears somewhat arbitrary and selective. For example, in Table 3, the retrieval model differs from that used in Tables 1 and 2. Is there a specific reason for this? It seems that the baselines in Tables 1 and 2 could also be evaluated under the retrieval model of Table 3 for a fairer comparison. Similarly, in Section 5.2 / Table 4, why are the PW-S and PW-M datasets not used in the experiments of Tables 1–3?
- The paper lacks robust analysis regarding the Prolog compon

Reviewer 03 · Rating 4 · Confidence 4

Strengths

[S1] Interesting conceptual perspective. The perspective of reliable decomposition with formal methods is interesting. It offers a principled way to constrain reasoning trajectories while keeping the subtasks manageable for LLMs during multi-hop inference.
[S2] Writing quality is good. The paper is in general well-written and easy to follow. It also appropriately situates itself among prior works (e.g., IRCoT, Self-Ask, GraphRAG, HippoRAG 2).
[S3] Comprehensive evaluation. The evaluation is done o

Weaknesses

[W1] Improvement can be inconsistent across datasets. On real-world datasets (e.g., HotpotQA, MuSiQue), $\pi$-CoT performs comparably to baselines, with statistical significance only on certain datasets such as 2WikiMultiHopQA. The claimed “significant outperforming” does not hold uniformly. It would also be good to make clear that only prompting-based methods are compared in the table, as the latest SOTA results on these datasets are far higher. E.g., finetuned approaches on HotpotQA are ~10% highe

Taxonomy

Topics: Narrative Theory and Analysis

Methods: Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART