TL;DR
This paper introduces a conformal prediction method to ensure the factual correctness and logical coherence of language model outputs in reasoning tasks, particularly in mathematical problem solving.
Contribution
It develops a novel approach using split conformal prediction on deducibility graphs to guarantee coherent factuality in language model reasoning outputs.
Findings
Achieves 90% factuality on strict criteria.
Retains 80% of claims while ensuring correctness.
Produces consistent, substantiated reasoning orderings.
Abstract
Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define "coherent factuality" and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a "deducibility graph" that represents…
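The split conformal step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each calibration example has already been reduced to a single nonconformity score; how the paper derives that score from subgraphs of the deducibility graph is not shown in this excerpt, so the function name `conformal_threshold` and the toy scores are hypothetical. The sketch only shows the standard finite-sample quantile used in split conformal prediction.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split conformal threshold for miscoverage level alpha.

    cal_scores: nonconformity scores from a held-out calibration set
                (hypothetical inputs; one score per calibration example).
    alpha:      target error rate, e.g. 0.1 for a 90% guarantee.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite threshold
        # gives the guarantee, so fall back to the most conservative value.
        return math.inf
    return sorted(cal_scores)[k - 1]


# Toy calibration scores, invented purely for illustration.
toy_scores = [0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.7, 0.6, 0.8]
threshold = conformal_threshold(toy_scores, alpha=0.5)
```

Under exchangeability of calibration and test examples, a fresh test score falls at or below this threshold with probability at least 1 - alpha, which is the kind of marginal guarantee the abstract refers to.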
Peer Reviews
Decision: ICLR 2025 Poster
This paper clearly points out that existing conformal factuality is not appropriate for reasoning tasks, and proposes a deducibility graph and conformal prediction with coherent factuality. In addition, it experimentally achieves the desired correctness and substantiation by applying conformal prediction to the newly defined coherent factuality. It also proposes a claim-scoring function that takes the graph into account and reflects confidence along descendants well.
The paper argues convincingly that coherent factuality is more necessary for reasoning tasks than independent factuality. However, if bad claims, as mentioned in the sentence below, are accepted because they are consistent, wouldn't that be of no help in resolving hallucination? I think additional explanation of coherent factuality or deducibility is needed beyond what is defined in the paper. "Our definition of deducibility graphs permits the arbitrary treatment of claims that do not follow fr
Several recent works have used conformal prediction to verify the correctness of LLM generations under the strong assumption that the factuality of a claim can be evaluated independently. To generalize the method to reasoning domains, where claims need to be substantiated and output in a comprehensible order, the paper defines a new notion of factuality, "coherent factuality", and develops a conformal-prediction-based method to guarantee coherent factuality of language model outputs. Th
It is not clear how $C_{true}$ is created and used in the experiments on the MATH and FELM datasets. The paper says in Line 150, "In practice, we might choose some reference like Wikipedia or a math textbook as our ground truth"; however, there are no statements about $C_{true}$ in the experiments. The paper uses GPT-4o to generate the graphs, but the quality of the graphs is unknown. In addition, the proposed method can obtain both coherent factuality and independent factuality of the LLM output, howe
1. Extending the conformal prediction framework to reasoning problems is an important direction. The idea of considering the dependency structure among the claims is straightforward and effective. 2. The proposed framework is simple to implement and shows stronger performance than baselines in the experiments.
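To make the dependency-structure idea concrete, here is a minimal sketch of dependency-aware filtering. It is an illustration under stated assumptions, not the paper's algorithm: the names `coherent_filter`, `parents`, and `scores` are hypothetical, higher scores are assumed to mean higher confidence, and the graph is assumed to be a DAG mapping each claim to the claims it depends on. A claim is kept only if it clears the threshold and every claim it depends on was also kept, so the retained claims are both substantiated and emitted in a valid reasoning order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def coherent_filter(parents, scores, threshold):
    """Keep claims whose score clears the threshold and whose
    prerequisites were all kept; return them in dependency order.

    parents:   dict mapping each claim to the set of claims it depends on
               (hypothetical representation of a dependency DAG).
    scores:    dict mapping each claim to a confidence score
               (assumption: higher means more confident).
    threshold: minimum score for a claim to be retained.
    """
    kept, kept_set = [], set()
    # static_order() yields every claim after all of its prerequisites.
    for claim in TopologicalSorter(parents).static_order():
        confident = scores[claim] >= threshold
        supported = all(p in kept_set for p in parents.get(claim, ()))
        if confident and supported:
            kept.append(claim)
            kept_set.add(claim)
    return kept


# Toy example: "c" is confident but depends on the rejected claim "b".
toy_parents = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"a"}}
toy_scores = {"a": 0.9, "b": 0.3, "c": 0.95, "d": 0.8}
kept_claims = coherent_filter(toy_parents, toy_scores, threshold=0.5)
```

In the toy example, claim "c" is dropped despite its high score because its prerequisite "b" was filtered out, which is the behavior that separates this kind of filtering from scoring each claim in isolation.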
1. The writing of this paper can be substantially improved and in general should be more rigorous. There are a number of writing issues that make the paper a bit hard to understand. To list a few:
* One key property of the ideal graph is that it uses the "minimal set of the claims". However, this is only mentioned in the appendix.
* What is the "graph G" at LINE 350? Is it the subgraph corresponding to each node?
* LINE 402 points to appendix F, but appendix F does not contain the p