Reasoning Elicitation in Language Models via Counterfactual Feedback
Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González

TL;DR
This paper introduces new metrics and fine-tuning methods to enhance the causal reasoning abilities of language models, especially in counterfactual question answering, and evaluates their effectiveness in realistic scenarios.
Contribution
It proposes novel metrics for assessing reasoning in language models and develops fine-tuning approaches to improve their causal reasoning capabilities.
Findings
Fine-tuning improves reasoning accuracy in counterfactual questions.
New metrics better capture reasoning abilities than traditional metrics.
Models show enhanced generalization in reasoning tasks after fine-tuning.
Abstract
Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and…
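The abstract does not spell out the proposed metrics, so the following is only an illustrative sketch of what "balancing accuracy in factual and counterfactual questions" could look like in code. The `PairedResult` type, the `summarize` function, and the "joint" score (the share of instances where both the factual and the counterfactual answer are correct) are assumptions made for this example and are not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Hypothetical record: one scenario asked as a factual and a counterfactual question."""
    factual_correct: bool
    counterfactual_correct: bool


def summarize(results):
    """Return factual, counterfactual, and joint accuracy over paired results.

    The joint score is an illustrative stand-in for a balanced metric:
    it only credits instances where both answers are correct.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one paired result")
    factual = sum(r.factual_correct for r in results) / n
    counterfactual = sum(r.counterfactual_correct for r in results) / n
    joint = sum(r.factual_correct and r.counterfactual_correct for r in results) / n
    return {"factual": factual, "counterfactual": counterfactual, "joint": joint}


# Made-up example data, not results from the paper.
example = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(True, False),
    PairedResult(False, True),
]
scores = summarize(example)
```

In this made-up example, factual accuracy alone is high while the joint score is much lower, which is the kind of gap a factual-only metric would hide.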
Peer Reviews
Decision: ICLR 2025 Oral
* Introducing more fine-grained metrics: The proposed metrics for causal consistency are a valuable contribution, addressing a limitation of existing evaluation methods that focus primarily on accuracy.
* Generalization modes: The proposed classes for generalization modes provide a structured framework for evaluating reasoning transfer.
* Experiments: The paper includes a comprehensive set of experiments, including a hand-crafted puzzle and real-world problems.
* The choice of a very small LLM (Phi-3 mini) with limited reasoning capabilities makes it difficult to draw any conclusions for stronger models that are more commonly used for reasoning tasks.
* Clarity and Presentation: While the paper focuses on an interesting problem and makes interesting suggestions, the presentation could be significantly improved. I found the formal definitions and descriptions of the methods a bit difficult to follow.
* Analysis of generalization: The analysis of the
Causal reasoning is an important problem, and Pearl's approach of using counterfactuals for causal reasoning is sound and well studied. There is indeed a need for more fine-grained metrics when evaluating the reasoning ability of LLMs, and the paper's efforts toward that are a plus. The paper presents several fine-tuning techniques and evaluates them using the proposed metrics on several examples. The evaluations show that their fine-tuning methods lead to better performance with res
The examples that are studied in the paper are very direct in terms of language and LLMs can translate them to a formal representation (LLMs are good at that) and then call a dedicated solver and thus achieve near 100% accuracy. What is the motivation then of doing it the way that is proposed in the paper? Why is your approach preferable to using a formal representation and a dedicated solver? That motivation must match with the way the examples and evaluation data are presented and studied.
I see a lot of strengths in this paper, including:
- The paper presents an interesting framework for fine-tuning LLMs with counterfactual feedback, which is highly relevant and original.
- The quality of the paper is very good: it is very well structured, organized, and it integrates a multiplicity of concepts, scenarios and metrics for the problem at hand. In this sense, the paper is beyond complete.
- The significance of the paper appears to be very relevant for researchers in causal reaso
I see very few weaknesses in this paper. Here are a couple of things I would improve:
- Organization of the paper: the placement of related work is a bit odd; perhaps moving it to the end of the paper would help.
- Conclusion: the limitations in the conclusion feel a bit 'incomplete', in the sense that they leave room for further research on non-binary representations of causes and effects. Can you give some hints on how the proposed method could be adapted in this case?
- There is a typ
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Balanced Selection
