Do explanations generalize across large reasoning models?
Koyena Pal, David Bau, Chandan Singh

TL;DR
This paper investigates whether the chain-of-thought explanations generated by large reasoning models (LRMs) generalize across models. It finds that they often do and that reinforcement learning enhances this generalization, but it advises caution in interpreting these explanations.
Contribution
The study introduces a framework for evaluating explanation generalization across LRMs and proposes a sentence-level ensembling method to improve consistency.
Findings
Explanations often generalize across LRMs.
Reinforcement learning enhances explanation generalization.
Ensembling strategies improve answer consistency.
Abstract
Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference…
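The notion of generalization described in the abstract, that an explanation from one LRM induces the same behavior in other LRMs, can be illustrated with a simple agreement score. The sketch below is an assumption for illustration only, not the paper's metric: the function name `pairwise_consistency`, the model names, and the toy answers are all hypothetical. It computes the fraction of (model pair, problem) combinations on which two models give the same final answer, once without and once with a shared explanation in the prompt.

```python
from itertools import combinations


def pairwise_consistency(answers_by_model):
    """Fraction of (model pair, problem) combinations with identical answers.

    answers_by_model maps a model name to its list of final answers,
    one per problem, in the same problem order for every model.
    This is a hypothetical stand-in for a consistency metric.
    """
    agree = 0
    total = 0
    for model_a, model_b in combinations(answers_by_model, 2):
        for ans_a, ans_b in zip(answers_by_model[model_a], answers_by_model[model_b]):
            agree += int(ans_a == ans_b)
            total += 1
    return agree / total if total else 0.0


# Hypothetical toy answers on four multiple-choice problems.
# "Without": each model answers from the problem statement alone.
without_explanation = {
    "model_a": ["A", "B", "C", "D"],
    "model_b": ["A", "C", "C", "A"],
    "model_c": ["B", "B", "C", "D"],
}
# "With": every model is additionally given the same explanation text.
with_explanation = {
    "model_a": ["A", "B", "C", "D"],
    "model_b": ["A", "B", "C", "A"],
    "model_c": ["A", "B", "C", "D"],
}

baseline = pairwise_consistency(without_explanation)
shared = pairwise_consistency(with_explanation)
print(f"consistency without explanation: {baseline:.2f}")
print(f"consistency with explanation:    {shared:.2f}")
```

Under this toy setup, an explanation would count as generalizing if the score with the shared explanation exceeds the baseline. Note that agreement says nothing about correctness: models can be consistently wrong.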
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
