Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

Jian-Qiao Zhu; Hanbo Xie; Dilip Arumugam; Robert C. Wilson; Thomas L. Griffiths

arXiv:2505.11614·cs.AI·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates using reinforcement learning to fine-tune large language models so they can both predict human decision-making and generate interpretable natural language explanations of cognitive processes.

Contribution

It introduces a reinforcement learning approach that trains LLMs for the dual tasks of predicting and explaining human decisions, so that the resulting cognitive model is interpretable as well as predictive.

Findings

01

LLMs can generate high-quality explanations of human risky choices.

02

Reinforcement learning improves the alignment of explanations with decision predictions.

03

The method achieves strong predictive performance and interpretability in cognitive modeling.

Abstract

A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.
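The outcome-based reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<answer>` tag format, the two-option A/B framing, and the function names `outcome_reward` and `group_relative_advantages` are all assumptions made for illustration. The group normalization follows the GRPO-style scheme that the reviews below mention. The reward looks only at whether the predicted choice matches the recorded human choice; the reasoning trace itself is never scored directly.

```python
import re
from statistics import mean, pstdev

# Hypothetical output format (an assumption, not taken from the paper):
# the completion ends with <answer>A</answer> or <answer>B</answer>.
ANSWER_PATTERN = re.compile(r"<answer>\s*([AB])\s*</answer>", re.IGNORECASE)


def outcome_reward(completion: str, human_choice: str) -> float:
    """Return 1.0 if the predicted option matches the human's choice, else 0.0."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return 0.0  # unparseable output earns no reward
    # Use the last tagged answer so earlier mentions in the reasoning are ignored.
    return 1.0 if matches[-1].upper() == human_choice.upper() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled completions for one choice problem where the human picked A.
completions = [
    "Option A is a sure gain, so a risk-averse person takes it. <answer>A</answer>",
    "The gamble has the higher expected value. <answer>B</answer>",
    "I am not sure.",
]
rewards = [outcome_reward(c, "A") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch only the first completion receives a positive advantage, so a policy-gradient update would raise the probability of its whole output, reasoning trace included.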

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper has a clear goal, and that goal is well executed. I think this kind of modeling and analysis effort will be interesting for the cognitive science community. The evaluations and ablations are quite comprehensive. The Appendix, in particular, contains several insightful analyses. The two that are very important for the paper’s message are: - The ablation experiment in C.3, where the authors swap the CoT between the RL and the base models to show the importance of these traces. - The e

Weaknesses

I’m not convinced that RL is essential for this pipeline. The development of predictive and explanatory reasoning traces is attributed to RL on the basis of comparisons with the base model. However, perhaps SFT (either Centaur style or full) is also sufficient to develop such traces. If this is the case, SFT may be preferred over RL fine-tuning given a) reduced computational costs during training and b) difficulties around getting RL to work, as the authors have also pointed out in Section F. **If the

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The topic is interesting, and to my knowledge, this is the first work that applies RL to analyze risky human decision-making.
2. The experiments are abundant.
3. Figure 1 provides a clear and concrete example that effectively illustrates the task.

Weaknesses

1. Although the paper integrates RL to explain human decision-making and defines a reward function in Formula (1), it lacks an in-depth analysis of the task. As a result, the contribution appears incremental, and the work reads more like an experimental report than a research paper.
2. The paper leans more toward psychological or cognitive science research than computer science, as the main contributions involve cognitive interpretation rather than methodological innovation. Moreover

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The problem of explaining human decisions is interesting and worth exploring.

Weaknesses

I do not see the contribution of this paper. The outcome reward that compares prediction correctness is standard in RLVR, GRPO is borrowed directly from the literature, and the "step-by-step" prompt is also borrowed directly from the literature. In that sense, I do not see anything new or unique proposed by this paper. Additionally, this paper focuses on the interpretability of human decisions; however, I do not see any special design for this purpose. Specifically, the paper proposes an outco

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education