Inducing Human-like Biases in Moral Reasoning Language Models
Artem Karpov, Seong Hah Cho, Austin Meek, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea

TL;DR
This paper investigates whether fine-tuning large language models on human moral reasoning data, including brain imaging data, can enhance their alignment with human moral cognition, finding limited improvement in BrainScore.
Contribution
It explores the effect of fine-tuning LLMs on moral reasoning and brain data, revealing that larger models perform better but BrainScores do not significantly improve.
Findings
Larger models outperform smaller ones on moral reasoning tasks.
Fine-tuning on fMRI data does not significantly increase BrainScore.
Model accuracy improves with size, but alignment with brain data remains limited.
Abstract
In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore whether fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.
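The abstract does not spell out how a BrainScore between model activations and fMRI data is computed. One common way to obtain such a score is to fit a regularized linear map from activations to voxel responses and report the held-out Pearson correlation. The sketch below illustrates that idea under stated assumptions: the function names `ridge_fit` and `brain_score`, the ridge penalty `alpha`, and the 5-fold split are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def brain_score(activations, fmri, n_folds=5, alpha=1.0, seed=0):
    """Mean held-out Pearson r between predicted and measured voxel responses.

    activations: array of shape (n_stimuli, n_features), e.g. one hidden-layer
                 vector per stimulus (assumed layout, not the paper's).
    fmri:        array of shape (n_stimuli, n_voxels).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(activations)), n_folds)
    fold_scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])

        # Standardize features with training statistics only (no test leakage).
        mu_x = activations[train].mean(axis=0)
        sd_x = activations[train].std(axis=0) + 1e-8
        X_train = (activations[train] - mu_x) / sd_x
        X_test = (activations[test] - mu_x) / sd_x

        # Center the targets so the ridge fit needs no intercept term.
        mu_y = fmri[train].mean(axis=0)
        W = ridge_fit(X_train, fmri[train] - mu_y, alpha)
        pred = X_test @ W + mu_y

        # Pearson correlation per voxel on the held-out stimuli.
        p = pred - pred.mean(axis=0)
        t = fmri[test] - fmri[test].mean(axis=0)
        denom = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + 1e-12
        fold_scores.append(((p * t).sum(axis=0) / denom).mean())
    return float(np.mean(fold_scores))
```

Calling `brain_score(activations, fmri)` on synthetic data where the voxel responses are a noisy linear function of the activations yields a score near 1, while unrelated random responses yield a score near 0. Comparing this number before and after fine-tuning is the kind of test the abstract describes.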
Taxonomy
Topics: Psychology of Moral and Emotional Judgment
Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Adam · Residual Connection · Weight Decay · Softmax · Multi-Head Attention
