The Moral Gap of Large Language Models

Maciej Skorski; Alina Landowska

arXiv:2507.18523 · cs.CL · July 25, 2025

TL;DR

This paper compares large language models with fine-tuned transformers on moral foundation detection, finding substantial performance gaps and showing that task-specific fine-tuning outperforms prompting.

Contribution

The paper provides the first comprehensive comparison of LLMs and fine-tuned transformers on moral foundation detection across Twitter and Reddit datasets, and documents where LLMs fall short.

Findings

01. LLMs show high false negative rates in moral content detection.
02. Fine-tuned models outperform LLMs on moral reasoning tasks.
03. Prompt engineering alone is insufficient for accurate moral content detection.

Abstract

Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
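To make the evaluation vocabulary in the abstract concrete, the following is a minimal sketch of how the points on ROC, PR, and DET curves can be derived from binary labels and classifier scores. It is not the authors' evaluation code: the function name `curve_points` and the toy labels and scores are hypothetical, chosen only for illustration. The false negative rate (FNR) is the quantity the abstract reports as high for LLMs.

```python
from typing import Dict, List


def curve_points(labels: List[int], scores: List[float]) -> List[Dict[str, float]]:
    """Sweep a decision threshold over the scores and return the rates
    that define the ROC, PR, and DET curves at each threshold.

    labels: 1 = moral content present, 0 = absent (hypothetical encoding).
    scores: higher means the classifier is more confident in label 1.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = []
    # Each distinct score is used as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fn = positives - tp
        points.append({
            "threshold": threshold,
            "tpr": tp / positives,        # ROC y-axis; also recall (PR x-axis)
            "fpr": fp / negatives,        # ROC x-axis and DET x-axis
            "fnr": fn / positives,        # DET y-axis: share of missed positives
            "precision": tp / (tp + fp),  # PR y-axis; tp + fp >= 1 here
        })
    return points


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.4, 0.1, 0.2]

for point in curve_points(labels, scores):
    print(point)
```

A classifier that systematically under-detects moral content keeps a high `fnr` until the threshold is loosened far enough that `fpr` rises as well, which is the trade-off a DET curve displays directly.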
