The Moral Gap of Large Language Models
Maciej Skorski, Alina Landowska

TL;DR
This paper compares large language models with fine-tuned transformers on moral foundation detection. It finds substantial performance gaps, with LLMs systematically under-detecting moral content, and concludes that task-specific fine-tuning outperforms prompting.
Contribution
It provides the first comprehensive comparison of LLMs and fine-tuned models on moral reasoning across Twitter and Reddit datasets, and documents where prompted LLMs fall short.
Findings
LLMs show high false negative rates in moral detection
Fine-tuned models outperform LLMs in moral reasoning tasks
Prompt engineering alone is insufficient for accurate moral content detection
Abstract
Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
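To make the evaluation terms concrete, here is a minimal sketch of how a false negative rate and the points of a DET curve (false positive rate against false negative rate across thresholds) can be computed from classifier scores. It is not the authors' code: the function names `confusion_rates` and `det_points`, the 0.5 threshold, and the toy labels and scores are all assumptions made up for illustration.

```python
from typing import List, Tuple


def confusion_rates(labels: List[int], scores: List[float],
                    threshold: float) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) at one threshold.

    A text is predicted "moral" when its score is >= threshold.
    Assumes both classes are present in `labels`.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("labels must contain both classes")
    # Missed moral content: true label 1, score below the threshold.
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    # False alarms: true label 0, score at or above the threshold.
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return fn / positives, fp / negatives


def det_points(labels: List[int],
               scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, FNR) pairs, one per distinct score used as a threshold."""
    points = []
    for t in sorted(set(scores)):
        fnr, fpr = confusion_rates(labels, scores, t)
        points.append((fpr, fnr))
    return points


# Hypothetical toy data: 1 = moral content present, 0 = absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.35, 0.1, 0.05, 0.6]

fnr, fpr = confusion_rates(labels, scores, threshold=0.5)
print(f"FNR={fnr:.2f}, FPR={fpr:.2f}")
for point in det_points(labels, scores):
    print(point)
```

In this toy example most positive items score below the threshold, so the false negative rate is high while the false positive rate stays low, which is the kind of under-detection pattern the abstract describes for prompted LLMs.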
