Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha

TL;DR
This paper investigates how uncertainty affects moral decision-making in large language models, revealing that increasing model uncertainty through dropout improves alignment with human moral judgments in ethical dilemmas.
Contribution
It introduces a method to quantify and manipulate moral uncertainty in LLMs using dropout, enhancing their alignment with human moral preferences.
Findings
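Model confidence varies more across models than across moral dimensions, suggesting that architecture and training method shape moral uncertainty more than the type of dilemma does.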
Dropout increases model entropy, driven mainly by a rise in mutual information.
Higher mutual information correlates with better human-LLM moral alignment.
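The entropy and mutual information terms in the findings above can be illustrated with a minimal sketch. It assumes the standard decomposition used with stochastic forward passes (total entropy = expected per-pass entropy + mutual information); the excerpt here does not spell out the paper's exact formulation, so the function names, the binary "option A vs. option B" framing, and the probabilities below are all hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    # Clip to avoid log(0) at p = 0 or p = 1.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def decompose_uncertainty(probs):
    """Split predictive uncertainty over several stochastic forward passes.

    probs: P(option A) from each dropout-perturbed pass (hypothetical input).
    Returns (total, conditional, mutual_information), all in bits.
    """
    probs = np.asarray(probs, dtype=float)
    total = float(binary_entropy(probs.mean()))        # entropy of the averaged prediction
    conditional = float(binary_entropy(probs).mean())  # average entropy of each pass
    return total, conditional, total - conditional     # the gap is the mutual information

# Hypothetical pass-level probabilities, not data from the paper.
confident = [0.97, 0.97, 0.97, 0.97]  # passes agree: mutual information near zero
perturbed = [0.99, 0.60, 0.95, 0.30]  # passes disagree: mutual information is positive

for name, probs in [("confident", confident), ("perturbed", perturbed)]:
    total, conditional, mi = decompose_uncertainty(probs)
    print(f"{name}: total={total:.3f} conditional={conditional:.3f} MI={mi:.3f}")
```

In this sketch, mutual information is zero when every pass returns the same probability and grows as the passes disagree, which is the quantity the findings associate with closer human-LLM alignment.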
Abstract
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
