MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory
Ana Carolina Condez, Diogo Tavares, João Magalhães

TL;DR
MoralCLIP introduces a novel multimodal embedding method that incorporates moral foundations, enabling AI systems to interpret and reason about moral content across visual and textual data.
Contribution
It extends vision-language models with explicit moral grounding based on Moral Foundations Theory, using a new dataset and data augmentation for improved moral understanding.
Findings
Explicit moral supervision enhances moral content recognition.
MoralCLIP achieves better cross-modal moral alignment.
The approach supports morally-aware AI applications.
Abstract
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results…
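To make the idea of cross-modal moral alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard CLIP-style symmetric contrastive loss and adds a hypothetical auxiliary term that pulls image-text cosine similarity toward the Jaccard overlap of their multi-label moral-foundation annotations. The function names, the choice of Jaccard overlap, the squared-error form, and the weight `lam` are all assumptions made for illustration; the abstract does not specify the loss.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def jaccard_matrix(labels_a, labels_b):
    """Pairwise Jaccard overlap between multi-hot label rows (assumed similarity measure)."""
    inter = labels_a @ labels_b.T
    union = labels_a.sum(axis=1)[:, None] + labels_b.sum(axis=1)[None, :] - inter
    # Two empty label sets are treated as identical (similarity 1.0).
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; the i-th image and i-th text form the positive pair."""
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def moral_alignment_loss(img, txt, img_labels, txt_labels):
    """Hypothetical term: squared error between cosine similarity and label overlap."""
    target = jaccard_matrix(img_labels, txt_labels)
    return np.mean((img @ txt.T - target) ** 2)

def combined_loss(img, txt, img_labels, txt_labels, lam=0.5, temperature=0.07):
    """Contrastive loss plus the assumed moral term, weighted by the hypothetical lam."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    moral = moral_alignment_loss(img, txt, img_labels, txt_labels)
    return clip_loss(img, txt, temperature) + lam * moral
```

In this sketch each label row would be a multi-hot vector over MFT dimensions (for example Care, Fairness, Loyalty, Authority, Sanctity), so two items that share foundations are encouraged to sit close together even when they are not a matched image-text pair.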
