MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Ana Carolina Condez; Diogo Tavares; João Magalhães

arXiv:2506.05696·cs.CV·October 31, 2025

TL;DR

MoralCLIP introduces a novel multimodal embedding method that incorporates moral foundations, enabling AI systems to interpret and reason about moral content across visual and textual data.

Contribution

It extends vision-language models with explicit moral grounding based on Moral Foundations Theory, using a new dataset and data augmentation for improved moral understanding.

Findings

01. Explicit moral supervision enhances moral content recognition.
02. MoralCLIP achieves better cross-modal moral alignment.
03. The approach supports morally-aware AI applications.

Abstract

Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results…
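
The abstract does not spell out the training objective, so the following is only an illustrative sketch of how a CLIP-style contrastive loss might be combined with a moral-alignment term over multi-label MFT annotations. It is not the authors' implementation. The Jaccard overlap, the squared-error penalty, the weight `lam`, the temperature value, and all function names are assumptions made for this example.

```python
import math

def jaccard(a, b):
    """Overlap between two sets of moral-foundation labels (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss where img[i] and txt[i] form the matched pair."""
    n = len(img)
    logits = [[cosine(img[i], txt[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

def moral_alignment_penalty(img, txt, img_labels, txt_labels):
    """Mean squared gap between cross-modal cosine similarity and label overlap
    (a hypothetical penalty, not taken from the paper)."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gap = cosine(img[i], txt[j]) - jaccard(img_labels[i], txt_labels[j])
            total += gap * gap
    return total / (n * n)

def moral_contrastive_loss(img, txt, img_labels, txt_labels, lam=0.5):
    """CLIP loss plus a weighted moral-alignment penalty; lam is an assumed weight."""
    return clip_loss(img, txt) + lam * moral_alignment_penalty(
        img, txt, img_labels, txt_labels)
```

In this sketch, pairs that share moral foundations (for example, both labeled `care`) are pulled toward higher cross-modal similarity even when they are not the matched pair, which is one plausible reading of "cross-modal moral alignment".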
