MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

Jackson Trager; Francielle Vargas; Diego Alves; Matteo Guida; Mikel K. Ngueajio; Ameeta Agrawal; Yalda Daryani; Farzan Karimi-Malekabadi; Flor Miriam Plaza-del-Arco

arXiv:2506.19073 · cs.CL · October 14, 2025

TL;DR

This paper introduces MFTCXplain, a multilingual dataset for evaluating LLMs' moral reasoning through multi-hop hate speech explanations, highlighting current models' limitations in moral understanding across languages.

Contribution

The paper presents a new multilingual benchmark dataset, annotated with hate speech labels, moral categories, and span-level rationales, for assessing moral reasoning in LLMs. It addresses gaps in transparency and cultural diversity in existing benchmarks.

Findings

01. LLMs perform well in hate speech detection (F1 up to 0.836).

02. LLMs show weak ability to predict moral sentiments (F1 < 0.35).

03. Rationale alignment is limited in underrepresented languages.
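The first two findings are reported as F1 scores. As a reminder of what that metric measures, here is a minimal sketch of binary F1 (the harmonic mean of precision and recall) in Python. The function name `f1_score_binary` and the toy labels are illustrative assumptions, not taken from the paper, and this summary does not say which F1 variant (binary, macro, or micro) the authors report.

```python
def f1_score_binary(gold, pred, positive=1):
    """Binary F1 for one positive class (illustrative helper, not the paper's code)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)  # true positives
    fp = sum(1 for g, p in pairs if g != positive and p == positive)  # false positives
    fn = sum(1 for g, p in pairs if g == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = hate speech, 0 = not hate speech).
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
score = f1_score_binary(gold, pred)
print(round(score, 3))
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is 2/3.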

Abstract

Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanation using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in…
