From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Zhihao Zhang; Yiran Zhang; Xiyue Zhou; Liting Huang; Imran Razzak; Preslav Nakov; Usman Naseem

arXiv:2505.18685·cs.CL·November 26, 2025

Open Access · 1 dataset · 1 video

TL;DR

This paper introduces MM-Health, a multimodal dataset of over 34,000 health news articles, including AI-generated content, intended to improve detection of health misinformation across multiple tasks.

Contribution

The paper presents a large-scale, multimodal health misinformation dataset containing both human- and AI-generated content. It addresses gaps in existing datasets and benchmarks current detection models.

Findings

01. Existing models struggle with reliability and origin detection.

02. The dataset covers diverse health topics and includes multimodal information.

03. AI-generated content is significantly represented in the dataset.

Abstract

Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these issues, we present MM-Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM-Health includes…

Code & Models

Datasets

zzha6204/MM-Health
dataset · 82 downloads

Videos

From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Taxonomy

Topics: Misinformation and Its Impacts