RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection

Liting Huang; Zhihao Zhang; Yiran Zhang; Xiyue Zhou; Shoujin Wang

arXiv:2406.04906 · cs.CV · February 19, 2025

TL;DR

This paper introduces RU-AI, a large multimodal dataset for detecting machine-generated content across text, image, and voice, aiming to improve detection methods amidst the rise of generative AI models.

Contribution

The paper contributes RU-AI, a multimodal dataset of over 1.4 million instances, including noise variants, built to support research on machine-generated content detection.

Findings

01 Current state-of-the-art models struggle with accuracy and robustness on RU-AI.

02 The dataset highlights the need for improved detection techniques for multimodal AI-generated content.

03 Extensive experiments demonstrate the challenges in existing detection methods.

Abstract

The ability of recent generative AI models to create realistic, human-like content is significantly transforming the ways in which people communicate, create, and work. Machine-generated content is a double-edged sword. On one hand, it can benefit society when used appropriately. On the other hand, it may mislead people and pose threats to society, especially when mixed with natural content created by humans. Hence, there is an urgent need to develop effective methods to detect machine-generated content. However, the lack of aligned multimodal datasets has inhibited the development of such methods, particularly in triple-modality settings (e.g., text, image, and voice). In this paper, we introduce RU-AI, a new large-scale multimodal dataset for robust and effective detection of machine-generated content in text, image, and voice. Our dataset is constructed on the…

Code & Models

Repositories

zhihaozhang97/ru-ai
PyTorch, official

Taxonomy

Topics: Text and Document Classification Technologies