T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Chen Yeh; You-Ming Chang; Wei-Chen Chiu; Ning Yu

arXiv:2409.19734·cs.CV·October 3, 2024

TL;DR

This paper introduces VHD11K, a multimodal dataset of 10,000 images and 1,000 videos for harmful content detection, along with a multi-agent VQA annotation framework that makes annotations more reliable and improves harmfulness recognition methods.

Contribution

The paper presents a comprehensive harmful dataset covering diverse categories and a multi-agent VQA annotation process, advancing the generalizability and accuracy of harmful content detection.

Findings

1. VHD11K aligns well with human annotations, which supports the reliability of its labels.
2. The dataset reveals limitations of existing detection methods.
3. The dataset improves harmfulness recognition performance over baseline datasets.

Abstract

To address the risks of encountering inappropriate or harmful content, researchers have combined several harmful-content datasets with machine learning methods to detect harmful concepts. However, existing harmful-content datasets are curated around the presence of a narrow range of harmful objects and cover only real harmful content sources. This hinders the generalizability of methods based on such datasets and can lead to misjudgments. Therefore, we propose a comprehensive harmful-content dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definitions. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA)…

Code & Models

Repositories

nctu-eva-lab/vhd11k (official)

Videos

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition · SlidesLive

Taxonomy

Topics: Sleep and Work-Related Fatigue · Fire Detection and Safety Systems