T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu

TL;DR
This paper introduces VHD11K, a large, diverse multimodal dataset for harmful content detection, along with a novel multi-agent VQA annotation framework that enhances annotation reliability and improves harmfulness recognition methods.
Contribution
The paper presents a comprehensive harmful dataset covering diverse categories and a multi-agent VQA annotation process, advancing the generalizability and accuracy of harmful content detection.
Findings
VHD11K aligns well with human annotations, ensuring reliability.
The dataset reveals limitations of existing detection methods.
It improves harmfulness recognition performance over baseline datasets.
Abstract
To address the risks of encountering inappropriate or harmful content, researchers have used several harmful-content datasets together with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated around the presence of a narrow range of harmful objects and cover only real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definitions. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA)…
Taxonomy
Topics: Sleep and Work-Related Fatigue · Fire Detection and Safety Systems
