OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

Wenbin Hu; Huihao Jing; Haochen Shi; Changxuan Fan; Haoran Li; Yangqiu Song

arXiv:2603.13933 · cs.CL · April 17, 2026

TL;DR

This paper introduces OmniCompliance-100K, a large, rule-grounded safety compliance dataset covering 74 regulations and policies across multiple domains, and uses it to evaluate the safety capabilities of LLMs.

Contribution

The paper constructs a real-world compliance dataset of 12,985 rules and more than 106,000 cases. It addresses the shortage of rule-grounded, real-world cases in existing safety datasets and provides a benchmark for LLM safety evaluation.
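
To make the rule-and-case structure concrete, here is a minimal sketch of how rule-grounded cases could be represented and checked. It is not the dataset's actual schema: the `Rule` and `Case` classes, all field names, and the toy records are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # All fields are hypothetical; the real dataset may use a different schema.
    rule_id: str
    regulation: str  # name of the source regulation or policy
    text: str


@dataclass(frozen=True)
class Case:
    # A real-world case linked to the rule it is grounded in (hypothetical fields).
    case_id: str
    rule_id: str
    description: str


def group_cases_by_rule(cases):
    """Map each rule_id to the list of cases that reference it."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case.rule_id].append(case)
    return dict(grouped)


def ungrounded_cases(rules, cases):
    """Return cases whose rule_id does not match any known rule."""
    known_ids = {rule.rule_id for rule in rules}
    return [case for case in cases if case.rule_id not in known_ids]


# Toy records, invented for this sketch only.
rules = [
    Rule("R1", "Example Regulation A", "Placeholder rule text 1."),
    Rule("R2", "Example Policy B", "Placeholder rule text 2."),
]
cases = [
    Case("C1", "R1", "Placeholder case 1."),
    Case("C2", "R1", "Placeholder case 2."),
    Case("C3", "R9", "Placeholder case citing an unknown rule."),
]

cases_per_rule = {rid: len(items) for rid, items in group_cases_by_rule(cases).items()}
print(cases_per_rule)
print([case.case_id for case in ungrounded_cases(rules, cases)])
```

For scale, the reported counts imply that 106,000 / 12,985 comes to roughly 8 cases per rule on average, treating "over 106,000" as a lower bound.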

Findings

01. Strong alignment between rules and cases is confirmed.

02. Benchmarking reveals that safety performance varies across LLMs.

03. The results suggest directions for future LLM safety improvements.

Abstract

Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards,…

Code & Models

Datasets

hubin/OmniCompliance100K
dataset · 120 downloads