Trust The Typical
Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary

TL;DR
Trust The Typical (T3) is a safety framework for LLMs that detects unsafe prompts by modeling the distribution of safe prompts, achieving state-of-the-art results without training on harmful data.
Contribution
T3 operationalizes safety as out-of-distribution detection, requiring no harmful examples and generalizing across languages and domains with high efficiency.
Findings
Achieves state-of-the-art performance on 18 safety benchmarks.
Reduces false positive rates by up to 40x relative to specialized safety models.
Maintains less than 6% overhead in production settings.
Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English…
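To make the abstract's core idea concrete, the following is a minimal sketch of out-of-distribution detection over prompt embeddings. It is not the T3 method itself: it substitutes a simple k-th-nearest-neighbour distance for whatever scoring the paper uses. The 2-D toy "embeddings", the function names, `k`, and the 95% calibration quantile are all illustrative assumptions.

```python
import math

def knn_distance(point, reference, k):
    """Distance from `point` to its k-th nearest neighbour in `reference`."""
    dists = sorted(math.dist(point, r) for r in reference)
    return dists[k - 1]

def fit_threshold(safe_embeddings, k=2, quantile=0.95):
    """Calibrate a threshold from leave-one-out k-NN distances of safe points."""
    scores = []
    for i, p in enumerate(safe_embeddings):
        others = safe_embeddings[:i] + safe_embeddings[i + 1:]
        scores.append(knn_distance(p, others, k))
    scores.sort()
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]

def is_atypical(embedding, safe_embeddings, threshold, k=2):
    """Flag an embedding whose k-NN distance to the safe set exceeds the threshold."""
    return knn_distance(embedding, safe_embeddings, k) > threshold

# Hypothetical 2-D embeddings of safe prompts (real embeddings would be
# high-dimensional vectors produced by a text encoder).
safe = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.1)]
threshold = fit_threshold(safe)

near_result = is_atypical((0.05, 0.02), safe, threshold)  # inside the safe cluster
far_result = is_atypical((3.0, 3.0), safe, threshold)     # far from every safe point
print(near_result, far_result)
```

Note that only safe examples are used to fit the threshold, which mirrors the abstract's claim that no harmful training data is required. Everything else here is a simplification.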
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well written and easy to follow. 2. The new perspective proposed by the paper is quite novel. 3. Extensive experiments show that the proposed method is very effective and significantly outperforms baseline methods on various benchmarks. 4. Integration into vLLM shows a promising application of the method for production use.
1. The method may struggle on datasets with many borderline prompts, such as those shown in the paper, or against attack methods that intentionally resemble in-distribution prompts, such as [1]. [1] Luo, Xuan, et al. "A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness." arXiv preprint arXiv:2509.14297 (2025).
1. The paper is well written. 2. The theoretical analysis is grounded.
1. If harmful prompts are phrased in statistically typical language and differ only in minor wording, T3’s OOD detector may fail to flag them, missing context-dependent cases. 2. T3’s performance depends on how comprehensively the safe corpus captures benign query diversity; overly narrow or biased data may lead to false alarms on novel but harmless inputs. 3. T3 detects distributional anomalies. Could some forms of undesired content be statistically typical and thus fail to trigger an OOD alar
* T3 is evaluated on many datasets and compared to a wide range of alternative methods, showing its advantage over them. * The use of manifold per-point PRDC metrics and treating harmfulness is a novel idea for addressing safety in LLMs. * The work is well written and easy to follow.
* I find it confusing that the T3 method is bolded in Tables 1 and 2, even when it is not the best-performing method. * The authors do not provide enough details of how FPR@95TPR is calculated. Some benchmarks that the authors use only have examples of one class, making the calculation of this metric impossible. * The authors have not utilized the newest guard models, such as WildGuard [1] and newer versions of Llama Guard. * T3 is not evaluated on more demanding safety datasets such as WildGuard
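The FPR@95TPR concern raised in the review above can be illustrated with a short sketch of one common way to compute the metric. The source does not state the paper's exact procedure, so this is an assumption throughout: unsafe prompts are treated as the positive class (label 1), a higher score means more atypical, and the function name `fpr_at_tpr` is hypothetical. The sketch also shows why a benchmark containing only one class leaves the metric undefined.

```python
import math

def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the loosest threshold reaching `target_tpr`.

    labels: 1 = unsafe (positive), 0 = safe (negative).
    scores: higher means more atypical (assumed convention).
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        # TPR needs positives and FPR needs negatives, so a single-class
        # benchmark gives no defined value.
        raise ValueError("FPR@TPR needs both safe and unsafe examples")
    # Smallest number of positives that must be flagged to reach the target TPR.
    n_needed = math.ceil(target_tpr * len(pos))
    threshold = pos[n_needed - 1]  # flag every score >= threshold
    false_positives = sum(1 for s in neg if s >= threshold)
    return false_positives / len(neg)

# Toy scores: four unsafe prompts followed by four safe prompts.
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
fpr = fpr_at_tpr(scores, labels)
print(fpr)  # 0.5
```

With only four positives, reaching 95% TPR requires flagging all of them, so the threshold drops to the lowest unsafe score (0.2) and two of the four safe prompts are flagged.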
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques
