Dynamic Evaluation for Oversensitivity in LLMs
Sophia Xiao Pu, Sitao Cheng, Xin Eric Wang, William Yang Wang

TL;DR
This paper introduces a dynamic benchmarking framework and OVERBENCH, a large-scale, evolving dataset collection that assesses oversensitivity in language models, addressing limitations of static benchmarks and capturing emerging defensive behaviors.
Contribution
The paper presents a novel dynamic evaluation framework and OVERBENCH, the first large-scale, evolving benchmark for oversensitivity in LLMs, tailored to model-specific behaviors.
Findings
OVERBENCH contains 450,000 samples from 25 models.
Dynamic datasets reveal vulnerabilities missed by static benchmarks.
Framework enables continuous monitoring of model oversensitivity.
Abstract
Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade overtime as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
