Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests

David Noever; Forrest McKee

arXiv:2502.06867·cs.CL·February 12, 2025

TL;DR

This paper introduces an open-source benchmark dataset and testing framework to evaluate large language models' safety mechanisms, focusing on their responses to sensitive scientific queries and the balance between safety and open discourse.

Contribution

It provides a systematic, reproducible method for assessing AI safety measures across multiple models, highlighting their safety profiles and response consistency.

Findings

01

Claude-3.5-sonnet is the most conservative model, refusing 73% of queries.

02

GPT-3.5-turbo is moderately restrictive with 10% refusals.

03

Response consistency decreases as more prompt variations are used, starting at 85% with single prompts.

Abstract

The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five…
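The refusal and allowance percentages reported above can be illustrated with a minimal sketch. It assumes each model response has already been labeled `"refuse"` or `"allow"`. The function names, the label strings, and the definition of consistency used here (the share of base queries that receive the same label across all of their prompt variations) are assumptions made for illustration and are not taken from the paper's framework. The inputs are toy data, not benchmark results.

```python
from collections import Counter


def refusal_profile(labels):
    """Return (refusal %, allowance %) for a list of 'refuse'/'allow' labels.

    The label vocabulary is a hypothetical convention for this sketch.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    refusal = 100.0 * counts["refuse"] / total
    allowance = 100.0 * counts["allow"] / total
    return refusal, allowance


def consistency(labels_by_query):
    """Return the percentage of base queries labeled identically
    across all of their prompt variations.

    labels_by_query maps a query id to the list of labels obtained
    from each variation of that query. This definition is an
    assumption; the paper may measure consistency differently.
    """
    if not labels_by_query:
        raise ValueError("labels_by_query must not be empty")
    stable = sum(1 for labels in labels_by_query.values() if len(set(labels)) == 1)
    return 100.0 * stable / len(labels_by_query)


if __name__ == "__main__":
    # Toy labels for one hypothetical model on four queries.
    toy_labels = ["refuse", "allow", "allow", "allow"]
    print(refusal_profile(toy_labels))

    # Toy labels for two queries, each asked with two prompt variations.
    toy_variations = {"q1": ["allow", "allow"], "q2": ["refuse", "allow"]}
    print(consistency(toy_variations))
```

Reporting both percentages side by side mirrors the way the abstract pairs refusals with allowances for each model, while the per-query grouping keeps consistency separate from the overall refusal rate.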

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Adam · Softmax · Dropout