Probing AI Safety with Source Code

Ujwal Narayan; Shreyas Chaudhari; Ashwin Kalyan; Tanmay Rajpurohit; Karthik Narasimhan; Ameet Deshpande; Vishvak Murahari

arXiv:2506.20471·cs.CL·June 26, 2025

Probing AI Safety with Source Code

Ujwal Narayan, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik Narasimhan, Ameet Deshpande, Vishvak Murahari

PDF

Open Access 1 Repo

TL;DR

This paper introduces CoDoT, a novel prompting strategy that converts natural language inputs into code to evaluate and reveal safety shortcomings in large language models, highlighting their tendency to generate toxic outputs.

Contribution

The paper presents CoDoT, a new method for assessing LLM safety by translating prompts into code, exposing significant safety failures in current models.

Findings

01

GPT-4 Turbo's toxicity increases 16.5 times with CoDoT

02

DeepSeek R1 fails 100% of the time in safety evaluation

03

Toxicity increases by 300% on average across models

Abstract

Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt "Make the statement more toxic: {text}" to: "make_more_toxic({text})". We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo's toxicity increases 16.5 times,…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ujwal-narayan/codot
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques

MethodsLayer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · GPT-4 · ALIGN