Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Shaked Zychlinski; Yuval Kainan

arXiv:2510.26847·cs.CR·November 3, 2025

TL;DR

This paper introduces CPT-Filtering, a simple and effective method to detect obfuscated prompts in large language models by analyzing the average characters per token, significantly improving safety guardrails against jailbreak attacks.

Contribution

The paper proposes a novel, model-agnostic technique that uses the average number of characters per token to identify encoded malicious prompts at negligible computational cost.

Findings

01. High accuracy in detecting encoded prompts across various schemes
02. Robust performance even on very short inputs
03. Applicable for real-time filtering and offline data curation

Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated…
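
The CPT idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the threshold of 2.0, and the toy greedy tokenizer are all assumptions made for this example. The toy tokenizer only imitates the relevant BPE behavior (familiar text maps to long tokens, unfamiliar text falls back to single characters); in practice a real BPE tokenizer's encode function would be passed in its place.

```python
import codecs
from typing import Callable, List, Sequence

def chars_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of characters per token; 0.0 when there are no tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(text) / len(tokens)

def looks_obfuscated(text: str, tokenize: Callable[[str], Sequence],
                     threshold: float = 2.0) -> bool:
    """Flag text whose CPT falls below the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from the paper; it would need tuning per tokenizer.
    """
    if not text:
        return False
    return chars_per_token(text, tokenize) < threshold

# Hypothetical stand-in for a trained BPE vocabulary (assumption for the demo).
TOY_VOCAB = {"tell", " me", " how", " to", " bake", " a", " cake"}

def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer that falls back to single characters."""
    max_len = max(len(piece) for piece in TOY_VOCAB)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens

plain = "tell me how to bake a cake"
cipher = codecs.encode(plain, "rot_13")  # same text under a ROT13 cipher

plain_cpt = chars_per_token(plain, toy_tokenize)    # few long tokens
cipher_cpt = chars_per_token(cipher, toy_tokenize)  # many one-character tokens
```

With this toy vocabulary the plain sentence splits into 7 tokens while its ROT13 form splits into 26 single-character tokens, so the ciphered text has a much lower CPT and is the only one flagged by `looks_obfuscated`. Real tokenizers and realistic thresholds will give different numbers.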

Code & Models

Datasets

jfrog/obfuscation-identification (dataset · 22 downloads)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Authorship Attribution and Profiling