Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Shaked Zychlinski, Yuval Kainan

TL;DR
This paper introduces CPT-Filtering, a simple and effective method that detects obfuscated prompts sent to large language models by measuring the average number of characters per token, strengthening safety guardrails against jailbreak attacks.
Contribution
The paper proposes a novel, model-agnostic technique that uses characters per token to identify encoded malicious prompts at negligible computational cost.
Findings
High accuracy in detecting encoded prompts across various schemes
Robust performance even on very short inputs
Applicable for real-time filtering and offline data curation
Abstract
Large Language Models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated…
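The CPT idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy vocabulary, the greedy longest-match tokenizer standing in for a real BPE tokenizer, the function names, and the threshold of 2.0 are all assumptions chosen only to make the example self-contained.

```python
import codecs
from typing import Callable, List

# Hypothetical toy vocabulary standing in for a BPE vocabulary learned from
# natural-language text. A real tokenizer would have tens of thousands of entries.
TOY_VOCAB = {"the", " the", " quick", " brown", " fox", " jumps",
             " over", " lazy", " dog"}
MAX_PIECE_LEN = max(len(piece) for piece in TOY_VOCAB)


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer (a stand-in for BPE, not BPE itself).

    Any character not covered by the vocabulary becomes a one-character token,
    mimicking how out-of-distribution text fragments into short tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_PIECE_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def chars_per_token(text: str,
                    tokenize: Callable[[str], List[str]] = toy_tokenize) -> float:
    """Average number of characters per token; 0.0 for empty input."""
    tokens = tokenize(text)
    return len(text) / len(tokens) if tokens else 0.0


def looks_obfuscated(text: str, threshold: float = 2.0,
                     tokenize: Callable[[str], List[str]] = toy_tokenize) -> bool:
    """Flag text whose CPT falls below an assumed, illustrative threshold."""
    return chars_per_token(text, tokenize) < threshold


plain = "the quick brown fox jumps over the lazy dog"
ciphered = codecs.encode(plain, "rot13")  # same length, unfamiliar characters

print(chars_per_token(plain))     # 43 characters over 9 tokens, roughly 4.8
print(chars_per_token(ciphered))  # every character is its own token: 1.0
print(looks_obfuscated(plain), looks_obfuscated(ciphered))  # False True
```

In practice the `tokenize` argument would be replaced by the encoder of an actual BPE tokenizer, and the threshold would need to be calibrated on real data rather than taken from this sketch.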
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Authorship Attribution and Profiling
