Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
Meifang Chen, Zhe Yang, Huang Nianchen, Yizhan Huang, Yichen Li, Zihan Li, and Michael R. Lyu

TL;DR
This paper investigates how Byte-Pair Encoding tokenization causes secret leakage risks in Code Large Language Models by creating a gibberish bias that makes certain secrets easier to memorize.
Contribution
It uncovers the impact of BPE tokenization on secret memorization in CLLMs and analyzes the root causes of this bias with implications for tokenizer design.
Findings
Secrets with high character entropy but low token entropy are more easily memorized.
Token distribution shift between training data and secrets causes gibberish bias.
Larger vocabularies in tokenizers exacerbate secret leakage risks.
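To make the first finding concrete, the sketch below compares entropy at the character level and at the token level for one string. It rests on assumptions not taken from the paper: entropy is computed as the empirical Shannon entropy of symbol frequencies within the string, `greedy_tokenize` is a hypothetical longest-match stand-in for a real BPE tokenizer, and `TOY_VOCAB` and `secret` are invented for illustration. The paper's own definitions may differ.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (an assumption, not real BPE).

    Characters not covered by a vocabulary entry become single-character tokens.
    """
    max_len = max(map(len, vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary and secret, chosen only to illustrate the gap.
TOY_VOCAB = {"q7Zx", "Lm"}
secret = "q7Zxq7Zxq7ZxLm"

tokens = greedy_tokenize(secret, TOY_VOCAB)
char_entropy = shannon_entropy(secret)   # entropy over characters
token_entropy = shannon_entropy(tokens)  # entropy over tokens

print(tokens)
print(f"character-level entropy: {char_entropy:.3f} bits")
print(f"token-level entropy:     {token_entropy:.3f} bits")
```

In this toy case the string looks varied character by character, yet it collapses into a few repeated tokens, so the token-level value comes out lower than the character-level one. Swapping in a real tokenizer would be needed to study the effect the findings describe.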
Abstract
Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. With the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret memorization behavior, which we term gibberish bias. Specifically, we identify that some secrets are among the easiest for CLLMs to memorize. These secrets have high character-level entropy but low token-level entropy. We then support this claim of bias with numerical data. We identify the root of the bias as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the ``larger…
