Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis
Michael J. Bommarito II

TL;DR
This paper introduces the Binary BPE tokenizer family, a set of cross-platform Byte Pair Encoding (BPE) tokenizers trained on diverse binaries that fit roughly 2-3x more binary content into each transformer context window for binary analysis tasks.
Contribution
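The paper presents a family of cross-platform BPE tokenizers trained on a large, diverse corpus of binaries and released at vocabulary sizes from 4K to 64K tokens, supporting both systematic scaling studies and deployment from resource-constrained edge devices to high-throughput datacenters.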
Findings
Binary BPE tokenizers discover interpretable binary patterns.
They enable 2-3x more binary content per transformer context window.
Tokenizers are released as open-source on HuggingFace.
Abstract
Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on arbitrary 0x00–0xFF sequences. To address this issue, we introduce the Binary BPE tokenizer family, a set of cross-platform Byte Pair Encoding (BPE) tokenizers for executables trained on a large corpus of binaries spanning multiple platforms, architectures, and operating systems, including Linux, Windows, macOS, Android, and malware sources. We release trained tokenizers with vocabularies of 4K, 8K, 16K, 32K, and 64K tokens, enabling both systematic scaling studies and practical deployment from resource-constrained edge devices to high-throughput datacenters. These tokenizers discover interpretable patterns (ELF/PE headers, instruction sequences,…
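The idea behind byte-level BPE can be illustrated with a minimal sketch. This is not the paper's training code or its released tokenizers: the function names (`train_byte_bpe`, `merge_pair`, `encode`, `decode`) and the toy input are assumptions made for illustration. The sketch starts from the 256 raw byte values, repeatedly merges the most frequent adjacent pair into a new token id, and shows that the encoding round-trips losslessly over arbitrary byte values.

```python
from collections import Counter


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_byte_bpe(data: bytes, num_merges: int):
    """Learn up to `num_merges` merges over a raw byte sequence.

    Returns (merges, tokens): `merges` maps (left_id, right_id) -> new_id in
    the order learned, and `tokens` is the training data after all merges.
    """
    tokens = list(data)  # base vocabulary: byte values 0..255
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges would not compress
            break
        merges[pair] = next_id
        tokens = merge_pair(tokens, pair, next_id)
        next_id += 1
    return merges, tokens


def encode(data: bytes, merges):
    """Apply the learned merges in the order they were learned."""
    tokens = list(data)
    for pair, new_id in merges.items():  # dicts preserve insertion order
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


def decode(tokens, merges) -> bytes:
    """Expand merged tokens back into the original bytes."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    out = bytearray()
    stack = list(reversed(tokens))
    while stack:
        t = stack.pop()
        if t < 256:
            out.append(t)
        else:
            left, right = expand[t]
            stack.append(right)  # pushed first so `left` is popped first
            stack.append(left)
    return bytes(out)


if __name__ == "__main__":
    # Toy input (an assumption, not real training data): a repeated
    # ELF-like magic prefix followed by every possible byte value.
    sample = b"\x7fELF\x02\x01\x01\x00" * 8 + bytes(range(256))
    merges, _ = train_byte_bpe(sample, num_merges=16)
    ids = encode(sample, merges)
    print(f"{len(sample)} bytes -> {len(ids)} tokens")
```

Because the base vocabulary covers all 256 byte values, any input can be encoded, which is the property that text-oriented tokenizers may lack. The ratio of input bytes to output tokens is what determines how much binary content fits in a fixed context window.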
Taxonomy
Topics: Advanced Malware Detection Techniques · Security and Verification in Computing · Digital and Cyber Forensics
