How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

TL;DR
This paper systematically evaluates how different tokenization algorithms affect the performance of transformer-based models in binary code analysis, highlighting the importance of tokenizer choice for downstream tasks.
Contribution
It provides a comprehensive analysis of tokenization methods tailored for assembly code, exploring their intrinsic properties and impact on model effectiveness in binary analysis tasks.
Findings
Tokenizer choice significantly impacts downstream performance.
Intrinsic metrics only partially predict extrinsic evaluation outcomes.
Trade-offs exist between tokenizer efficiency and semantic fidelity.
Abstract
Tokenization is fundamental in assembly code analysis, impacting both intrinsic characteristics, such as vocabulary size and semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction -- a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing…
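To make the idea of a pre-tokenization rule and an intrinsic metric concrete, here is a minimal Python sketch. It is not the authors' pipeline. The regular expression, the `<IMM>` placeholder, the function names, and the four sample instructions are all assumptions chosen for illustration. It splits x86-style instructions into mnemonics, registers, immediates, and punctuation, then reports two simple intrinsic measures: vocabulary size and average tokens per instruction.

```python
import re
from collections import Counter

# Hypothetical pre-tokenization rule (an assumption, not taken from the paper):
# hex immediates, identifiers (mnemonics/registers), decimal numbers, punctuation.
TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|[A-Za-z_.][A-Za-z0-9_.]*|\d+|[\[\]+\-*,:]")
IMMEDIATE_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def pre_tokenize(instruction, normalize_immediates=False):
    """Split one assembly instruction into coarse tokens."""
    tokens = TOKEN_RE.findall(instruction)
    if normalize_immediates:
        # Replace every numeric literal with one placeholder token.
        tokens = ["<IMM>" if IMMEDIATE_RE.fullmatch(t) else t for t in tokens]
    return tokens


def intrinsic_metrics(instructions, normalize_immediates=False):
    """Compute vocabulary size and mean tokens per instruction."""
    counts = Counter()
    total_tokens = 0
    for instruction in instructions:
        tokens = pre_tokenize(instruction, normalize_immediates)
        counts.update(tokens)
        total_tokens += len(tokens)
    return {
        "vocab_size": len(counts),
        "tokens_per_instruction": total_tokens / len(instructions),
    }


# Illustrative sample, invented for this sketch.
sample = [
    "mov rax, [rbp-0x8]",
    "add rax, 0x10",
    "mov [rbp-0x18], rax",
    "call 0x401000",
]

raw = intrinsic_metrics(sample)
normalized = intrinsic_metrics(sample, normalize_immediates=True)
print(raw)
print(normalized)
```

On this toy sample, normalizing immediates reduces the vocabulary while leaving the sequence length unchanged, at the cost of discarding the concrete offsets and addresses. That is one small instance of the efficiency versus semantic fidelity trade-off listed under Findings.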
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Advanced Malware Detection Techniques
