A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer
Hanxiao Lu, Hongyu Cai, Yiming Liang, Antonio Bianchi, and Z. Berkay Celik

TL;DR
This paper introduces ProTST, a progressive transformer framework that unifies binary code embedding and knowledge transfer, improving performance across multiple binary analysis tasks without complex feature engineering.
Contribution
ProTST employs a hierarchical, progressive training paradigm that enhances binary code embeddings by building knowledge from fundamental to specialized tasks, reducing reliance on complex features.
Findings
ProTST outperforms traditional two-stage training by 14.8% in validation scores.
ProTST surpasses multimodal frameworks by 10.7% on average.
ProTST effectively improves binary analysis tasks without complex feature engineering.
Abstract
Language model approaches have recently been integrated into binary analysis tasks, such as function similarity detection and function signature recovery. These models typically employ a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code and fine-tuning for specific tasks. While MLM helps to understand binary code structures, it ignores essential code characteristics, including control and data flow, which negatively affects model generalization. Recent work leverages domain-specific features (e.g., control flow graphs and dynamic execution traces) in transformer-based approaches to improve binary code semantic understanding. However, this approach involves complex feature engineering, a cumbersome and time-consuming process that can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In…
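To make the MLM pre-training stage mentioned in the abstract concrete, here is a minimal sketch of the masking step on a tokenized instruction sequence. It is not ProTST's implementation: the `mask_tokens` helper, the `<mask>` symbol, the masking probability, and the toy x86-64 tokenization are all assumptions for illustration, and the sketch omits the random-replacement and keep-as-is variants that BERT-style MLM normally applies.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM objective.

    labels[i] holds the original token where position i was masked,
    and None elsewhere (unmasked positions are not scored).
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical tokenization of a few x86-64 instructions.
tokens = ["push", "rbp", "mov", "rbp", "rsp", "sub", "rsp", "0x10", "ret"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

A model trained on this objective only sees local token context, which is consistent with the abstract's point that MLM alone does not capture control flow or data flow.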
Taxonomy
Topics: Advanced Computational Techniques and Applications · Speech Recognition and Synthesis · Neural Networks and Applications
