Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling
Pratyay Banerjee, Kuntal Kumar Pal, Fish Wang, Chitta Baral

TL;DR
This paper introduces VarBERT, a neural network model that leverages masked language modeling and a novel finetuning technique to accurately recover variable names from decompiled binary code, significantly outperforming previous methods.
Contribution
The paper presents a new approach using Constrained Masked Language Modeling and neural architectures like BERT to improve variable name recovery in decompiled code, with a novel post-processing algorithm for token count prediction.
Findings
Achieves up to 84.15% accuracy in variable name prediction
Outperforms existing state-of-the-art methods
Uses a large-scale dataset of 164,632 binaries
Abstract
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine. While modern decompilers can reconstruct and recover much information that is discarded during compilation, inferring variable names is still extremely difficult. Inspired by recent advances in natural language processing, we propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT. Our solution takes raw decompiler output, the less semantically meaningful code, as input, and enriches it using our proposed finetuning technique, Constrained Masked Language Modeling. Using Constrained Masked Language Modeling introduces the challenge of predicting the number of masked tokens for the original variable name.…
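The masking idea in the abstract can be illustrated with a minimal sketch. It assumes, as one reading of the abstract, that only variable positions are masked (rather than random tokens, as in standard masked language modeling) and that each variable is replaced by as many mask tokens as its original name has subword tokens. The function names, the `[MASK]` string, and the toy splitter standing in for Byte-Pair Encoding are hypothetical and not taken from the paper.

```python
import re

MASK = "[MASK]"  # assumed mask symbol, as used by BERT-style models


def toy_subword_split(name):
    """Hypothetical stand-in for a BPE tokenizer.

    Splits an identifier on underscores, capital letters, and digit runs.
    """
    return re.findall(r"[A-Za-z][a-z0-9]*|[0-9]+", name)


def constrained_mask(code, original_names):
    """Mask only the variables listed in original_names.

    original_names maps a decompiler placeholder (e.g. "v1") to the
    developer's original name (e.g. "buf_len"), which is assumed to be
    known at training time. Each whole-identifier occurrence of a
    placeholder is replaced by one mask per subword of the original name.
    Returns the masked code and the target subwords for each masked
    occurrence, in order of appearance.
    """
    if not original_names:
        return code, []
    # Longest placeholders first, so alternation prefers the longest match.
    keys = sorted(original_names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    targets = []

    def replace(match):
        original = original_names[match.group(1)]
        pieces = toy_subword_split(original) or [original]
        targets.append(pieces)
        return " ".join([MASK] * len(pieces))

    return pattern.sub(replace, code), targets


masked, targets = constrained_mask(
    "int v1 = strlen(a1); return v1 + 1;",
    {"v1": "buf_len", "a1": "input"},
)
print(masked)   # int [MASK] [MASK] = strlen([MASK]); return [MASK] [MASK] + 1;
print(targets)  # [['buf', 'len'], ['input'], ['buf', 'len']]
```

At inference time the original name is unknown, so the number of masks to place at each variable cannot be read off as it is here. That is the token-count prediction challenge the abstract mentions.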
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling
Methods: Linear Layer · Residual Connection · Weight Decay · WordPiece · Adam · Dense Connections · Softmax · Layer Normalization · Dropout
