TL;DR
This paper introduces a neural approach combining static analysis and control-flow graph representations to improve procedure name prediction in stripped binaries, significantly outperforming previous models.
Contribution
The novel integration of static analysis with neural models and CFG encoding advances reverse engineering of stripped executables.
Findings
Models outperform existing methods by 28%
Achieve 100% improvement over neural textual models without static analysis
Produce predictions that are difficult and time-consuming for humans to make
| Instruction | |||||
|---|---|---|---|---|---|
| mov rax,5 | rax | ||||
| mov rax,[rbx+5] | rbx | rax | rbx+5 | ||
| call rcx | rcx | rax | rcx |
| Model | Precision (Stripped) | Recall (Stripped) | F1 (Stripped) | Precision (Stripped & Obfuscated API calls) | Recall (Stripped & Obfuscated API calls) | F1 (Stripped & Obfuscated API calls) |
|---|---|---|---|---|---|---|
| LSTM-text | 22.32 | 21.16 | 21.72 | 15.46 | 14.00 | 14.70 |
| Transformer-text | 25.45 | 15.97 | 19.64 | 18.41 | 12.24 | 14.70 |
| Debin (He et al., 2018) | 34.86 | 32.54 | 33.66 | 32.10 | 28.76 | 30.09 |
| DIRE (Lacomis et al., 2019) | 38.02 | 33.33 | 35.52 | 23.14 | 25.88 | 24.43 |
| Nero-LSTM | 39.94 | 38.89 | 39.40 | 39.12 | 31.40 | 34.83 |
| Nero-Transformer | 41.54 | 38.64 | 40.04 | 36.50 | 32.25 | 34.24 |
| Nero-GNN | 48.61 | 42.82 | 45.53 | 40.53 | 37.26 | 38.83 |
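
The F1 values in the table above appear consistent with the usual harmonic mean of precision and recall. The following is a minimal sketch of that check, assuming the standard F1 definition; the function name `f1_score` and the choice of rows are illustrative, and the exact subtoken-level way precision and recall are counted is not given in this excerpt.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the "Stripped" columns of the table.
rows = {
    "Debin": (34.86, 32.54, 33.66),
    "DIRE": (38.02, 33.33, 35.52),
    "Nero-GNN": (48.61, 42.82, 45.53),
}

for model, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Because the reported precision and recall are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit for some rows.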
| Model | Prediction | | | |
|---|---|---|---|---|
| Ground truth | locate unset | free words | get user groups | install signal handlers |
| Debin | var is unset | search | display | signal setup |
| DIRE | env concat | restore | prcess file | overflow |
| LSTM-text | url get arg | func free | <unk> | <unk> |
| Transformer-text | <unk> | <unk> | close stdin | <unk> |
| Nero-LSTM | var is unset | quotearg free | get user groups | enable mouse |
| Nero-Transformer | var is unset | quotearg free | open op | <empty> |
| Nero-GNN | var is unset | free table | get user groups | signal enter handlers |
| Model | Prec | Rec | F1 |
|---|---|---|---|
| BiLSTM calls | 23.45 | 24.56 | 24.04 |
| BiLSTM call sites | 36.05 | 31.77 | 33.77 |
| Nero-LSTM no-values | 27.22 | 23.91 | 25.46 |
| Nero-Transformer no-values | 29.84 | 24.08 | 26.65 |
| Nero-GNN no-values | 45.20 | 32.65 | 37.91 |
| Nero-LSTM no-library-debug | 39.51 | 40.33 | 39.92 |
| Nero-Transformer no-library-debug | 43.60 | 37.65 | 40.44 |
| Nero-GNN no-library-debug | 47.73 | 42.82 | 45.14 |
| Nero TransformerLSTM | 39.05 | 36.47 | 37.72 |
| Nero-LSTM | 39.94 | 38.89 | 39.40 |
| Nero-Transformer | 41.54 | 38.64 | 40.04 |
| Nero-GNN | 48.61 | 42.82 | 45.53 |
| Error Type | Package | Ground Truth | Predicted Name |
|---|---|---|---|
| Programmers vs. English Language | wget | i18n_initialize | i18n_init |
| | direvent | split_cfg_path | split_config_path |
| | gzip | add_env_opt | add_option |
| Data Structure Name Missing | gtypist | get_best_speed | get_list_item |
| | wget | ftp_parse_winnt_ls | parse_tree |
| | direvent | filename_pattern_free | free_buffer |
| | gzip | abort_gzip_signal | fatal_signal_handler |
| Verb Replacement | findutils | share_file_fopen | add_file |
| | units | read_units | parse |
| | wget | retrieve_from_file | get_from_file |
| | mcsim | display_help | show_help |
| Ground Truth | He et al. (2018) | LSTM-text | Transformer-text | BiLSTM call-sites | Nero-LSTM |
|---|---|---|---|---|---|
| mktime from utc | nettle pss … | get boundary | <unk> | str file | mktime |
| read buffer | concat | fopen safer | mh print fmtspec | net read | filter read |
| get widech | get byte | user | mh decode rcpt flag | <unk> | do tolower |
| ftp parse winnt ls | uuconf iv … | mktime | print status | send to file | parse tree |
| write init pos | allocate pic buf | open int | <unk> | print type | cfg init |
| wait for proc | wait subprocess | start open | mh print fmtspec | <unk> | strip |
| read string | cmp | error | check command | process | io read |
| find env | find env pos | proper name utf | close stream | read token | find env |
| write calc jacob | usage msg | update pattern | print one paragraph | <unk> | write |
| write calc outputs | fsquery show | debug section | cwd advance fd | <unk> | write |
| get script line | get line | make dir hier | <unk> | read ps line | jconfig get |
| getuser readline | stdin read readline | rushdb print | mh decode rcpt flag | write line | readline read |
| set max db age | do link | set owner | make dir hier | sparse copy | set |
| write calc deriv | orthodox hdy | ds symbol | close stream | fprint entry | write type |
| read file | bt open | <unk> | … disable coredump | <unk> | vfs read file |
| parse options | parse options | finish | mh print fmtspec | get options | parse args |
| url free | hash rehash | hostname destroy | setupvariables | hol free | free dfa content |
| check new watcher | read index | check opt | <unk> | open source | check file |
| open input file | get options | query in | ck rename | set | delete input |
| write calc jacob | put in fp table | save game var | hostname destroy | <unk> | write |
| filename pattern free | add char segment | free dfa content | hostname destroy | glob cleanup | free exclude segment |
| read line | tartime | init all | close stdout | parse args | read |
| ftp parse unix ls | serv select fn | canonicalize | <unk> | <unk> | parse syntax option |
| free netrc | gea compile | hostname destroy | hostname destroy | free ent | hol free |
| string to bool | string to bool | setnonblock | mh decode rcpt flag | string to bool | string to bool |
- ABI: application binary interface
- API: application program interface
- AST: abstract syntax tree
- AUC: area under (the) curve
- BB: basic block
- CROC: concentrated ROC
- CFG: control-flow graph
- CU: compilation unit
- CVE: common vulnerabilities (and) exposures
- CPU: central processing unit
- DAG: directed acyclic graph
- DNN: deep neural network
- ELF: executable and linkable format
- GOT: global offset table
- GNN: graph neural network
- GCN: graph convolutional network
- FP: false positive
- FN: false negative
- FOL: first order logic
- HTTP: the hypertext transfer protocol
- IL: intermediate language
- IOT: internet of things
- IP: intellectual property
- IR: intermediate representation
- ISA: instruction set architecture
- IVL: intermediate verification language
- LCS: longest common subsequence
- LSTM: long short-term memory network
- ML: machine language
- NLP: natural language processing
- NMT: neural machine translation
- OS: operating system
- OOV: out-of-vocabulary
- PDG: program dependence graph
- PIC: position independent code
- TP: true positive
- TN: true negative
- RE: reverse engineering
- ROC: receiver operating characteristic
- RNN: recurrent neural network
- SSL: secure sockets layer
- SSA: static single assignment
- seq2seq: sequence-to-sequence
Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs
Yaniv David (Technion, Israel), Uri Alon (Technion, Israel), and Eran Yahav (Technion, Israel)
(2020)
Abstract.
We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations.
We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures.
Our evaluation shows that our models produce predictions that are difficult and time-consuming for humans, while improving on existing methods by 28% and by 100% over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.
Journal: PACMPL, Volume 4, Number OOPSLA, November 2020. DOI: 10.1145/3428230
1. Introduction
Reverse engineering (RE) of executables has a variety of applications such as improving and debugging legacy programs. Furthermore, it is crucial to analyzing malware. Unfortunately, it is a hard skill to learn, and it takes years to master. Even experienced professionals often have to invest long hours to obtain meaningful results. The main challenge is to understand how the different “working parts” inside the executable are meant to interact to carry out the objective of the executable. A human reverse-engineer has to guess, based on experience, the more interesting procedures to begin with, follow the flow in these procedures, use inter-procedural patterns and finally, piece all these together to develop a global understanding of the purpose and usage of the inspected executable.
Despite great progress on disassemblers [IDAPRO; RADAR], static analysis frameworks (Katz et al., 2018; Lee et al., 2011) and similarity detectors (David et al., 2017; Pewny et al., 2015), for the most part, the reverse engineering process remains manual.
Reviewing source code containing meaningful names for procedures can reduce human effort dramatically, since it saves the time and effort of looking at some procedure bodies (Alon et al., 2019c; Høst and Østvold, 2009; Fowler and Beck, 1999; Jacobson et al., 2011). Binary executables are usually stripped, i.e., the debug information containing procedure names is removed entirely.
As a result of executable stripping, a major part of a reverse engineer’s work is to manually label procedures after studying them. Votipka et al. (2020) detail this process in a user study of reverse engineers and depict their reliance on internal and external procedure names throughout their study.
In recent years, great strides have been made in the analysis of source code using learned models from automatic inference of variables and types (Raychev et al., 2015; Bielik et al., 2016; Alon et al., 2018; Bavishi et al., 2018; Allamanis et al., 2018) to bug detection (Pradel and Sen, 2018; Rice et al., 2017), code summarization (Allamanis et al., 2016; Alon et al., 2019c, a), code retrieval (Sachdev et al., 2018; Allamanis et al., 2015b) and even code generation (Murali et al., 2017; Brockschmidt et al., 2019; Lu et al., 2017; Alon et al., 2019b). However, all of these address high-level and syntactically-rich programming languages such as Java, C# and Python. None of them address the unique challenges present in executables.
Problem definition. Given a nameless assembly procedure $\mathcal{X}$ residing in a stripped (containing no debug information) executable, our goal is to predict a likely and descriptive name $\mathcal{Y} = y_1, \ldots, y_m$, where $y_1, \ldots, y_m$ are the subtokens composing $\mathcal{Y}$. Thus, our goal is to model $P(\mathcal{Y} \mid \mathcal{X})$. For example, for the name create_server_socket, the subtokens that we aim to predict are create, server and socket, respectively.
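
To make the target side of this formulation concrete, the sketch below splits a procedure name into the subtokens a model would be asked to predict. It is a minimal illustration, not the authors' preprocessing: the helper name `split_subtokens` is hypothetical, and handling camelCase in addition to underscores is an assumption that goes beyond the underscore example in the text.

```python
import re
from typing import List


def split_subtokens(name: str) -> List[str]:
    """Split a procedure name into lowercase subtokens.

    Splits on underscores and other non-alphanumeric characters, then on
    camelCase boundaries and digit groups (the latter two are assumptions).
    """
    subtokens: List[str] = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (e.g. "HTTP"), capitalized or lowercase words, digit groups.
        subtokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [token.lower() for token in subtokens]


print(split_subtokens("create_server_socket"))  # ['create', 'server', 'socket']
```

Under this view, a predicted name is simply the sequence of generated subtokens, which can be joined back with underscores for display.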
The problem of predicting a meaningful name for a given procedure can be viewed as a translation task: translating from assembly code to natural language. While this high-level view of the problem is shared with previous work (e.g., Allamanis et al., 2016; Alon et al., 2019c, a), the technical challenges are vastly different due to the different characteristics of binaries.
References
- Alammar ([n. d.]) Jay Alammar. [n. d.]. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
- Allamanis (2018) Miltiadis Allamanis. 2018. The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv preprint arXiv:1812.06469 (2018).
- Allamanis et al. (2015a) Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015a. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 38–49. https://doi.org/10.1145/2786805.2786849
- Allamanis et al. (2018) Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In ICLR.
- Allamanis et al. (2016) Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2091–2100. http://jmlr.org/proceedings/papers/v48/allamanis16.html
- Allamanis et al. (2015b) Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015b. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 2123–2132. http://dl.acm.org/citation.cfm?id=3045118.3045344
- Alon et al. (2019a) Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019a. code2seq: Generating Sequences from Structured Representations of Code. In International Conference on Learning Representations. https://openreview.net/forum?id=H1gKYo09tX
