Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

Yaniv David; Uri Alon; Eran Yahav

arXiv:1902.09122 · cs.LG · December 1, 2020


TL;DR

This paper introduces a neural approach combining static analysis and control-flow graph representations to improve procedure name prediction in stripped binaries, significantly outperforming previous models.

Contribution

The novel integration of static analysis with neural models and CFG encoding advances reverse engineering of stripped executables.

Findings

1. The models outperform existing methods by 28%.

2. The models improve by 100% over neural textual models that do not use static analysis.

3. The models produce predictions that are difficult and time-consuming for humans to make.

Abstract

We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations. We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures. Our evaluation shows that our models produce predictions that are difficult and time consuming for humans, while…


Tables

Table 1. An example of slice information sets created by three x64 instructions: the $V_{read|write}$ sets show values read and written, and the $P_{read|write}$ sets show pointer dereferences for reading from and writing to memory.

	Instruction	$V_{read}$	$V_{write}$	$P_{read}$	$P_{write}$
$inst_{1}$	mov rax,5	$5$	rax	$\emptyset$	$\emptyset$
$inst_{2}$	mov rax,[rbx+5]	rbx	rax	rbx+5	$\emptyset$
$inst_{3}$	call rcx	rcx	rax	rcx	$\emptyset$
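The rows of Table 1 can be reproduced with a toy extractor. The sketch below is only an illustration under stated assumptions: it handles just the `mov` and `call` forms shown in the table, the names `SliceInfo`, `registers_in` and `slice_info` are hypothetical, and the rule for memory destinations is an assumption that does not appear in the table. It is not the static analysis used in the paper.

```python
import re
from dataclasses import dataclass, field

# Small register set, enough for the Table 1 examples (assumption).
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rsp", "rbp"}


@dataclass
class SliceInfo:
    v_read: set = field(default_factory=set)   # values read
    v_write: set = field(default_factory=set)  # values written
    p_read: set = field(default_factory=set)   # pointers dereferenced for reading
    p_write: set = field(default_factory=set)  # pointers dereferenced for writing


def registers_in(expr: str) -> set:
    """Return the known registers mentioned in an address expression."""
    return {tok for tok in re.findall(r"[a-z][a-z0-9]*", expr) if tok in REGISTERS}


def slice_info(instruction: str) -> SliceInfo:
    """Build the four sets for `mov dst,src` and `call target` only."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    info = SliceInfo()
    if mnemonic == "mov":
        dst, src = [op.strip() for op in operands.split(",", 1)]
        if src.startswith("["):           # memory read: [rbx+5]
            address = src[1:-1]
            info.p_read.add(address)
            info.v_read |= registers_in(address)
        else:                             # register or immediate
            info.v_read.add(src)
        if dst.startswith("["):           # memory write (assumed rule, not in Table 1)
            address = dst[1:-1]
            info.p_write.add(address)
            info.v_read |= registers_in(address)
        else:
            info.v_write.add(dst)
    elif mnemonic == "call":
        target = operands.strip()
        info.v_read.add(target)           # the target register is read...
        info.p_read.add(target)           # ...and dereferenced
        info.v_write.add("rax")           # return value, as in Table 1
    else:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    return info


for text in ("mov rax,5", "mov rax,[rbx+5]", "call rcx"):
    print(text, slice_info(text))
```

A real analysis would rely on a disassembler and full instruction semantics; the point here is only the shape of the four sets per instruction.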

Table 2. Our models outperform previous work, DIRE and Debin, by relative improvements of 28% and 35%, respectively; learning from the flat assembly code (LSTM-text, Transformer-text) yields much lower results. Obfuscating API calls hurts all models, but thanks to the use of abstract and concrete values, our model still performs better than the baselines.

	Stripped			Stripped & Obfuscated API calls
Model	Precision	Recall	F1	Precision	Recall	F1
LSTM-text	22.32	21.16	21.72	15.46	14.00	14.70
Transformer-text	25.45	15.97	19.64	18.41	12.24	14.70
Debin (He et al., 2018)	34.86	32.54	33.66	32.10	28.76	30.09
DIRE (Lacomis et al., 2019)	38.02	33.33	35.52	23.14	25.88	24.43
Nero-LSTM	39.94	38.89	39.40	39.12	31.40	34.83
Nero-Transformer	41.54	38.64	40.04	36.50	32.25	34.24
Nero-GNN	48.61	42.82	45.53	40.53	37.26	38.83
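The relative improvements quoted in the caption of Table 2 can be checked from the stripped-binary columns. The sketch below assumes that F1 is the harmonic mean of precision and recall and that the improvements are measured on F1 for the best model (Nero-GNN); the helper names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Precision and recall on stripped binaries, copied from Table 2.
nero_gnn_f1 = f1_score(48.61, 42.82)
dire_f1 = f1_score(38.02, 33.33)
debin_f1 = f1_score(34.86, 32.54)

print(f"Nero-GNN F1: {nero_gnn_f1:.2f}")
print(f"vs DIRE:  +{relative_improvement(nero_gnn_f1, dire_f1):.1f}%")
print(f"vs Debin: +{relative_improvement(nero_gnn_f1, debin_f1):.1f}%")
```

Rounded to two decimals, the recomputed F1 values match the table (45.53, 35.52 and 33.66), and the relative gains round to the 28% and 35% stated in the caption.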

Table 3. Examples from our test set and predictions made by the different models. Even when a prediction is not an “exact match” to the ground truth, it usually captures more subtokens of the ground truth than the baselines. More examples can be found in Appendix A.

Model	Prediction
Ground truth	locate unset	free words	get user groups	install signal handlers
Debin	var is unset	search	display	signal setup
DIRE	env concat	restore	prcess file	overflow
LSTM-text	url get arg	func free	<unk>	<unk>
Transformer-text	<unk>	<unk>	close stdin	<unk>
Nero-LSTM	var is unset	quotearg free	get user groups	enable mouse
Nero-Transformer	var is unset	quotearg free	open op	<empty>
Nero-GNN	var is unset	free table	get user groups	signal enter handlers
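The idea of a prediction "capturing subtokens" of the ground truth can be made concrete with a subtoken-overlap score. The sketch below is an assumption about how such a score might be computed (case-insensitive, multiset overlap over whitespace-separated subtokens); it is not taken from the paper's evaluation code, and `subtoken_scores` is a hypothetical name.

```python
from collections import Counter
from typing import Tuple


def subtoken_scores(predicted: str, truth: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted subtokens against the truth."""
    pred = predicted.lower().split()
    gold = truth.lower().split()
    # Multiset intersection: a subtoken counts at most as often as it occurs in the truth.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = "install signal handlers"
# Nero-GNN shares "signal" and "handlers": precision 2/3, recall 2/3.
print(subtoken_scores("signal enter handlers", truth))
# Debin shares only "signal": precision 1/2, recall 1/3.
print(subtoken_scores("signal setup", truth))
```

Under this scoring, the Nero-GNN prediction in the last column gets partial credit even though it is not an exact match.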

Table 4. Variations on our models that ablate different components.

Model	Prec	Rec	F1
BiLSTM calls	23.45	24.56	24.04
BiLSTM call sites	36.05	31.77	33.77
Nero-LSTM no-values	27.22	23.91	25.46
Nero-Transformer no-values	29.84	24.08	26.65
Nero-GNN no-values	45.20	32.65	37.91
Nero-LSTM no-library-debug	39.51	40.33	39.92
Nero-Transformer no-library-debug	43.60	37.65	40.44
Nero-GNN no-library-debug	47.73	42.82	45.14
Nero Transformer $\to$ LSTM	39.05	36.47	37.72
Nero-LSTM	39.94	38.89	39.40
Nero-Transformer	41.54	38.64	40.04
Nero-GNN	48.61	42.82	45.53

Table 5. Examination of common interesting model mistakes.

Error Type	Package	Ground Truth	Predicted Name
Programmers VS English Language	wget	i18n_initialize	i18n_init
	direvent	split_cfg_path	split_config_path
	gzip	add_env_opt	add_option
Data Structure Name Missing	gtypist	get_best_speed	get_list_item
	wget	ftp_parse_winnt_ls	parse_tree
	direvent	filename_pattern_free	free_buffer
	gzip	abort_gzip_signal	fatal_signal_handler
Verb Replacement	findutils	share_file_fopen	add_file
	units	read_units	parse
	wget	retrieve_from_file	get_from_file
	mcsim	display_help	show_help

Table 6. Examples from our test set and predictions made by the different models.

Ground Truth	He et al. (2018)	LSTM-text	Transformer-text	BiLSTM call-sites	Nero-LSTM
mktime from utc	nettle pss …	get boundary	<unk>	str file	mktime
read buffer	concat	fopen safer	mh print fmtspec	net read	filter read
get widech	get byte	user	mh decode rcpt flag	<unk>	do tolower
ftp parse winnt ls	uuconf iv …	mktime	print status	send to file	parse tree
write init pos	allocate pic buf	open int	<unk>	print type	cfg init
wait for proc	wait subprocess	start open	mh print fmtspec	<unk>	strip
read string	cmp	error	check command	process	io read
find env	find env pos	proper name utf	close stream	read token	find env
write calc jacob	usage msg	update pattern	print one paragraph	<unk>	write
write calc outputs	fsquery show	debug section	cwd advance fd	<unk>	write
get script line	get line	make dir hier	<unk>	read ps line	jconfig get
getuser readline	stdin read readline	rushdb print	mh decode rcpt flag	write line	readline read
set max db age	do link	set owner	make dir hier	sparse copy	set
write calc deriv	orthodox hdy	ds symbol	close stream	fprint entry	write type
read file	bt open	<unk>	… disable coredump	<unk>	vfs read file
parse options	parse options	finish	mh print fmtspec	get options	parse args
url free	hash rehash	hostname destroy	setupvariables	hol free	free dfa content
check new watcher	read index	check opt	<unk>	open source	check file
open input file	get options	query in	ck rename	set	delete input
write calc jacob	put in fp table	save game var	hostname destroy	<unk>	write
filename pattern free	add char segment	free dfa content	hostname destroy	glob cleanup	free exclude segment
read line	tartime	init all	close stdout	parse args	read
ftp parse unix ls	serv select fn	canonicalize	<unk>	<unk>	parse syntax option
free netrc	gea compile	hostname destroy	hostname destroy	free ent	hol free
string to bool	string to bool	setnonblock	mh decode rcpt flag	string to bool	string to bool

Equations

$p(y_{1},...,y_{m} \mid x_{1},...,x_{n}) = \prod_{t=1}^{m} p(y_{t} \mid y_{<t}, z_{1},...,z_{n})$

$z_{1},...,z_{n}$

$h_{1}^{dec},...,h_{m}^{dec}$

$p(y_{t} \mid y_{<t}, z_{1},...,z_{n}) = \mathrm{softmax}(E^{out} \cdot h_{t}^{dec})$

$\alpha_{t} = \mathrm{softmax}(z \cdot W_{a} \cdot h_{t}^{dec})$

$c_{t} = \sum_{i}^{n} \alpha_{t_{i}} \cdot z_{i}$

$h_{t}^{dec} = \ldots$

$p(y_{t} \mid y_{<t}, z_{1},...,z_{n}) = \ldots$

$Q = W_{q} \cdot x$

$\alpha(Q, K) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right)$

$\mathrm{Attention}(Q, K, V) = \alpha(Q, K) \cdot V$
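The attention formula can be traced on small inputs. The following is a minimal pure-Python sketch, assuming the $\sqrt{d_{k}}$ scaling of the standard Transformer formulation and a row-wise softmax; the function names are illustrative, and a practical implementation would use a tensor library.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Scaled dot product of the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
print(attention([[1.0, 0.0]], keys, values))  # weights favor the first key
print(attention([[0.0, 0.0]], keys, values))  # equal scores give uniform weights
```

With identity values, each output row is exactly the attention weight vector, which makes the effect of the query on the weights easy to inspect.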

$h_{v}^{(k)} = f_{k}\left(h_{v}^{(k-1)}, \{h_{u}^{(k-1)} \mid u \in N_{v}\}; \theta_{k}\right)$

$h_{v}^{(k)} = \sigma\left(\sum_{u \in N_{v}} \frac{1}{c_{u,v}} W_{neighbor}^{(k)} h_{u}^{(k-1)} + W_{self}^{(k)} h_{v}^{(k-1)}\right)$
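One round of the graph-convolution update can be sketched in pure Python. Two choices below are assumptions that the formula leaves open: the normalization constant is taken as $c_{u,v} = |N_{v}|$, and $\sigma$ is taken to be ReLU. The function names and the toy graph are illustrative only.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gcn_layer(
    h: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    W_neighbor: Matrix,
    W_self: Matrix,
) -> Dict[str, Vector]:
    """Compute h_v^(k) from h^(k-1) for every node v."""
    updated = {}
    for v, h_v in h.items():
        total = matvec(W_self, h_v)               # W_self * h_v
        nbrs = neighbors.get(v, [])
        for u in nbrs:
            message = matvec(W_neighbor, h[u])    # W_neighbor * h_u
            c_uv = len(nbrs)                      # assumed normalization |N_v|
            total = [t + m / c_uv for t, m in zip(total, message)]
        updated[v] = [max(0.0, t) for t in total]  # sigma = ReLU (assumption)
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
states = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
edges = {"a": ["b", "c"], "b": ["a"], "c": []}
print(gcn_layer(states, edges, identity, identity))
```

Stacking several such layers lets information travel along longer paths in the graph, one hop per layer.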

$[\![P]\!] = \{instructions(p) \mid p \in simplePaths(Entry, Sink)\}$.

$encode\_callsite(w_{1}...w_{k_{s}}, value_{1},...,value_{k_{args}}) = \left[\left(\sum_{i}^{k_{s}} E_{w_{i}}^{names}\right); E_{value_{1}}^{values}; ...; E_{value_{k_{args}}}^{values}\right]$
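The call-site encoding sums the embeddings of the callee name's subtokens and concatenates the embeddings of the argument values. The sketch below illustrates this with tiny hand-written embedding tables; the table contents and the value labels `ARG` and `CONST` are hypothetical placeholders, whereas a trained model would learn $E^{names}$ and $E^{values}$.

```python
from typing import Dict, List

Vector = List[float]


def encode_callsite(
    name_subtokens: List[str],
    values: List[str],
    E_names: Dict[str, Vector],
    E_values: Dict[str, Vector],
) -> Vector:
    """Return [sum of name-subtoken embeddings; value embeddings...]."""
    dim = len(next(iter(E_names.values())))
    name_vector = [0.0] * dim
    for subtoken in name_subtokens:
        name_vector = [a + b for a, b in zip(name_vector, E_names[subtoken])]
    encoded = list(name_vector)
    for value in values:
        encoded.extend(E_values[value])   # concatenation, in argument order
    return encoded


# Toy embedding tables (hypothetical numbers, for illustration only).
E_names = {"open": [1.0, 0.0], "file": [0.0, 2.0]}
E_values = {"ARG": [0.5], "CONST": [0.25]}

print(encode_callsite(["open", "file"], ["ARG", "CONST"], E_names, E_values))
```

Summing the subtoken embeddings keeps the name part at a fixed width regardless of how many subtokens the callee name has, while the concatenated value slots preserve argument order.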

$h_{1},...,h_{l} = LSTM_{encoder}(callsite_{1},...,callsite_{l})$

$z = [h_{l}^{\to}; h_{l}^{\leftarrow}]$

$z = Transformer_{encoder}(callsite_{1},...,callsite_{l})$


Code & Models

Repositories

tech-srl/Nero (official implementation, TensorFlow)


Full text

Acronyms. ABI: application binary interface; API: application program interface; AST: abstract syntax tree; AUC: area under (the) curve; BB: basic block; CROC: concentrated ROC; CFG: control-flow graph; CU: compilation unit; CVE: common vulnerabilities (and) exposures; CPU: central processing unit; DAG: directed acyclic graph; DNN: deep neural network; ELF: executable and linkable format; GOT: global offset table; GNN: graph neural network; GCN: graph convolutional network; FP: false positive; FN: false negative; FOL: first order logic; HTTP: hypertext transfer protocol; IL: intermediate language; IOT: internet of things; IP: intellectual property; IR: intermediate representation; ISA: instruction set architecture; IVL: intermediate verification language; LCS: longest common subsequence; LSTM: long short-term memory network; ML: machine language; NLP: natural language processing; NMT: neural machine translation; OS: operating system; OOV: out-of-vocabulary; PDG: program dependence graph; PIC: position independent code; TP: true positive; TN: true negative; RE: reverse engineering; ROC: receiver operating characteristic; RNN: recurrent neural network; SSL: secure sockets layer; SSA: static single assignment; seq2seq: sequence-to-sequence.

Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

Yaniv David (Technion, Israel), Uri Alon (Technion, Israel), and Eran Yahav (Technion, Israel)

(2020)

Abstract.

We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations.

We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures.

Our evaluation shows that our models produce predictions that are difficult and time consuming for humans, while improving on existing methods by $28\%$ and by $100\%$ over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.

Published in PACMPL, volume 4, number OOPSLA, November 2020. DOI: 10.1145/3428230

1. Introduction

Reverse engineering (RE) of executables has a variety of applications such as improving and debugging legacy programs. Furthermore, it is crucial to analyzing malware. Unfortunately, it is a hard skill to learn, and it takes years to master. Even experienced professionals often have to invest long hours to obtain meaningful results. The main challenge is to understand how the different “working parts” inside the executable are meant to interact to carry out the objective of the executable. A human reverse-engineer has to guess, based on experience, the more interesting procedures to begin with, follow the flow in these procedures, use inter-procedural patterns and finally, piece all these together to develop a global understanding of the purpose and usage of the inspected executable.

Despite great progress on disassemblers [IDAPRO; RADAR], static analysis frameworks (Katz et al., 2018; Lee et al., 2011) and similarity detectors (David et al., 2017; Pewny et al., 2015), for the most part, the reverse engineering process remains manual.

Reviewing source code containing meaningful names for procedures can reduce human effort dramatically, since it saves the time and effort of looking at some procedure bodies (Alon et al., 2019c; Høst and Østvold, 2009; Fowler and Beck, 1999; Jacobson et al., 2011). Binary executables are usually stripped, i.e., the debug information containing procedure names is removed entirely.

As a result of executable stripping, a major part of a reverse engineer’s work is to manually label procedures after studying them. Votipka et al. (2020) detail this process in a user study of reverse engineers and depict their reliance on internal and external procedure names throughout their study.

In recent years, great strides have been made in the analysis of source code using learned models, from automatic inference of variables and types (Raychev et al., 2015; Bielik et al., 2016; Alon et al., 2018; Bavishi et al., 2018; Allamanis et al., 2018) to bug detection (Pradel and Sen, 2018; Rice et al., 2017), code summarization (Allamanis et al., 2016; Alon et al., 2019c, a), code retrieval (Sachdev et al., 2018; Allamanis et al., 2015b) and even code generation (Murali et al., 2017; Brockschmidt et al., 2019; Lu et al., 2017; Alon et al., 2019b). However, all of these address high-level and syntactically rich programming languages such as Java, C# and Python. None of them addresses the unique challenges present in executables.

Problem definition. Given a nameless assembly procedure $\mathcal{X}$ residing in a stripped executable (one containing no debug information), our goal is to predict a likely and descriptive name $\mathcal{Y}=y_{1},...,y_{m}$, where $y_{1},...,y_{m}$ are the subtokens composing $\mathcal{Y}$. Thus, our goal is to model $P\left(\mathcal{Y}\mid\mathcal{X}\right)$. For example, for the name $\mathcal{Y}=$ create_server_socket, the subtokens $y_{1},...,y_{m}$ that we aim to predict are create, server and socket, respectively.
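To make the target representation concrete, the sketch below splits a procedure name into subtokens and joins a predicted subtoken sequence back into a name. Splitting on underscores only is an assumption made for this example, and the helper names `name_to_subtokens` and `subtokens_to_name` are hypothetical.

```python
from typing import List


def name_to_subtokens(name: str) -> List[str]:
    """Split a procedure name into lowercase subtokens on underscores."""
    return [part.lower() for part in name.split("_") if part]


def subtokens_to_name(subtokens: List[str]) -> str:
    """Join predicted subtokens y_1, ..., y_m back into a single name."""
    return "_".join(subtokens)


target = name_to_subtokens("create_server_socket")
print(target)                     # ['create', 'server', 'socket']
print(subtokens_to_name(target))  # create_server_socket
```

Predicting a sequence of subtokens instead of one atomic label lets a model compose names it has never seen as a whole, as long as the individual subtokens are in its vocabulary.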

The problem of predicting a meaningful name for a given procedure can be viewed as a translation task: translating from assembly code to natural language. While this high-level view of the problem is shared with previous work (e.g., Allamanis et al., 2016; Alon et al., 2019c, a), the technical challenges are vastly different due to the different characteristics of binaries.

Bibliography

Alammar ([n. d.]). Jay Alammar. [n. d.]. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Allamanis (2018). Miltiadis Allamanis. 2018. The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv preprint arXiv:1812.06469 (2018).
Allamanis et al. (2015a). Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015a. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 38–49. https://doi.org/10.1145/2786805.2786849
Allamanis et al. (2018). Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In ICLR.
Allamanis et al. (2016). Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2091–2100. http://jmlr.org/proceedings/papers/v48/allamanis16.html
Allamanis et al. (2015b). Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015b. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 2123–2132. http://dl.acm.org/citation.cfm?id=3045118.3045344
Alon et al. (2019a). Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019a. code2seq: Generating Sequences from Structured Representations of Code. In International Conference on Learning Representations. https://openreview.net/forum?id=H1gKYo09tX