Precise Null Pointer Analysis Through Global Value Numbering

Ankush Das; Akash Lal

arXiv:1702.05807·cs.PL·February 21, 2017

Precise Null Pointer Analysis Through Global Value Numbering

Ankush Das, Akash Lal

PDF

Open Access 1 Repo

TL;DR

This paper introduces a semantics-preserving program transformation based on Global Value Numbering that significantly enhances the precision of null pointer alias analysis with minimal overhead.

Contribution

It presents a novel program transformation technique that improves null pointer alias analysis precision using Global Value Numbering, without sacrificing scalability.

Findings

01

Precision of null pointer analysis improved from 86.56% to 98.05%.

02

Transformation incurs insignificant overhead.

03

Applicable to real-world code for better static analysis accuracy.

Abstract

Precise analysis of pointer information plays an important role in many static analysis techniques and tools today. The precision, however, must be balanced against the scalability of the analysis. This paper focusses on improving the precision of standard context and flow insensitive alias analysis algorithms at a low scalability cost. In particular, we present a semantics-preserving program transformation that drastically improves the precision of existing analyses when deciding if a pointer can alias NULL. Our program transformation is based on Global Value Numbering, a scheme inspired from compiler optimizations literature. It allows even a flow-insensitive analysis to make use of branch conditions such as checking if a pointer is NULL and gain precision. We perform experiments on real-world code to measure the overhead in performing the transformation and the improvement in the…

Tables1

Table 1. Table 1: Results showing the effect of SSA and GVN program transformations on the ability of alias analysis to prove safety of non-null assertions.

	Stats			SSA only		SSA with GVN
Bench	Procs	KLOC	Asserts	Time(s)	Asserts	Time(s)	GVN(s)	Asserts
Mod 1	40	3.2	1741	9.08	61	11.37	0.88	17
Mod 2	37	8.4	4035	11.34	233	17.62	1.13	45
Mod 3	64	6.5	4375	10.26	617	19.43	2.15	52
Mod 4	184	20.9	7523	24.04	543	33.99	2.43	123
Mod 5	284	30.9	11184	35.02	1881	59.84	7.11	232
Mod 6	382	37.8	12128	35.94	2675	70.71	11.13	452
Mod 7	453	37.2	6840	36.88	1396	53.24	3.44	127
Mod 8	400	43.8	12209	28.91	2854	62.27	5.38	475
Mod 9	479	56.6	19030	60.05	5444	106.61	12.40	508
Mod 10	998	76.5	39955	171.43	2887	839.58	475.08	372
Mod 11	867	23.5	6966	49.17	875	69.10	10.14	103
Mod 12	303	14.9	8359	24.57	820	59.13	13.41	210
Mod 13	419	22.1	11471	38.27	869	87.07	24.03	248
Mod 14	493	36.2	18026	48.56	2501	149.60	41.93	478
Mod 15	317	19.4	20555	55.07	586	269.35	134.06	131
Mod 16	809	54.0	16957	62.86	2821	127.67	30.46	342
Total	6529	491.9	201354	701.45	27063	2036.58	775.16	3915

Equations2

\begin{array}[]{rcl}ComputeHash(\mathsf{v})\text{ at }S&=&ComputeHash(\mathsf{expr})\text{ at }L\\ ComputeHash(\mathsf{v})\text{ at }L&=&ComputeHash(\mathsf{expr})\text{ at }L\\ \end{array}

\begin{array}[]{rcl}ComputeHash(\mathsf{v})\text{ at }S&=&ComputeHash(\mathsf{expr})\text{ at }L\\ ComputeHash(\mathsf{v})\text{ at }L&=&ComputeHash(\mathsf{expr})\text{ at }L\\ \end{array}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

boogie-org/corral/tree/master/source/AliasAnalysis
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsParallel Computing and Optimization Techniques · Security and Verification in Computing · Logic, programming, and type systems

Full text

Precise Null Pointer Analysis Through

Global Value Numbering

Ankush Das Carnegie Mellon University

Pittsburgh, PA, USA [email protected] Microsoft Research

Bangalore, India

Akash Lal Microsoft Research

Bangalore, India [email protected]

Abstract

Precise analysis of pointer information plays an important role in many static analysis techniques and tools today. The precision, however, must be balanced against the scalability of the analysis. This paper focusses on improving the precision of standard context and flow insensitive alias analysis algorithms at a low scalability cost. In particular, we present a semantics-preserving program transformation that drastically improves the precision of existing analyses when deciding if a pointer can alias Null. Our program transformation is based on Global Value Numbering, a scheme inspired from compiler optimizations literature. It allows even a flow-insensitive analysis to make use of branch conditions such as checking if a pointer is Null and gain precision. We perform experiments on real-world code to measure the overhead in performing the transformation and the improvement in the precision of the analysis. We show that the precision improves from 86.56% to 98.05%, while the overhead is insignificant.

Keywords: Alias Analysis, Global Value Numbering, Static Single Assignment

1 Introduction

Null-pointer exceptions directly affect software reliability because such exceptions can bring down the application. Detecting and eliminating these bugs is an important step towards developing reliable systems. Static analysis tools that look for null-pointer exceptions typically employ techniques based on alias analysis to detect possible aliasing between pointers. Two pointer-valued variables are said to alias if they hold the same memory location during runtime. Aliasing can be decided in two ways: (a) may-alias [2], where two pointers are said to may-alias if they can point to the same memory location under some possible execution, and (b) must-alias [29], where two pointers must-alias if they always point to the same memory location under all possible executions. Because a precise alias analysis is undecidable [25] and even a flow-insensitive pointer analysis is NP-hard [15], much of the research in the area plays on the precision-efficiency trade-off of alias analysis. For example, practical algorithms for may-alias analysis lose precision (but retain soundness) by over-approximating: a verdict that two pointer may-alias does not imply that there is some execution in which they actually hold the same value. Whereas, a verdict that two pointers cannot may-alias must imply that there is no execution in which they hold the same value.

We use a sound may-alias analysis in an attempt to prove the safety of a program with respect to null-pointer exceptions. For each pointer dereference, we ask the analysis if the pointer can may-alias Null just before the dereference. If the answer is that it cannot, then the pointer cannot hold a Null value under all possible executions, hence the dereference is safe. The more precise the analysis, the more dereferences it can prove safe. This paper demonstrates a technique that improves the precision of may-alias analysis at little cost when answering aliasing queries of pointers with the Null value.

The Null value is special because programmers tend to be defensive against null-pointer exceptions. If there is doubt that a pointer, say $\mathsf{x}$ , can be Null or not, the programmer would use a check “ $\mathsf{if\;(x\neq\textsc{Null})}$ ” before dereferencing $\mathsf{x}$ . Existing alias analysis techniques, especially flow insensitive techniques for may-alias analysis, ignore all branch conditions. As we demonstrate in this paper, exploiting these defensive checks can significantly increase the precision of alias analysis. Our technique is based around a semantics-preserving program transformation and requires only a minor change to the alias analysis algorithm itself.

Program transformations have been used previously to improve the precision for alias analysis. For instance, it is common to use a Single Static Assignment (SSA) conversion [7] before running flow-insensitive analyses. The use of SSA automatically adds some level of flow sensitivity to the analysis [13]. SSA, while useful, is still limited in the amount of precision that it adds, and in particular, it does not help with using branch conditions. We present a program transformation based on Global Value Numbering (GVN) [17] that adds significantly more precision on top of SSA by leveraging branch conditions.

The transformation works by first inserting an assignment $\mathsf{v:=e}$ on the then branch of a check $\mathsf{if\;(e\neq\textsc{Null})}$ , where $\mathsf{v}$ is a fresh program variable. This gives us the global invariant that $\mathsf{v}$ can never hold the Null value. However, this invariant will be of no use unless the program uses $\mathsf{v}$ . Our transformation then searches locally, in the same procedure, for program expressions $\mathsf{e^{\prime}}$ that are equivalent to $\mathsf{e}$ , that is, at runtime they both hold the same value. The transformation then replaces the use of $\mathsf{e^{\prime}}$ with $\mathsf{v}$ . The search for equivalent expressions is done by adapting the GVN algorithm (originally designed for compiler optimizations [11]).

Our transformation can be followed with a standard alias analysis to infer the points-to set for each variable, with a slight change that the new variables introduced by our transformation (such as $\mathsf{v}$ above) cannot be Null. This change stops spurious propagation of Null and makes the analysis more precise. We perform extensive experiments on real-world code. The results show that the precision of the alias analysis (measured in terms of the number of pointer dereferences proved safe) goes from 86.56% to 98.05%. This work is used inside Microsoft’s Static Driver Verifier tool [23] for finding null-pointer bugs in device drivers111https://msdn.microsoft.com/en-us/library/windows/hardware/mt779102(v=vs.85).aspx.

The rest of the paper is organized as follows: Section 2 provides background on flow-insensitive alias analysis and how SSA can improve its precision. Section 3 illustrates our program transformation via an example and Section 4 presents it formally. Section 5 presents experimental results, Section 6 describes some of the related work in the area and Section 7 concludes. Finally, Appendix A proves that the transformation preserves program semantics.

2 Background

2.1 Programming Language

We introduce a simplistic language to demonstrate the alias analysis and how program transformations can be used to increase its precision. As is standard, we concern ourselves only with statements that manipulate pointers. All other statements are ignored (i.e., abstracted away) by the alias analysis. Our language has assignments with one of the following forms: pointer assignments $\mathsf{x:=y}$ , dereferencing via field writes $\mathsf{x.f:=y}$ and reads $\mathsf{x:=y.f}$ , creating new memory locations $\mathsf{x:=new()}$ , or assigning Null as $\mathsf{x:=\textsc{Null}}$ . The language also has $\mathsf{assume}$ and $\mathsf{assert}$ statements:

•

$\mathsf{assume\;B}$ checks the Boolean condition $\mathsf{B}$ and continues execution only if the condition evaluates to true. The $\mathsf{assume}$ statement is a convenient way of modeling branching in most existing source languages. For instance, a branch “ $\mathsf{if\;(B)}$ ” can be modeled using two basic blocks, one beginning with $\mathsf{assume\;B}$ and the other with $\mathsf{assume\;\neg B}$ .

•

$\mathsf{assert\;B}$ checks the Boolean condition $\mathsf{B}$ and continues execution if it holds. If $\mathsf{B}$ does not hold, then it raises an assertion failure and stops program execution.

A program in our language begins with global variable declarations followed by one or more procedures. Each procedure starts with declarations of local variables, followed by a sequence of basic blocks. Each basic block starts with a label, followed by a list of statements, and ends with a control transfer, which is either a $\mathsf{goto}$ or a $\mathsf{return}$ . A $\mathsf{goto}$ in our language can take multiple labels. The choice between which label to jump is non-deterministic. Finally, we disallow loops in the control-flow of a procedure; they can instead be encoded using procedures with recursion. This restriction simplifies the presentation of our algorithms. Figure 1 shows an illustrative example in our language.

2.2 Alias Analysis

This section describes Andersen’s may-alias analysis [2]. The analysis is context and flow-insensitive, which means that it completely abstracts away control of the program. But the analysis is field-sensitive, which means that a value can be obtained by reading a field $\mathsf{f}$ only if it was previously written to the same field $\mathsf{f}$ . Field-insensitive analyses, for example, also abstract away the field name.

The analysis outputs an over-approximation of the set of memory locations each pointer can hold under all possible executions. Since a program can potentially execute indefinitely (because of loops or recursion), the number of memory locations allocated by a program can be unbounded. We consider a finite abstraction of memory locations, commonly called the allocation-site abstraction [16]. Each memory location allocated by the same $\mathsf{new}$ statement is represented using the same abstract value. An abstract value is also called an allocation site. We label each $\mathsf{new}$ statement with a unique number $i$ and refer to its corresponding allocation site as $aS_{\_}i$ . We use the special allocation site $aS_{0}$ to denote Null.

We follow a description of Andersen’s analysis in terms of set constraints [27], shown in Figure 2. The analysis outputs a points-to relation $pt$ where $pt(\mathsf{x})$ represents the points-to set of a variable $\mathsf{x}$ , i.e. (an over-approximation of) the set of allocation sites that $\mathsf{x}$ may hold under all possible executions. In addition, it also computes $pt(aS_{\_}i\mathsf{.f})$ , for each allocation site $aS_{\_}i$ and field $\mathsf{f}$ , representing (an over-approximation of) the set of values written to the $\mathsf{f}$ field of an object represented by $aS_{\_}i$ .

The analysis abstracts away program control along with $\mathsf{assert}$ and $\mathsf{assume}$ statements. It essentially considers a program as a bag of pointer-manipulating statements where any statement can be executed any number of times and in any order. For each statement, the analysis follows Figure 2 to generate a set of rules that define constraints on the points-to solution $pt$ . The rules can be read as follows.

•

If a program has an allocation $\mathsf{x:=new()}$ and this statement is labelled with the unique integer $i$ , then the solution must have $aS_{\_}i\in pt(\mathsf{x})$ .

•

If a program has the statement $\mathsf{x:=NULL}$ , then it must be that $aS_{\_}0\in pt(\mathsf{x})$ .

•

If the program has an assignment $\mathsf{x:=y}$ then the solution must have $pt(\mathsf{y})\subseteq pt(\mathsf{x})$ , because $\mathsf{x}$ may hold any value that $\mathsf{y}$ can hold.

•

If the program has a statement $\mathsf{x:=y.f}$ and $aS_{\_}i\in pt(\mathsf{y})$ , then it must be that $pt(aS_{\_}i\mathsf{.f})\subseteq pt(\mathsf{x})$ because $\mathsf{x}$ may hold any value written to the $\mathsf{f}$ field of $aS_{\_}i$ .

•

If the program has a statement $\mathsf{x.f:=y}$ and $aS_{\_}i\in pt(\mathsf{x})$ then it must be that $pt(\mathsf{y})\subseteq pt(aS_{\_}i\mathsf{.f})$ .

These set constraints can be solved using a simple fix-point iteration, shown in Algorithm 1. (Our tool uses a more efficient implementation [27].) For now, ignore the loop on line 17. Once the solution is computed, we check all assertions in the program. We say that an assertion $\mathsf{assert\;(x\neq\textsc{Null})}$ is safe (i.e., the assertion cannot be violated) if $aS_{0}\not\in pt(\mathsf{x})$ . We do not consider other kinds of assertions in the program because our goal is just to show null-exception safety. Andersen’s analysis complexity is cubic in the size of the program, and for k-sparse programs, the worst case complexity is quadratic [28].

2.3 Static Single Assignment (SSA)

This section shows how a program transformation can improve the precision of an alias analysis. Consider the following program.

⬇

y $:$$=$ x.f;

x $:$$=$ Null;

A flow-insensitive analysis does not look at the order of statements. Under this abstraction, the analysis cannot prove the safety of the assertion in the above snippet of code because it does not know that the assignment of Null to $\mathsf{x}$ only happens after the assertion.

To avoid such loss in precision, most practical implementations of alias analysis use the Single Static Assignment (SSA) form [7]. Roughly, SSA introduces multiple copies of each original variable such that each variable in the new program only has a single assignment. The following is the SSA form of the snippet shown in the beginning of this section.

⬇

y $:$$=$ x1.f;

x2 $:$$=$ Null;

Clearly, this program has the same semantics as the original program. But a flow-insensitive analysis will now be able to show the safety of the assertion in the program because the assignment of Null is to $\mathsf{x2}$ whereas the assertion is on $\mathsf{x1}$ .

3 Overview

This section presents an overview of our technique of using stronger program transformations that add even more precision to the alias analysis compared to the standard SSA. We start by using Common Subexpression Elimination [6] and build towards using Global Value Numbering [17], which is used in our implementation and experiments.

3.1 Common Subexpression Elimination

We demonstrate how we can leverage $\mathsf{assume}$ and $\mathsf{assert}$ statements to add precision to the analysis. Consider the following program.

⬇

assert (x $\neq$ Null);

z $:$$=$ x.f;

Once the program control passes the $\mathsf{assume}$ statement, we know that $\mathsf{x}$ cannot point to Null, hence the assertion is safe, irrespective of what preceded this code snippet. Note that SSA renaming does not help prove the assertion in this case (it is essentially a no-op for the above snippet). We now make the case for a different program transformation.

As a first step, we introduce a new local variable $\mathsf{cseTmp^{\#}}$ to the procedure and assign it the value of $\mathsf{x}$ right after the $\mathsf{assume}$ . These new variables that we introduce to the program will carry the tag “ $\#$ ” to distinguish them from other program variables. For a tagged variable $\mathsf{w^{\#}}$ , we say that $\textit{tagged}(\mathsf{w}^{\#})$ is true. These tagged variables carry the special invariant that they cannot be Null; their only assignment will be after an assume statement that prunes away the Null value.

After introducing the variable $\mathsf{cseTmp^{\#}}$ , we make use of a technique similar to Common Subexpression Elimination (CSE) to replace all expressions that are equivalent to $\mathsf{cseTmp^{\#}}$ with the variable itself, resulting in the following program:

⬇

y $:$$=$ cseTmp#;

assert (cseTmp# $\neq$ Null);

z $:$$=$ cseTmp#.g;

This snippet is clearly equivalent to the original one. We perform the alias analysis on this snippet as usual, but enforce that $pt(\mathsf{cseTmp^{\#}})$ cannot have $aS_{0}$ because it cannot be Null. (See the loop on line 17 of Algorithm 1.) The analysis can now prove that the assertion is safe.

The process of finding equivalent expressions is not trivial. For instance, consider the following program where we have introduced the variable $\mathsf{cseTmp^{\#}}$ .

⬇

y.f $:$$=$ z;

z $:$$=$ x.f;

In the last assignment, $\mathsf{x.f}$ cannot be substituted by $\mathsf{cseTmp^{\#}}$ , because there is an assignment to the field $\mathsf{f}$ in the previous statement. As there is no aliasing information present at this point, we have to conservatively assume that $\mathsf{y}$ and $\mathsf{x}$ could be aliases, thus, the assignment $\mathsf{y.f:=z}$ can potentially change the value of $\mathsf{x.f}$ , breaking its equivalence to $\mathsf{cseTmp^{\#}}$ .

3.2 Global Value Numbering

We improve upon the previous transformation by using a stronger method of determining expression equalities. The methodology remains the same: we introduce temporary variables that cannot be Null and use them to replace syntactically equivalent expressions. But this time we adapt the Global Value Numbering (GVN) scheme to detect equivalent expressions. Consider the following program. (For now, ignore the right-hand side of the figure after the $\Longrightarrow$ .)

⬇

1y $:$$=$ x.f.g; $\Longrightarrow$ $t_{1}\leftarrow\mathsf{x},\;t_{2}\leftarrow t_{1}\mathsf{.f},\;t_{3}\leftarrow t_{2}\mathsf{.g},\;\mathsf{y}\leftarrow t_{3}$

2z $:$$=$ y.h; $\Longrightarrow$ $t_{3}\leftarrow\mathsf{y},\;t_{4}\leftarrow t_{3}\mathsf{.h},\;\mathsf{z}\leftarrow t_{4}$

3assume (z $\neq$ Null); $\Longrightarrow$ set $t_{4}\in\{\textit{nonNullExprs}\}$

4a $:$$=$ x.f; $\Longrightarrow$ $t_{1}\leftarrow\mathsf{x},\;t_{2}\leftarrow t_{1}\mathsf{.f},\;\mathsf{a}\leftarrow t_{2}$

5b $:$$=$ a.g.h; $\Longrightarrow$ $t_{2}\leftarrow\mathsf{a},\;t_{3}\leftarrow t_{2}\mathsf{.g},\;t_{4}\leftarrow t_{3}\mathsf{.h},\;\mathsf{b}\leftarrow t_{4}$

6assert (b $\neq$ Null); $\Longrightarrow$ check $t_{4}\in\{\textit{nonNullExprs}\}$

7c.g $:$$=$ d;

It is clear that $\mathsf{z}$ and $\mathsf{b}$ are equivalent at the assertion location, and since $\mathsf{z\neq\textsc{Null}}$ , the assertion is safe. However, none of the previous methods would allow us to prove the safety of the assertion. We adapt the GVN scheme to help us establish the equality between $\mathsf{z}$ and $\mathsf{b}$ . We introduce the concept of terms that will be used as a placeholder for subexpressions. The intuitive idea is that equivalent subexpressions will be represented using the same term.

We start by giving an overview of the transformation for a single basic block, and then generalize it to full procedure later in this section. For a single basic block, we walk through the statements in order and as we encounter a new variable, we assign it a new term and remember this mapping in a dictionary called $hashValue$ . We also store the mapping from terms to other terms through operators in a separate dictionary called hashFunction. For example, if $\mathsf{x}$ is assigned term $t_{\_}1$ , and we encounter the assignment $\mathsf{y:=x.f}$ , we store $\textit{hashFunction}[\mathsf{f}][t_{\_}1]=t_{\_}2$ and assign the term $t_{\_}2$ to $\mathsf{y}$ . We also maintain a separate list nonNullExprs of terms that are not null. Finally, for performing the actual substitution, we maintain a dictionary defaultVar that maps terms to the temporary variables that we introduce for non-null expressions.

We go through the program snippet starting at the first statement and move down to the last statement. At statement $i$ , we follow the description written in the $i^{\text{th}}$ item below. This description is also shown on the right side of the program snippet, after the $\Longrightarrow$ arrow.

Assign a new term $t_{\_}1$ to $\mathsf{x}$ , and set $hashValue[\mathsf{x}]=t_{\_}1$ . Then, set $\textit{hashFunction}[\mathsf{f}][t_{\_}1]=t_{\_}2$ , and $\textit{hashFunction}[\mathsf{g}][t_{\_}2]=t_{\_}3$ . Finally the assignment to $\mathsf{y}$ adds $hashValue[\mathsf{y}]=t_{\_}3$ . 2. 2.

We already have $hashValue[\mathsf{y}]=t_{\_}3$ , so assign $\textit{hashFunction}[\mathsf{g}][t_{\_}3]=t_{\_}4$ . The assignment adds $hashValue[\mathsf{z}]=t_{\_}4$ . 3. 3.

We have $hashValue[\mathsf{z}]=t_{\_}4$ . So, we add $t_{\_}4$ to nonNullExprs. We create a new temporary variable $\mathsf{gvnTmp}^{\#}$ , and construct an extra assignment $\mathsf{gvnTmp^{\#}:=z}$ , and add it after the $\mathsf{assume}$ statement. Because $hashValue[\mathsf{z}]=t_{\_}4$ , we also add $\textit{defaultVar}[t_{\_}4]=\mathsf{gvnTmp^{\#}}$ , which we will use later for substitutions to all expressions that hash to $t_{\_}4$ . 4. 4.

We already have $hashValue[\mathsf{x}]=t_{\_}1$ and $\textit{hashFunction}[\mathsf{f}][t_{\_}1]=t_{\_}2$ , so we add $hashValue[\mathsf{a}]=t_{\_}2$ . 5. 5.

We have $hashValue[\mathsf{a}]=t_{\_}2$ , $\textit{hashFunction}[\mathsf{g}][t_{\_}2]=t_{\_}3$ and $\textit{hashFunction}[\mathsf{h}][t_{\_}3]=t_{\_}4$ . Hence, the hash value of the expression $\mathsf{a.g.h}$ is $t_{\_}4$ . We also have $\textit{defaultVar}[t_{\_}4]=\mathsf{gvnTmp^{\#}}$ . At this point, we observe $t_{\_}4$ being in nonNullExprs and substitute the RHS $\mathsf{a.g.h}$ with $\mathsf{gvnTmp^{\#}}$ . Finally, we add $hashValue[\mathsf{b}]=t_{\_}4$ . 6. 6.

Because $hashValue[\mathsf{b}]=t_{\_}4$ and $\textit{defaultVar}[t_{\_}4]=\mathsf{gvnTmp^{\#}}$ and nonNullExprs contains $t_{\_}4$ , we replace the expression $\mathsf{b}$ with $\mathsf{gvnTmp^{\#}}$ .

The resulting code is shown below.

⬇

1y $:$$=$ x.f.g;

2z $:$$=$ y.h;

3assume (z $\neq$ Null);

4gvnTmp# $:$$=$ z;

5a $:$$=$ x.f;

6b $:$$=$ gvnTmp#;

7assert (gvnTmp# $\neq$ Null);

8c.g $:$$=$ d;

Clearly, we retain the invariant that $\#$ -tagged variables cannot be Null, and it is now straightforward to prove the safety of the assertion. We also note that the expression substitution is performed in a conservative manner. It is aborted as soon as a subexpression is assigned to. For example, at line 8, we encounter an assignment to the field $\mathsf{g}$ , so we remove $\mathsf{g}$ from the dictionary hashFunction. This has the effect of $\mathsf{g}$ acting as a new field, and all terms referenced by this field will now be assigned new terms.

The above transformation, in general, is performed on the entire procedure, not just a basic block to fully exploit its potential. This occurs in two steps. First, loops are lifted and converted to procedures (with recursion), so that the control-flow of each resulting procedure is acyclic. Next, we perform a topological sort [3] of the basic blocks of a procedure and analyze the blocks in this order. This ensures that by the time the algorithm visits a basic block, it has already processed all predecessors of the block.

When analyzing a block, the algorithm considers all its predecessors and takes the intersection of their nonNullExprs list and $hashValue$ map. This is because an expression is non-null only if it is non-null in all its predecessors and, further, we can use a term for a variable only if it is associated with the same term in all its predecessors. Finally, an important aspect of the algorithm is to perform a sound substitution at the merge point of two basic blocks. Consider the code snippet below.

⬇

gvnTmp ${}^{\#}_{1}$ $:$$=$ x;

goto L3;

L2 $\colon$

assume (x $\neq$ Null);

gvnTmp ${}^{\#}_{2}$ $:$$=$ x;

goto L3;

L3 $\colon$

assert (x $\neq$ Null);

In this example, although $\mathsf{x}$ is available as a non-null expression in $\mathsf{L3}$ , we cannot substitute $\mathsf{x}$ in the assertion by either $\mathsf{gvnTmp}^{\#}_{\_}1$ or $\mathsf{gvnTmp}^{\#}_{\_}2$ because neither preserves program semantics. Instead, we introduce a new variable $\mathsf{gvnTmp}^{\#}_{\_}3$ and add the assignment $\mathsf{gvnTmp}^{\#}_{\_}3:=\mathsf{x}$ right before the assertion in $\mathsf{L3}$ and use that for substituting $\mathsf{x}$ . This is achieved by the map $var2expr$ in the main algorithm. It maps the current block and the $\#$ -tagged variable to the expression it will substitute for. In the above program, let’s say we assign the term $t$ to the non-null expression $\mathsf{x}$ . Hence, $\textit{nonNullExprs}[\mathsf{L1}]$ and $\textit{nonNullExprs}[\mathsf{L2}]$ both contain $t$ . We also have $\textit{defaultVar}[\mathsf{L1}][t]=\mathsf{gvnTmp}^{\#}_{\_}1$ and $var2expr[\mathsf{gvnTmp}^{\#}_{\_}1]=\mathsf{x}$ . Since $t$ is available from all predecessors of $\mathsf{L3}$ , we know that this term is non-null in $\mathsf{L3}$ . The question is finding the expression corresponding to this term and introducing a new assignment for it. At this point, the map $var2expr$ comes into play. We pick a predecessor of $\mathsf{L3}$ , say $\mathsf{L1}$ . We look for the default variable of $t$ and find $\textit{defaultVar}[\mathsf{L1}][t]=\mathsf{gvnTmp}^{\#}_{\_}1$ , we then search for $var2expr[\mathsf{gvnTmp}^{\#}_{\_}1]=\mathsf{x}$ . At this point, we find that the expression corresponding to term $t$ is $\mathsf{x}$ , and we introduce a new assignment $\mathsf{gvnTmp}^{\#}_{\_}3:=\mathsf{x}$ at the start of $\mathsf{L3}$ and use this for substitution of $\mathsf{x}$ . With these motivating examples, the next section describes the algorithm formally.

4 Algorithm

We present the pseudocode of our program transformation in this section (Algorithms 2 and 3). The transformation takes a program as input and produces a semantically equivalent program with new $\#$ -tagged variables that can never be Null. This involves adding assignments for these new variables, and substituting existing expressions with these variables whenever we determine that the substitution will preserve semantics.

At a high level, the idea is to use $\mathsf{assume}$ and $\mathsf{assert}$ statements to identify non-null expressions. We introduce fresh $\#$ -tagged variables and assign these non-null expressions to them. Then, in a second pass, we compute a term corresponding to each expression. These terms are assigned in a manner that if two expressions have the same term, then they are equivalent to each other. If we encounter an expression $e$ with the same term as one of the non-null expressions $e^{\prime}$ , we substitute $e$ with the $\#$ -tagged variable corresponding to $e^{\prime}$ .

We start by describing the role of each data structure used in Algorithm 2.

•

nonNullExprs: For each block, this stores the terms of non-null expressions for that block.

•

$var2expr$ : This maps a $\#$ -tagged variable to the expression it is assigned to in each block. This will be used to solve the issue discussed in the last example of Section 3.2.

•

defaultVar: This maps the term corresponding to an expression to the $\#$ -tagged variable that will be used for its substitution. Whenever we compute the term for an expression, if the term is present in nonNullExprs, we will use defaultVar to find the $\#$ -tagged variable that is going to be used for the substitution.

•

$hashValue$ : It stores the term assigned to each variable in a particular block.

•

hashFunction: It stores the mapping from a field and a term to a new term. It is used to store the term for expressions with fields.

•

$currBlock$ : It keeps track of the current block and is used while calling the helper functions.

We now explain the algorithm step by step.

Lines 8 - 15 – In this first pass of the algorithm, we search for program statements of the form “ $\mathsf{assert\;expr\neq\textsc{Null}}$ ” or “ $\mathsf{assume\;expr\neq\textsc{Null}}$ ”. This guarantees that $\mathsf{expr}$ cannot be Null after this program location under all executions. Hence, we introduce a new variable $\mathsf{gvnTmp}^{\#}$ and assign $\mathsf{expr}$ to it. This mapping is also added to $var2expr$ . 2. 2.

Line 17 – Before doing the second pass, we perform a topological sort on the blocks according to the control flow graph. This is necessary since we need the information of nonNullExprs for the predecessors of a basic block before analyzing it. Note that conflow-flow graphs of procedures in our language must be acyclic (we convert loops to recursion), thus a topological sorting always succeeds. 3. 3.

Lines 18 - 27 – We compute the set of expressions that are non-null in all predecessors. Only these expressions will be non-null in the current block. We also need the term for each variable in the current block, which also comes from the intersection of terms from all predecessors. Finally, for all the non-null expressions, we add an assignment since these expressions may be available from different variables in different predecessors, as discussed in Section 3.2. 4. 4.

Lines 28 - 33 – Finally, we process each statement in the current block. This performs the substitution for each expression in the statement ( $GetExpr$ function in Algorithm 3). $GetExpr$ computes the term for the expression ( $ComputeHash$ function in Algorithm 3), and if the term is contained in nonNullExprs, the substitution is performed. Finally, if we encounter a store statement, “ $\mathsf{v.f:=expr}$ ”, we remove all mappings w.r.t. $\mathsf{f}$ in hashFunction. So, for the future statements (and future blocks in the topological order), new terms will be assigned to expressions related to field $\mathsf{f}$ .

With this pseudocode, we will generate a semantically equivalent program, and as we show in our experiments, will have improved precision with regard to alias analysis. The main reason behind this improvement is that these $\#$ -tagged variables can never contain $aS_{0}$ in the points-to set, hence $aS_{0}$ cannot flow through these variables in the analysis, while earlier, there was no such restriction and Null could flow freely. The pseudocode for the algorithm is demonstrated in Algorithms 2 and 3.

5 Experimental Evaluation

We have implemented the algorithms presented in this paper for the Boogie language [20]. Boogie is an intermediate verification language. Several front-ends are available that compile source languages (such as C/C++ [24, 18] and C# [4]) to Boogie, making it a useful target for developing practical tools.

Our work fits into a broader verification effort. The Angelic Verification (AV) project222https://www.microsoft.com/en-us/research/project/angelic-verification/ at Microsoft Research aims to design push-button technology for finding software defects. In an earlier effort, AV was targeted to find null-pointer bugs [8]. Programs from the Windows codebase, in C/C++, were compiled down to Boogie with assertions guarding every pointer access to check for null dereferences. These Boogie programs were fed to a verification pipeline that applied heavyweight SMT-solver technology to reason over all possible program behaviors. To optimize the verification time, an alias analysis is run at the Boogie level to remove assertions that can be proved safe by the analysis. As our results will show, this optimization is necessary. The alias analysis is based on Andersen’s analysis, as was described in Figure 2. We follow the algorithm given in Sridharan et al.’s report [27] with the extra constraint that $\#$ -tagged variables cannot alias with Null, i.e. they cannot contain the allocation site $aS_{0}$ . We can optionally perform the program transformation of Section 4 before running the alias analysis. Our implementation is available open-source333At https://github.com/boogie-org/corral, project AddOns $\backslash$ AliasAnalysis.

We evaluate the effect of our program transformation on the precision of alias analysis for checking safety of null-pointer assertions. The benchmarks are described in the first four columns of Table 1. We picked $16$ different modules from the Windows codebase. The table lists an anonymized name for the module (Bench), the number of procedures contained in the module (Procs), the lines of code in thousands (KLOC) and the number of assertions (one per pointer dereference) in the code (Asserts). It is worth noting that the first ten modules are the same as ones used in the study with AV [8], while the rest were added later.

We execute our tool in two modes. In the first, we use SSA and then run the alias analysis algorithm. In the second, we perform our GVN transformation on top of SSA and then run the alias analysis algorithm. In each case, we list the total time taken by the tool (Time(s)), including the time to run the transformation, and the number of assertions that were not proved safe (Asserts). In the case of GVN, we also isolate and list the time taken by the GVN transformation itself (GVN(s)).

The experiments were run (sequentially, single-threaded) on a server class machine with an Intel(R) Xeon(R) processor (single core) executing at 2.4 GHz with 32 GB RAM.

It is clear from the table that GVN offers significant increase in precision. With only the use of SSA, the analysis was able to prove the safety of $86.56\%$ of assertions, while with the GVN transformation, we can prune away $98.05\%$ of assertions. This is approximately a $7$ X reduction in the number of assertions that remain. This pruning is surprising because the alias analysis is still context and flow insensitive. Our program transformation crucially exploits the fact that programmers tend to be defensive against null-pointer bugs, allowing the analysis to get away with a very coarse abstraction. In fact, this level of pruning meant that any level of investment in making the analysis more sophisticated (e.g., flow sensitive) would have very diminished returns.

The alias analysis itself scales quite well: it finishes on about half a million lines of code in approximately $700$ seconds with just SSA ( $86.56\%$ pruning) or $2000$ seconds with GVN ( $98.05\%$ pruning). We note that there is an increase in the running time when using GVN. This happens because the transformation introduces more variables, compared to just SSA. However, this increase in time is more than offset by the improvement presented to the AV toolchain. For example, with the GVN transformation, AV takes $11$ hours to finish the first $10$ modules, whereas with the SSA transformation alone it does not finish even in $24$ hours. Furthermore, AV reports fewer bugs when using just SSA because the extra computational efforts translate to a loss in program coverage as timeouts are hit more frequently.

6 Related Work

Pointer analysis is a well-researched branch of static analysis. There are several techniques proposed that interplay between context, flow and field sensitivity. Our choice of using context-insensitive, flow-insensitive but field sensitive analysis is to pick a scalable starting point, after which we add precision at low cost. The distinguishing factor in our work is: $(1)$ the ability to leverage information from $\mathsf{assume}$ and $\mathsf{assert}$ statements (or branch conditions) and $(2)$ specializing for the purpose of checking non-null assertions (as opposed to general aliasing assertions). We very briefly list, in the rest of this section, some of the previous work in adding precision to alias analysis or making it more scalable.

Context Sensitivity.

Sharir and Pnueli [26] introduced the concept of call-strings to add context-sensitivity to static analysis techniques. Call strings may grow extremely long and limit efficiency, so Lhoták and Hendren [22] used k-limiting approaches to limit the size of call strings. Whaley and Lam [30] instead use Binary Decision Diagrams (BDDs) to scale a context sensitive analysis.

Flow sensitivity.

Hardekopf and Lin [12] present a staged flow-sensitive analysis where a less precise auxiliary pointer analysis computes def-use chains which is used to enable the sparsity of the primary flow-sensitive analysis. The technique is quite scalable on large benchmarks but they abstract away the assume statements. De and D’Souza [9] compute a map from access paths to sets of abstract objects at each program statement. This enables them to perform strong updates at indirect assignments. The technique is shown to be scalable only for small benchmarks, moreover, they also abstract away all assume statements. Finally, Lerch et al. [21] introduce the access-path abstraction, where access paths rooted at the same base variable are represented by this base variable at control flow merge points. The technique is quite expensive even on small benchmarks (less than 25 KLOC) and do not deal with assume statements in any way.

Other techniques.

Heintze and Tardieu [14] improved performance by using a demand-driven pointer analysis, computing sufficient information to only determine points-to set of query variables. Fink et al. [10] developed a staged verification system, where faster and naive techniques run as early stages to prune away assertions that are easier to prove, which then reduces the load on more precise but slow techniques that run later. Landi and Ryder [19] use conditional may alias information to over-approximate the points-to sets of each pointer. Context sensitivity is added using k-limiting approach, and a set of aliases is maintained for every statement within a procedure to achieve flow-sensitivity. Choi et al. [5] also follows [19] closely but uses sparse representations for the control flow graphs and use transfer functions instead of alias-generating rules. To the best of our knowledge, none of these techniques are able to leverage $\mathsf{assume}$ statements to improve precision.

7 Conclusion

This paper presents a program transformation that improves the efficiency of alias analysis with minor scalability overhead. The transformation is proved to be semantics preserving. Our evaluation demonstrates the merit of our approach on a practical end-to-end scenario of finding null-pointer dereferences in software.

Appendix A Proof of Correctness

We sketch the proof of the fact that our transformation preserves semantics. To substitute an expression, say $\mathsf{expr}$ with a variable, say $\mathsf{v}$ at a program location $L$ , we need to prove the following two conditions.

•

Assignment of $\mathsf{v}$ reaches $L$ along every execution path.

•

$\mathsf{expr}$ and $\mathsf{v}$ evaluate to the same value at $L$ under all possible executions.

A.1 First Condition

We begin by proving the first condition. Note that substitution only occurs in the function $GetExpr()$ , (line 18 in Algorithm 3) and only by variables present in the map $defaultVar$ . Also, only $\#$ -tagged variables are added to $defaultVar$ (line 33 in Algorithm 2). Hence, $\mathsf{v}$ is tagged with $\#$ . As is clear from the algorithm, such a variable is assigned either at line 13 or line 26 in Algorithm 2. Moreover, in both cases, this variable is generated afresh before constructing the assignment. Hence, $\mathsf{v}$ is assigned only once. Let this assignment location be $S$ . It suffices to show that $S$ dominates $L$ (Location $A$ dominates location $B$ is every path from the entry block to $B$ goes through $A$ ). We will prove this using strong induction on the blocks sorted in the topological order (making it a well founded set). For the sake of convenience, let us say that $S$ and $L$ are basic blocks. Hence, the statement that we will prove is the following.

Lemma 1.

$P(B)\rightarrow$ * $\mathsf{expr}$ can be substituted by $\mathsf{v}$ in block $B$ $\Rightarrow$ $S$ dominates $B$ .*

Proof.

Since $S$ dominates itself, $P(S)$ is trivially true. Now, consider $P(B)$ . When we arrive at block $B$ in the second pass, we have already processed all predecessors of $B$ since we process blocks in the topological order. Let $t=ComputeHash(\mathsf{expr})$ .

Lemma 2.

$\mathsf{expr}$ * can be substituted by $\mathsf{v}$ in block $B$ $\Rightarrow$ $nonNullExprs[B]$ contains $t$ .*

Proof.

Substitution occurs at line 18 of Algorithm 3, which can only be reached if line 17 holds. ∎

Now, for $nonNullExprs[B]$ to contain $t$ , $nonNullExprs[blk]$ should also contain $t$ for all blocks $blk\in B.Preds$ , i.e. all predecessors of $B$ . This follows from line 19 in Algorithm 2.

Lemma 3.

$nonNullExprs[B]$ * contains $t\Rightarrow S$ dominates $B$ .*

Proof.

We show Lemma 3 using strong induction on the blocks sorted in topological order. Clearly, Lemma 3 holds for $S$ , as $nonNullExprs[S]$ contains $t$ and $S$ dominates itself. Since $nonNullExprs[B]$ contains $t$ , we know, due to line 19 in Algorithm 2, that $nonNullExprs[blk]$ contains $t$ for all predecessors $blk$ of $B$ . Now, by induction hypothesis, since the lemma holds for all predecessors of $B$ , $S$ dominates all predecessors of $B$ . This implies that $S$ dominates $B$ . ∎

Lemmas 2 and 3 together imply Lemma 1, which is a reformulation of the first condition of the proof of correctness. ∎

A.2 Second Condition

Let us now prove the second condition.

Lemma 4.

If two expressions $\mathsf{e_{\_}1}$ and $\mathsf{e_{\_}2}$ at locations $L_{\_}1$ and $L_{\_}2$ respectively evaluate to the same term $t=ComputeHash(\mathsf{e_{\_}1})=ComputeHash(\mathsf{e_{\_}2})$ , then $\mathsf{e_{\_}1}$ at $L_{\_}1$ and $\mathsf{e_{\_}2}$ at $L_{\_}2$ evaluate to the same value under all program executions.

Proof.

We prove this lemma by an outer induction on the structure of the expression, and an inner induction on the blocks sorted in the topological order. First, we prove this lemma when $\mathsf{e_{\_}1}$ and $\mathsf{e_{\_}2}$ are both variables. The map $hashValue$ stores the terms corresponding to each variable in a particular block. Therefore, $ComputeHash(\mathsf{e_{\_}1})=hashValue[\mathsf{e_{\_}1}]$ , which implies $hashValue[L_{\_}1][\mathsf{e_{\_}1}]=hashValue[L_{\_}2][\mathsf{e_{\_}2}]$ . Also, by line 20 in Algorithm 2, we know that for a block $B$ , $hashValue[B]$ contains a variable $\mathsf{v}$ only if it evaluates to the same term in all its predecessors. By the inner induction hypothesis, this means that $\mathsf{v}$ evaluates to the same value in each predecessor. Also, whenever a statement of form “ $\mathsf{y:=x}$ ” is encountered, the term for $\mathsf{x}$ is assigned to $\mathsf{y}$ . Since this is the only way that two variables can have the same term, we have the proof of Lemma 4 for variables.

Now, consider the case when $\mathsf{e_{\_}1}$ and $\mathsf{e_{\_}2}$ are arbitrary expressions. Suppose $\mathsf{e_{\_}1}$ has the form $\mathsf{v_{\_}1.f}$ , while $\mathsf{e_{\_}2}$ has the form $\mathsf{v_{\_}2.g}$ . Since $ComputeHash(\mathsf{e_{\_}1})=ComputeHash(\mathsf{e_{\_}2})$ , we know that $\mathsf{f}=\mathsf{g}$ , and $ComputeHash(\mathsf{v_{\_}1})=ComputeHash(\mathsf{v_{\_}2})$ . This is easy to see from the fact that whenever $hashFunction$ is updated (line 31 in Algorithm 3), a new term is added to it. Now, by the outer induction hypothesis, we have that $\mathsf{v_{\_}1}$ and $\mathsf{v_{\_}2}$ evaluate to the same value in all executions and since $\mathsf{f}=\mathsf{g}$ , we have that $\mathsf{e_{\_}1}$ and $\mathsf{e_{\_}2}$ evaluate to the same value under all executions. That concludes the proof of Lemma 4. ∎

Essentially, Lemma 4 entails that term is an abstract representation of the value of an expression. Going back to our original proof of correctness, the variable $\mathsf{v}$ substitutes expression $\mathsf{expr}$ (at line 18 in Algorithm 3) only when $\mathsf{v}=defaultVar[L][ComputeHash(\mathsf{expr})]$ . Also, $defaultVar$ is updated only when an assignment of the form $\mathsf{v:=expr}$ is encountered (line 33 in Algorithm 3), and before this update, $ProcessStmt$ is called on the assignment. This sets $hashValue[S][\mathsf{v}]=ComputeHash(\mathsf{expr})$ . Combining the two arguments above, we have

[TABLE]

The last line follows from the fact that $\mathsf{v}$ is tagged with $\#$ , hence it is assigned only once, and it is available at $L$ due to the first condition in the proof of correctness. With the two conditions proved, we have that the transformation introduced in Algorithm 2 produces a semantically equivalent program, and executing an alias analysis algorithm on the new program will not add any false positives.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1]
2[2] Lars O. Andersen (1994): Program Analysis and Specialization for the C Programming Language . Ph.D. thesis, DIKU, University of Copenhagen.
3[3] Bengt Aspvall, Michael F. Plass & Robert Endre Tarjan (1979): A linear-time algorithm for testing the truth of certain quantified boolean formulas . Information Processing Letters 8(3), pp. 121 – 123.
4[4] Michael Barnett & Shaz Qadeer (2012): BCT: A translator from MSIL to Boogie . Seventh Workshop on Bytecode Semantics, Verification, Analysis and Transformation.
5[5] Jong-Deok Choi, Michael Burke & Paul Carini (1993): Efficient Flow-sensitive Interprocedural Computation of Pointer-induced Aliases and Side Effects . In: Principles of Programming Languages , pp. 232–245.
6[6] John Cocke (1970): Global Common Subexpression Elimination . In: Proceedings of a Symposium on Compiler Optimization , ACM, New York, NY, USA, pp. 20–24.
7[7] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman & F. Kenneth Zadeck (1991): Efficiently Computing Static Single Assignment Form and the Control Dependence Graph . ACM Trans. Program. Lang. Syst. 13(4), pp. 451–490.
8[8] Ankush Das, Shuvendu K. Lahiri, Akash Lal & Yi Li (2015): Angelic Verification: Precise Verification Modulo Unknowns , pp. 324–342. Springer International Publishing, Cham.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Taxonomy

Precise Null Pointer Analysis Through

Abstract

1 Introduction

2 Background

2.1 Programming Language

2.2 Alias Analysis

2.3 Static Single Assignment (SSA)

3 Overview

3.1 Common Subexpression Elimination

3.2 Global Value Numbering

4 Algorithm

5 Experimental Evaluation

6 Related Work

Context Sensitivity.

Flow sensitivity.

Other techniques.

7 Conclusion

Appendix A Proof of Correctness

A.1 First Condition

Lemma 1**.**

Proof.

Lemma 2**.**

Proof.

Lemma 3**.**

Proof.

A.2 Second Condition

Lemma 4**.**

Proof.

Lemma 1.

Lemma 2.

Lemma 3.

Lemma 4.