TL;DR
This paper presents a decision procedure for verifying memory safety in heap-manipulating programs that process forest data-structures in a single pass, demonstrating efficiency and practical applicability.
Contribution
It extends decidability results to alias-aware programs with map updates and develops algorithms for verifying memory safety in single-pass heap programs.
Findings
Decidable when initial heap is a forest and programs are streaming-coherent.
Verification algorithms are efficient for common single-pass data-structure algorithms.
Experimental results confirm practical applicability on library routines.
Abstract
We investigate the decidability of automatic program verification for programs that manipulate heaps, and in particular, decision procedures for proving memory safety for them. We extend recent work that identified a decidable subclass of uninterpreted programs to a class of alias-aware programs that can update maps. We apply this theory to develop verification algorithms for memory safety--- determining if a heap-manipulating program that allocates and frees memory locations and manipulates heap pointers does not dereference an unallocated memory location. We show that this problem is decidable when the initial allocated heap forms a forest data-structure and when programs are streaming-coherent, which intuitively restricts programs to make a single pass over a data-structure. Our experimental evaluation on a set of library routines that manipulate forest data-structures shows that…
| Program | LOC | Streaming- | Found | # States | Time |
| coherent? | Safe | (ms) | |||
| Verification of Memory Safe Programs | |||||
| sll-append-safe | 19 | yes | ✓ | 4 | 4 |
| sll-copy-all-safe | 27 | yes | ✓ | 6 | 3 |
| sll-delete-all-safe | 56 | yes | ✓ | 58 | 9 |
| sll-deletebetween-safe | 42 | yes | ✓ | 53 | 8 |
| sll-find-safe | 16 | yes | ✓ | 4 | 3 |
| sll-insert-back-safe | 20 | yes | ✓ | 3 | 3 |
| sll-insert-front-safe | 8 | yes | ✓ | 1 | 3 |
| sll-insert-safe | 50 | yes | ✓ | 12 | 4 |
| sll-reverse-safe | 12 | yes | ✓ | 3 | 3 |
| sll-sorted-concat-safe | 17 | yes | ✓ | 4 | 3 |
| sll-sorted-insert-safe | 50 | yes | ✓ | 12 | 4 |
| sll-sorted-merge-safe | 74 | yes | ✓ | 62 | 8 |
| bst-find-safe | 23 | yes | ✓ | 21 | 4 |
| bst-insert-safe | 45 | yes | ✓ | 29 | 6 |
| bst-remove-root-safe | 52 | yes | ✓ | 12 | 4 |
| avl-balance-safe | 190 | yes | ✓ | 48 | 12 |
| tree-rotate-left-safe | 25 | yes | ✓ | 3 | 3 |
| Finding Errors in Memory-unsafe Programs | |||||
| sll-append-unsafe | 20 | yes | ✗ | — | 3 |
| sll-copy-all-unsafe | 29 | yes | ✗ | — | 4 |
| sll-delete-all-unsafe | 58 | yes | ✗ | — | 5 |
| sll-deletebetween-unsafe | 44 | yes | ✗ | — | 4 |
| sll-find-unsafe | 18 | yes | ✗ | — | 3 |
| sll-insert-back-unsafe | 20 | yes | ✗ | — | 3 |
| sll-insert-front-unsafe | 9 | yes | ✗ | — | 3 |
| sll-insert-unsafe | 50 | yes | ✗ | — | 3 |
| sll-reverse-unsafe | 12 | yes | ✗ | — | 3 |
| sll-sorted-concat-unsafe | 17 | yes | ✗ | — | 3 |
| sll-sorted-insert-unsafe | 50 | yes | ✗ | — | 3 |
| sll-sorted-merge-unsafe-1 | 69 | yes | ✗ | — | 4 |
| sll-sorted-merge-unsafe-2 | 63 | yes | ✗ | — | 3 |
| bst-find-unsafe | 25 | yes | ✗ | — | 4 |
| bst-insert-unsafe | 49 | yes | ✗ | — | 6 |
| bst-remove-root-unsafe | 54 | yes | ✗ | — | 3 |
| avl-balance-unsafe | 111 | yes | ✗ | — | 3 |
| tree-rotate-left-unsafe | 20 | yes | ✗ | — | 3 |
| Detecting Programs are not | |||||
| streaming-coherent | |||||
| sll-sorted-merge-non-streaming-coherent | 75 | no | — | — | 4 |
| bst-remove-root-non-streaming-coherent | 55 | no | — | — | 3 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Deciding Memory Safety for Single-Pass Heap-Manipulating Programs
Umang Mathur
Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignUSA
,
Adithya Murali
Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignUSA
,
Paul Krogmeier
Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignUSA
,
P. Madhusudan
Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignUSA
and
Mahesh Viswanathan
Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignUSA
(2020)
Abstract.
We investigate the decidability of automatic program verification for programs that manipulate heaps, and in particular, decision procedures for proving memory safety for them. We extend recent work that identified a decidable subclass of uninterpreted programs to a class of alias-aware programs that can update maps. We apply this theory to develop verification algorithms for memory safety— determining if a heap-manipulating program that allocates and frees memory locations and manipulates heap pointers does not dereference an unallocated memory location. We show that this problem is decidable when the initial allocated heap forms a forest data-structure and when programs are streaming-coherent, which intuitively restricts programs to make a single pass over a data-structure. Our experimental evaluation on a set of library routines that manipulate forest data-structures shows that common single-pass algorithms on data-structures often fall in the decidable class, and that our decision procedure is efficient in verifying them.
Memory Safety, Program Verification, Aliasing, Decidability, Uninterpreted Programs, Streaming-Coherence, Forest Data-Structures
††copyright: rightsretained††doi: 10.1145/3371103††journalyear: 2020††journal: PACMPL††journalvolume: 4††journalnumber: POPL††article: 35††publicationmonth: 1††ccs: Theory of computation Logic and verification††ccs: Theory of computation Automated reasoning
1. Introduction
The problem of automatic (safety) verification is to ascertain whether a program satisfies its assertions on all inputs and on all executions. The standard technique for proving programs correct involves writing inductive invariants in terms of loop invariants and pre/post conditions, and proving the resulting verification conditions valid (Floyd, 1967; Hoare, 1969). While there has been tremendous progress in identifying decidable fragments for checking validity of verification conditions (Nelson-Oppen combinations of decidable theories realized by efficient SMT solvers (Bradley and Manna, 2007)), decidable program verification without annotations has been elusive. Apart from programs over finite domains, very few natural decidable classes are known.
In a recent paper (Mathur et al., 2019a), a class of uninterpreted programs was identified and shown to have a decidable verification problem. Uninterpreted programs work over arbitrary data domains; the domains give meaning to the constants, relations, and functions in the program, interpreting equality using its natural definition. A program is deemed correct only if it satisfies its assertions in all executions and for all data domains. The authors show that for a class of programs that satisfies a coherence condition, verification is decidable. The decision procedure relies on a streaming congruence closure algorithm realized as automata. The results of (Mathur et al., 2019a) are, however, purely theoretical; there are no general application domains identified where uninterpreted program verification would be useful, nor is there any implementation.
The goal of this paper is to study completely automated verification for heap-manipulating programs. When modeling the heap, we can treat pointers as unary uninterpreted functions; this is a natural modeling choice that does not involve any abstraction (as pointer fields are really arbitrary, unary functions). Thus, it is reasonable to hope that uninterpreted program verification techniques could be useful in this setting.
However, there is a fundamental challenge that we need to overcome: heap-manipulating programs modify heap pointers. With pointers modeled as functions, these programs modify functions/maps. It turns out that the theory of uninterpreted program verification developed in (Mathur et al., 2019a) is severely inadequate for tackling programs that modify maps.
In this paper, we undertake a fundamental study of verification for programs that have updatable maps/functions. We then apply this theory to show that checking memory safety of heap-manipulating programs (where one is given a recursively-described set of allocated locations and asked to check whether a program only dereferences locations that are within this set) is decidable for a subclass of programs, and evaluate the algorithms with an implementation and experiments.
1.1. Alias Awareness and the Notion of Congruence for Programs with Updatable Maps
We can think of uninterpreted programs as computating terms using the function and constants symbols appearing in the code—in the beginning, each variable stores a constant, and after executing an assignment statement of the form , the term stored in is , where is the term stored in before executing this statement. As the program executes, in addition to computing terms, it places equality and disequality constraints on the terms it computes (inherited from the conditionals in if-then-else and while constructs).
However, when the program updates maps, this fundamental property, that the program computes terms, is destroyed. Consider an example where there are two locations pointed to by variables and , but we do not know whether and point to the same location or not. If we update a pointer-field on and then read the same pointer-field from into a variable , we cannot really associate any term with the variable , as the term crucially depends on whether and were aliases to the same location or not.
The above is a fundamental problem that wreaks havoc, making program verification essentially impossible in the presence of updatable maps. The notion of coherence in (Mathur et al., 2019a) fails, as it crucially relies on this notion of terms. And it cannot be repaired, as the semantic meaning of the coherence condition fundamentally involves not computing the same term multiple times.
The primary observation, which saves the framework, is to note that the entire problem is due to not knowing aliasing relationships. We call a program execution *alias-aware * if, at every point where it updates a function on the element pointed to by a program variable , the precise aliasing relationship between and all other program variables in scope is determined. More precisely, when the execution modifies a function/pointer-field of , for every other program variable , it must either be the case that is different from in any data-model/heap or it must be the case that is equal to in all data-models/heaps.
We show that alias-awareness is a panacea for our problems. For alias-aware programs (programs whose executions are all alias-aware), we show we can associate terms with variables after a computation that updates maps, and further show that the notion of coherence extends naturally to programs that update maps. We then show that for coherent alias-aware programs, the verification problem becomes decidable. These results constitute the first main contribution of the paper.
1.2. Application to Verifying Memory Safety
We then study the application of our framework to verifying memory safety. Our key observation is that for programs that manipulate forest data-structures (data-structures consisting of disjoint tree-like structures), programs are naturally alias-aware. Intuitively, when traversing forest data-structures, aliasing information is implicitly present. For instance, if x points to a location of a forest data-structure, we know that the location pointed to by x, the one pointed to by the left child , and the one pointed to by the right child are all different.
In this paper, we define memory safety as follows. A heap-manipulating program starts with a set of allocated heap locations. During its execution, it dereferences pointers on heap locations, and allocates and frees locations. A program is memory safe if it never dereferences a location that is not in the allocated set. The above definition of memory safety captures the usual categories of memory safety errors such as null-pointer dereferences, use after free, use of uninitialized memory, illegal freeing of memory, etc. (Hicks, 2014). However, in this paper, we do not consider allocation of contiguous blocks of arbitrary size of memory (and hence do not handle arrays and buffer overflows of arrays in languages like C, etc.). Rather, we assume that allocation is done in terms of records of fixed size (like structs in C), and we disallow pointer arithmetic in our programs.
Our second main contribution of the paper is a technical result that gives an efficient decision procedure for verifying memory safety for a subclass of imperative heap-manipulating programs, whose *initial * allocated heaps are restricted to forest data-structures.
Handling forest data-structures, i.e., disjoint lists and trees, is useful as they are ubiquitous. Note that we require only the initial heaps to be forest data-structures; the program can execute for an arbitrarily long time and create cycles/merges as it manipulates the structures. We model the primitive types and operations on them using uninterpreted functions and relations, similar in spirit to the way the work in (Mathur et al., 2019a) handles all data. The key insight here is that this is a reasonable modeling choice, since programs typically do not rely on the semantics of the primitive data domains in order to assure memory safety (we also show this empirically in experiments). The salient aspect of our work is that we model the pointers in the heap precisely using updatable maps, without resorting to any abstraction (classical automatic analysis of heap programs typically uses abstractions for heaps; for instance, shape analysis involves an abstraction of the heap locations to a finite abstract domain (Sagiv et al., 1999)).
We allow the user to specify the initial allocated set as the (unbounded) set of locations reachable from various locations (pointed to by certain program variables) using particular pointer fields, until a specific set of locations is reached. The memory safety problem is then to check whether such a program, starting from an arbitrary heap storing a forest data-structure, an arbitrary model for the primitive types, and with the specified allocated set, dereferences only those locations that are in the (potentially changing) allocated set, on all executions.
The above problem turns out to be undecidable (a direct consequence from (Mathur et al., 2019a), as even programs that do not manipulate heaps and have simple equality assertions yield undecidability). The main result of the paper is that for a class of programs called streaming-coherent programs, memory safety is decidable. Intuitively, these correspond to programs that traverse the forest data-structures in a single-pass.
Technical Challenges.
The primary challenge is dealing with updatable pointers and updatable sets (the latter are needed to model the set of allocated locations, which changes during program execution). As we show in our first set of results, it is crucial for a program to be alias aware. The fact that our initial data-structures are forests implicitly causes streaming-coherent programs to be alias aware. When two variables point to locations obtained using different traversals, we know they cannot alias to each other. Also, for streaming-coherent programs, we can keep track of whether traversals for any pair of variables are the same, and track their precise aliasing relationships.
The culmination of the ideas above is our result that verification of memory safety for streaming-coherent programs over forest data-structures is decidable, and is -complete. It is in fact decidable in time that is linear in the size of the program and exponential in the number of variables. We also show that checking whether a given program is streaming-coherent is decidable in . Note that even checking reachability in programs with Boolean domains has this complexity, and hence our algorithms are quite efficient.
Evaluation.
We implement a prototype of our automata-based decision procedure. This involves intersecting an automaton that checks whether a given streaming-coherent execution is memory-safe with an automaton that represents the given program’s executions. Instead of building these automata and checking emptiness of their intersection, we use an approach that constructs the automaton and the intersection on-the-fly, hence not paying the worst case costs upfront. We evaluate our procedure on a class of standard library functions that manipulate forest data-structures, including linked lists and trees, where various other aspects of the data-structures (such as keys, height, etc.) are modeled using an uninterpreted data domain. These are typically single-pass algorithms on such data-structures, that take pointers to forest data-structures as input (and may create non-forest data-structures during computation).
Though we have stringent requirements that programs must meet in order to be in the decidable class, we show in our experiments that most natural single-pass programs on forest data-structures meet our requirements. We also show that our tool is able to check if the program falls in the decidable class, and both verify memory safety and find memory safety errors extremely efficiently.
We emphasize that the novelty of our approach is in building decision procedures for verifying memory safety without the aid of human-given loop invariants, and without abstracting the heap domain (the data domain is, however, abstracted using uninterpreted functions). In contrast, there are several existing techniques that can prove memory safety when given manually written loop invariants or prove memory safety by abstracting the heap (which can lead to false positives). Our results hence carve out new ground in memory safety verification and our experiments show that our approach holds promise for wider applicability and scalability.
In summary, this paper makes the following contributions:
- •
A notion of alias-aware coherent programs (Section 3), and a result that shows that the assertion-checking problem for such programs is decidable and -complete (Section 4).
- •
A notion of streaming-coherence and forest data-structures ( Section 5) with an efficient decision procedure for verifying memory safety for the class of streaming-coherent programs that dynamically manipulate forest data-structures (Section 6).
- •
An efficient decision procedure determining if programs are streaming-coherent (Section 6).
- •
An experimental evaluation ( Section 7) showing (a) common library routines that manipulate forest data-structures using single-pass traversals are often streaming-coherent, and (b) that the decision procedures presented in this paper (for checking whether programs satisfy the streaming-coherent requirement and for checking whether streaming-coherent programs are memory safe) are very efficient, both for proving programs correct and finding errors in incorrect programs.
2. Preliminaries
In this section, we define the syntax and semantics of programs that manipulate heaps and other relevant concepts useful for presenting the main results of the paper. These are fairly standard and familiar readers may skip the details.
2.1. Syntax of Heap-Manipulating Programs
The programs we consider are those that manipulate heaps. It is convenient to abstract heap structures as consisting of two sorts of distinct elements — a sort of memory locations in the heap, and a sort of data values. Field (or map) symbols will model pointers from memory locations to memory locations or data values. Constants and functions over the data domain will be used to construct other data values. In this paper, we will not assume any fixed interpretation for either data values or for functions on data values. In this sense, our programs work over an uninterpreted data domain. Predicates over data values will be modeled by functions capturing the characteristic function of the predicate.
Let and be the sorts of locations and data respectively. Our vocabulary is a tuple of the form , where
- •
is a set of location constant symbols of sort ,
- •
is a set of unary location function symbols with sort ‘’ 111We will use the notation to indicate a function whose arguments are from sort and which returns a value in sort . Thus for example ‘’ is a function with one argument of sort and which returns an element of sort . On the other hand, ‘’ denotes functions with arguments each of sort and which returns an element of sort ., that models pointers between heap locations,
- •
denotes the set of data constant symbols of sort ,
- •
where is a set of data function symbols of arity and sort ‘’, and
- •
is a set of unary location function symbols with sort ‘’, modeling pointers to data values stored in heap locations.
2.1.1. Program Syntax
Programs will use a finite set of variables to store information — heap locations and data values — during a computation. Let us fix as the set of location variables and as the set of data variables and let 222We use to denote the disjoint union of sets and . be the set of all variables. In addition, our programs manipulate fields associated with location variables. We will model these fields as second order function variables (pointers from locations to locations) and (pointers from locations to data), and let . Taking , , , in , , and to be a tuple of variables in , the syntax of programs is given by the following grammar.
[TABLE]
Our programs have well-typed assignments to variables using values stored in other variables ( and ) or using pointer dereferences from location variables, either to the data sort () or to the location sort (), or using function computations in the data sort (). Further, programs can update fields ( or ), and can dynamically allocate () or deallocate () memory. In addition, they allow the usual constructs of imperative programming — empty statements (skip), conditionals () and loops (while). Conditionals in programs can be Boolean combinations of (well-typed) equality atoms over location or data variables.
2.2. Program Executions
Executions of programs over and variables (given by the grammar) are finite sequences over the alphabet given below.
[TABLE]
The set of executions of a program , denoted is given by a regular expression, inductively defined below. We assume that conditionals are in negation normal form where translates to and translates to .
[TABLE]
The set of partial executions of a program , denoted , is the set of prefixes of its executions.
2.3. Semantics of Executions
The semantics of a heap-manipulating program is given in terms of the behavior of its executions on heap structures.
2.3.1. Heap Structures
A -heap structure is a tuple , where is a universe of locations, is a universe of data () and is some interpretation of the various symbols in . In order to faithfully model dynamic memory allocation, we assume that the set of locations is the disjoint union of a statically allocated set of locations and a countably infinite set of locations that can be allocated dynamically. That is, we have , where is an ordered set of distinguished locations indexed by the set of natural numbers . The interpretation maps every constant to an element from , every constant in to an element from , every function symbol to an element of , symbol of arity to an element of , and, to an element of . Further, we assume that the elements in cannot be accessed from , i.e., for every , we have . This ensures that the set of locations reachable by any execution cannot access locations in the dynamic universe that have not been allocated by the execution. Moreover, for every and every , we have . Finally, we assume in the presence of a distinguished constant symbol and a distinguished function symbol not used in the syntax of programs. We require that for every heap structure the interpretation function of assigns interpretations to these symbols such that and with for every .
2.3.2. Valuation of Variables and Pointers in an Execution.
We assume that corresponding to each program variable , there is a distinguished constant of the appropriate sort denoting the initial value of . Likewise, each field (resp. ) is also associated with a unary function (resp. ).
Given an execution of a program and a -heap structure , the valuation of the program variables and field pointers at the end of are defined in terms of valuation functions , , and , which are presented next. In the following, denotes the number of occurrences of statements of the form in .
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where by we mean the function defined as and otherwise.
Feasibility.
An execution is said to be feasible on if for every prefix of of the form , we have , and for every prefix of of the form , we have , where is the sort of both and .
3. Defining Coherence For Heap-Manipulating Programs
In this section we discuss some of the challenges involved in coming up with a reasonable extension for the notion of coherence as defined in (Mathur et al., 2019a). A key problem in defining such an extension, and indeed in generally handling programs with updatable maps, lies in keeping accurate track of aliasing between variables of the location sort. We will first discuss some examples that highlight these challenges and proffer a first solution to the problem. Then we will discuss our solution more formally by introducing relevant notation and define our notion of alias-aware executions and programs that captures the essence of the aliasing problem. Finally, we will discuss the notion of coherence adapted to the case of heap-manipulating programs.
3.1. The Importance of Being Alias Aware for Programs Updating Maps
Functions and updatable maps cannot be handled uniformly; in particular, the work of (Mathur et al., 2019a) does not immediately lend itself to handling updatable maps. Let us illustrate this using the following example.
Example 0.
Consider a straight-line program that generates execution , where
[TABLE]
In the execution above we do not, and in fact cannot, know the value stored in the variable unless we know whether or . In the former case, we will have that , since y aliases x and, therefore, the update of the next pointer on y will have over-written the value of the pointer on x. Similarly, in the latter case we will have that , since the pointer update on y will have no effect on x.
This is a major difference, since in any heap structure on which is feasible, it must be that . Therefore, whether x and y are aliases of each other can have a drastic effect on the semantics and feasibility of the execution.
The above example illustrates that aliasing plays a crucial role in the semantics and feasibility of executions. There is, however, a second (related) issue as well.
Example 0.
Consider the executions and where
[TABLE]
As discussed earlier, execution is only feasible in models where and only in models where where at the end of both kinds of models were feasible. Therefore one can have implied equalities and disequalities between variables that are only known much later in the execution. In general, this is hard to keep track of in a streaming setting (which we wish to do in order to use the underlying ideas in (Mathur et al., 2019a)) and can require an unbounded amount of memory, as can be seen in the following example.
Example 0.
Consider the following execution
[TABLE]
The above execution is feasible, but would require an unbounded amount of memory to reason as such in a streaming fashion. Next we will see a solution to this aliasing problem that will allow us to keep track of relationships between variables and the correct semantics of an execution.
3.2. Alias-Aware Executions and Programs
In Section 3.1, we saw that unlike the work of (Mathur et al., 2019a) we cannot define what term (or value) is computed by a variable. We also observed that the main issue with programs that have updatable maps is aliasing — i.e., when a pointer or data field is updated on a variable, the update may also be true for a different variable that aliases the original variable. In this section we identify a class of executions, called alias-aware executions, which implicitly resolves any aliasing when updating the pointer fields on heap locations.
Below, we formally define alias-aware executions. We denote by the set of heap structures on which the execution is feasible.
Definition 0.
Let be an execution. is said to be alias-aware if for every prefix of of the form (the sorts of and are compatible), and for every location variable , one of the following hold
- •
For every heap structure , .
- •
For every heap structure , .
Intuitively, the above definition says that for an alias-aware execution, at a map update we know all the relevant aliasing information (even in the uninterpreted sense), because the variables that alias to the variable on which the update is being performed are the same in every feasible model.
Definition 0 (Alias-Aware programs).
A program is said to be alias-aware if every execution is alias-aware.
3.3. Term Computed by a Variable during an Execution
Before we define coherence in our setting, we require the notion of terms that an execution computes. The notion of terms lets us reason about infinitely many heap structures in a symbolic fashion and was crucially exploited in (Mathur et al., 2019a) to define coherence and in arguing the correctness of the decision procedure designed for coherent programs.
We would like to define the term associated with a program variable in an execution. There are two major challenges in our way. The first challenge is that, in order to accurately determine the value pointed to by a variable during an execution on a concrete heap structure, one needs to accurately keep track of the interpretations of pointer fields (that could have, in turn, been updated earlier in the execution). A similar problem needs to be addressed in order to successfully define the notion of terms. The second challenge is that unlike in concrete heap structures where two distinct elements in the heap are known to be unequal, terms do not behave the same way. For the case of terms, we would like to explore the possibility that two terms that might be syntactically different might still be semantically equivalent. While this subtlety can be dealt with easily when none of the functions are updatable, as in (Mathur et al., 2019a), greater care is required in our setting. Let us first formalize some notation and then elaborate these subtleties.
For a set of constant symbols and a set of function symbols from , we use (or just Terms when the signature is clear) to denote the set of (well-sorted) ground terms constructed using the constants in and function symbols in .
For a binary relation , the congruence closure of , denoted is the smallest equivalence relation such that (i) , and (ii) for every function of sort () and terms of sorts , we have \Big{(}\bigwedge\limits_{i=1}^{r}(t_{i},t^{\prime}_{i})\in\cong_{R}\Big{)}{\implies}(f(t_{1},\ldots,t_{r}),f(t^{\prime}_{1},\ldots,t^{\prime}_{r}))\in\cong_{R}. We say if , where are terms.
We now define the terms associated with variables , the interpretations of updatable maps on terms corresponding to variables , and the set of equalities accumulated by the execution . Recall that, in the following, and are special symbols in our vocabulary.
[TABLE]
[TABLE]
[TABLE]
The set of terms computed by an execution is the set .
Let us discuss some aspects of the above definition here. The effect of some of the commands in the executions is obvious and similar to (Mathur et al., 2019a). At the beginning of an execution, each term is associated with a unique constant term , each pointer assigns every term (of the location sort) to , and the set of equalities accumulated is . On an assignment , the term stored in is updated to be that stored in . On an assignment involving a (non-updatable) function (), the execution computes the term , and stores it in , where is the term associated with before the assignment.
Let us now look at the other commands. Recall that we model our set of dynamically allocated locations as a disjoint set from the set of statically allocated locations. The term associated with after an follows this premise. It is built using applications of a distinct function on a distinct constant , where is the number of allocations performed before this point in the execution. Let us now turn to the most subtle aspect: pointer updates. When the execution encounters a command of the form , we need to not only update the term pointed to by (where is the term associated with ), we also need to update the term pointed to by , when can be inferred to be equivalent to using the equalities accumulated so far.
Example 0.
Consider the following execution.
[TABLE]
Here, the terms associated with the pointers and are and respectively. After the , the locations pointed to by x and y are the same in every heap structure on which this assumption is valid, and thus the terms corresponding to x and y must be deemed equivalent. This means that the pointer update should in fact be reflected in the term associated with the pointer . The above definition of and will, in fact, ensure that . Not doing so will, in turn, not reflect the contradiction due to the last statement of this execution, which makes it infeasible in every heap structure.
It turns out that the above definition of terms associated with variables is only a best effort, in that, it may not accurately summarize all heap structures in which the execution is feasible. The following example illustrates why this is the case.
Example 0.
Let us consider the following, which is a permutation of the execution in Example 6.
[TABLE]
Similar to the execution in Example 6, the execution is infeasible in every heap structure. However, in this case, the terms associated with and will be and , which are not equal, even when considering the equivalence induced due to . As a result, it is hard to conclude that is an infeasible execution by analyzing the terms associated with variables alone.
However, for the case of an alias-aware execution (Definition 4), the definition of terms above indeed accurately relates the terms associated with each variable to their values in every heap structure that makes the execution feasible. This brings us to the first main result of the paper that motivates the definition of alias-aware executions. It simply states that the interpretation of the terms defined above on a given model coincides with the actual values computed on the model.
Theorem 8**.**
Let be an alias-aware execution and let be a heap structure on which is feasible. Then, for every variable and for every .
Moreover, and for every and .
Proof.
The proof proceeds by induction on the length of the execution. The base case and most inductive cases are trivial by the definitions of , etc. The only nontrivial case is that of map update. Consider an execution of the form . It is easy to see that definitions of and update the map on the location pointed to by . However, the definition of Comp updates the appropriate term for the map only on variables whose terms are equivalent modulo . We simply establish that for alias-aware executions the equivalence modulo is sufficient to determine which variables alias . This is owing to the fact that the initial model of terms modulo the congruence closure of is feasible on . And in that model, the only variables that alias are those that compute terms equivalent to modulo . Since is alias-aware, we have that in any model the variables that alias to are the same as those that alias to in the initial model.
We establish that the initial model of an alias-aware execution is a feasible model of , by strong induction on the length of the execution, where we strengthen our inductive hypothesis. We present the stronger inductive hypothesis below, the proof of which is a simple mutual (structural) induction on the execution.
Given an alias-aware execution that is feasible (i.e., feasible in some model) and a sub-execution :
- •
is feasible on , where is the initial term modulo the congruence closure of .
- •
, where represents the equivalence class of terms modulo the congruence defined by .
Similarly,
and
- •
is feasible on .
- •
Similarly,
and
∎
3.4. Coherent Programs
Having defined the concept of terms associated with variables in an execution, we can now extend the notion of coherence, adapting it from (Mathur et al., 2019a) to accommodate updatable pointers as in the heap-manipulating programs that we study.
In the following, we say that is a superterm of modulo some congruence relation if there are terms and such that , is a superterm of and .
Definition 0 (Coherence).
A complete or partial execution is said to be a coherent execution if it satisfies the following two conditions.
**Memoizing.: **
Let be a prefix of such that it is of the form or and let . If there is a term such that , then it must be the case that there is some variable such that .
**Early Assumes.: **
Let be a prefix of the form and let . If there is a term such that is a superterm of or modulo , then there must be some variable such that .
Note that these conditions are applicable to every sort-sensible combination of symbols.
As mentioned earlier, this definition of coherence is inspired from the notion of coherence defined previously in (Mathur et al., 2019a). The memoizing restriction is the heart of the notion of coherent executions. Let us illustrate with a simple example.
Example 0.
Let be the following execution:
[TABLE]
It is easy to see that in the above execution , the values of the variables u and v will be equal after executing all the assignments, in any heap structure. As a result of the last assumption , the execution will thus be infeasible in all heap structures.
However, in order to accurately determine the relationship at the end of the execution, one needs to keep track of an unbounded amount of information (as can be increased unboundedly). This is the crucial insight in the first condition (memoizing), which states that when a term has been computed and dropped (i.e., no variable points to the term), then the execution should not recompute this term. Indeed, the above execution does not meet this requirement — the execution computed the term in its first half and stored it in u, re-assigned this variable u immediately after computing , thereby losing all copies of , and then later recomputed this term again in its second half. In fact, each of the terms () have been recomputed without retaining their original copies. Clearly, is not a coherent execution.
A similar example highlights the importance of the second coherence restriction (early assumes).
Example 0.
Let be the following execution:
[TABLE]
Observe that this execution is similar to the execution in Example 10. Also observe that, as before, this execution is infeasible in any heap structure. However, any algorithm that would accurately determine this in a streaming fashion would require access to unbounded memory.
Definition 0.
A program is said to be coherent if all its executions are coherent.
4. Assertion Checking for Coherent Alias-Aware Programs
In this section we discuss our first main result — decidability of assertion checking for programs that are both alias-aware as well as coherent.
Let us first define the problem of assertion checking. For this, we augment our programs with a special statement ‘assert(false)’, and thus our new syntax is given by the following grammar ( is as defined in Section 2.1.1).
[TABLE]
Observe that more complex assertions (including boolean combinations of equality/disequality assertions) can be expressed by translating them to conditional statements. For example, the assertion ‘’ would translate to the program ‘’.
The set of executions for programs with assertions consists of sequences over the alphabet and can be defined as in Section 2.2, with the addition of the following rule:
The functions and in the presence of are defined as before for execution prefixes without , and can be assumed to map all elements in their domain to a special symbol for all executions containing . The feasibility of a partial execution on the alphabet on a heap structure is defined as before (Section 2.3) and is undefined otherwise.
Definition 0 (Assertion Checking for Heap-Manipulating Programs).
The problem of assertion checking for a program is to check whether for every heap structure and for every partial execution of of the form , we have that is infeasible on .
We first note that the above problem is undecidable in general, a direct consequence of (Mathur et al., 2019a). Indeed, uninterpreted programs are heap-manipulating programs that do not mention any heap variables.
Theorem 2**.**
Assertion checking for heap-manipulating programs is undecidable.
Finally, we state our first decidability result. In the following, an alias-aware coherent program is a program that is both coherent and alias-aware.
Theorem 3** (Decidability of Uninterpreted Assertion Checking).**
Assertion checking is decidable for the class of alias-aware coherent programs and is -complete.
The proof of the above result relies on the following observations. First, for alias-aware executions, the terms associated with variables reflect their relationships on all heap structures (Theorem 8). Second, in this case, the streaming congruence closure algorithm for checking feasibility of coherent executions introduced in (Mathur et al., 2019a) can be extended directly to the case of coherent executions of heap-manipulating programs. The -hardness follows from the -hardness result for uninterpreted programs without updates ((Mathur et al., 2019a)). As a consequence of these observations, we also obtain the following result:
Theorem 4**.**
The problem of checking membership in the class of coherent and alias-aware programs is decidable and is in .
5. Memory Safety, Forest Data-Structures, and Streaming-Coherence
We now tackle the problem of memory safety verification for heap-manipulating programs. The goal here is, given a program and an allocated set of locations (defined using a reachability specification), to check whether the program only dereferences (using pointer fields) locations that are in the (dynamically changing) set of allocated locations. We develop a technique for when the initial heap holds a forest data-structure (disjoint lists and tree-like structures). We define a subclass of programs, called streaming-coherent, for which memory safety is decidable. The key idea is to utilize the fact that the initial heap is a forest data-structure in order to make programs alias aware.
For an execution that works over a forest data-structure, one can accurately infer the aliasing relationship between two program variables by tracking if they point to locations on the heap obtained by traversing the same path (i.e., starting from the same initial location and taking the same pointers at each step). Intuitively, in a forest data-structure, two distinct traversals always lead to different locations. In addition, when the program execution does a single pass over the data-structure, one can keep track of the aliasing relationship between variables (or equivalently, whether the locations pointed to by two variables are same or not) using a streaming algorithm. The notion of streaming-coherence essentially ensures such a single pass. Finally, updatable maps can be used to keep track of the initial allocated set and the allocations/frees performed by the program during its execution. Consequently, memory safety can be reduced to assertion checking.
This section is organized as follows. Section 5.1 introduces the reachability specifications that specify the initial set of allocated locations, and defines the memory safety problem formally. We also show here that memory safety is undecidable in general, and undecidable even for coherent programs. Section 5.2 defines the class of forest data-structures, and shows that memory safety remains undecidable for programs that start with a heap that holds a forest data-structure. Section 5.3 defines the notion of streaming-coherent programs.
5.1. Reachability Specification and Memory Safety
Reachability Specification
Heap-manipulating programs can be annotated by a reachability specification that restricts the allowable nodes that can be accessed by a program. A reachability specification is an indexed set of triples where is such that , and . Each triple denotes a set of locations , which is the least set that contains , does not include and is closed under repeated applications of pointer fields . More formally, given a heap structure with interpretation , gives a set of locations, which is the smallest set such that
(a) , and
(b) for every and for every , if , then .
We let . Often the heap structure will be implicit and we will omit mentioning it.
Allocated Nodes
Starting with a reachability specification on a given heap structure , an execution defines a set of allocated nodes, which we denote as and define as follows.
[TABLE]
Memory Safety
An execution is said to violate memory safety over a heap structure with respect to a reachability specification if there is a prefix of such that is feasible over and one of the following holds.
- (1)
dereferences a location that was not allocated. That is, is of the form or , and and are variables and pointer fields of appropriate sorts, such that . 2. (2)
frees an unallocated location. That is, is of the form and .
An execution is memory safe over with respect to if it does not violate memory safety over with respect to . With this, we can now define the memory safety verification problem. In the following, we fix our signature .
Definition 0 (Memory Safety Verification).
The memory safety verification problem asks, given a program and a reachability specification , whether for all heap structures , each execution is memory safe over with respect to .
We show, unsurprisingly, that checking memory safety is undecidable in general.
Theorem 2** (Undecidability of Memory Safety).**
Memory safety verification is undecidable.
Proof.
In (Mathur et al., 2019a), the authors consider uninterpreted programs, which are programs that have variables taking values in a data domain that is uninterpreted; programs in (Mathur et al., 2019a) don’t have heap variables, and do not modify heaps. It was shown (Theorem 11 in (Mathur et al., 2019a)) that given an uninterpreted program , the problem of determining if there is a data domain and an execution of , such that is feasible in , is undecidable. Our result here can be proved by a simple reduction from that problem. Let be an uninterpreted program. Consider the reachability specification such that is a new variable. Consider program . Observe that is memory safe with respect to if and only if does not have a feasible execution with respect to some data model. ∎
Given the undecidability result in Theorem 2, we need to identify a restricted subclass of programs and initial heap structures for which the problem of verifying memory safety is decidable. This leads us to the notions of forest data-structures and streaming-coherent programs. In some sense, forest data-structures are natural classes of heap structures where the alias-aware restrictions become very minimal. We combine these remaining restrictions with our usual notion of coherence to introduce a new notion of streaming-coherent programs. We define these below.
5.2. Forest Data-Structures
Let us now define our characterization of heap structures that will be amenable to our decidability result. The main restriction is stated in bullet (3) below and intuitively disallows two different paths to the same location on the heap.
Definition 0 (Forest Data-structures).
A heap structure over a signature is said to be a forest data-structure with respect to a reachability specification if
- (1)
for every , each set of stopping locations is a singleton set of the form , 2. (2)
for every , and for every , we have that , and 3. (3)
for every and we have, if , then either , or .
Intuitively, a heap structure is a forest data-structure with respect to , if the subgraph induced by the set of nodes (excluding ) reachable from , using any number of pointers from forms a tree, and further, any two subgraphs and do not have a node in common (except possibly for the starting locations). Notice that Definition 3 do not impose any restrictions on the elements of the data sort of a heap structure.
Our notion of forest data-structures handles the aliasing problem while still being able to express many practical reachability specifications. Consider the following example similar to our pathological example from Section 3.1.
Example 0.
where
[TABLE]
With respect to forest data-structures the above pathological execution is, in fact, infeasible. To understand this, observe that for a forest data-structure, if two different locations (x and y in the execution for instance) are not equal to the stopping location ( and ), then these locations are also different from each other (and thus ). This means that in the execution , the update will not affect the value of . Now, when the execution reads the value of in , it is expected to be the same as earlier (i.e., some location as pointed to by ), but since , we must have . This means that the last assume makes the execution infeasible.
However, the notion of forest data-structures, by itself, is not enough to ensure decidability, as is shown by the following result.
Theorem 5**.**
The memory safety verification problem for forest data-structures is undecidable.
Proof.
Follows trivially from Theorem 2 because the reachability specification used in the proof of Theorem 2 is such that is the empty set. ∎
Next, we introduce a class of programs called streaming-coherent programs that, in conjunction with forest data-structures results in a decision procedure. We will discuss the decision procedure in Section 6.
5.3. Streaming-Coherent Executions and Programs
In this section, we identify the class of streaming-coherent programs and executions, for which we show decidability. It is analogous to the notion of coherence (Definition 9), except that it uses a different notion of equivalence (for terms) instead of . We define this new notion of equivalence (forest equality closure) below. Intuitively, this notion characterizes the behavior of executions on the class of forest data-structures and allows us to adapt the algorithm used for assertion checking of alias-aware coherent programs (Theorem 3).
Definition 0 (Forest Equality Closure).
Let be a reachability specification, with . Let (). Let be an equality relation on terms. The forest equality closure of with respect to , denoted is the smallest congruence relation that satisfies the following.
- •
.
- •
For every and such that , we have
[TABLE]
We will now define the notion of streaming-coherent programs with respect to a given reach specification . To do this, we first need the notion of streaming-coherent executions. This notion is very similar to the notion of coherence (Definition 9) and requires that terms are not recomputed and that assumes over the data sort are early.
Definition 0 (Streaming-Coherence).
Let be a complete or partial execution. is said to be streaming-coherent if it satisfies the following two conditions.
**Memoizing.: **
Let be a prefix of of the form or and let . If there is a term such that , then it must be the case that there is some variable such that . This condition is applicable to every sort-sensible combination of symbols.
**Early Assumes.: **
Let be a prefix of the form where and . If there is a term such that is a superterm of or modulo , then there must be some variable such that .
Observe that the above definition is close the definition of coherence (Definition 9), but nevertheless there are important differences. First, the early assumes requirement does not apply to variables of the sort because forest data-structures simplify that requirement. Second, the congruence with respect to which equalities are demanded in the above definition is not one of the congruence closure defined by the equalities accumulated by the execution (which is the congruence used in Definition 9), but rather the congruence of forest equality closure.
Definition 0.
A program is said to be streaming-coherent with respect to if all its executions are streaming-coherent with respect to .
We note here that streaming-coherent programs on forest data-structures are essentially one-pass algorithms. Intuitively, forest data-structures enforce the alias-aware restriction by mandating that two locations obtained by two different traversals from the set of initial locations (therefore being represented by two different terms) are different (with minor exceptions). Therefore if a streaming-coherent execution computes a term twice, i.e., visits a location twice, by definition it has to store a pointer to that location in some variable. Since the number of variables is fixed a priori, it is simple to see that a nontrivial, multi-pass algorithm such as linked-list sorting would inherently be non-streaming-coherent (since it is meant to work on lists of arbitrary size).
6. Streaming Congruence Closure for Forest Data-structures
We will now focus on streaming-coherent programs whose initial heap is a forest data-structure. We now present the second main result of our paper.
Theorem 1** (Decidability of Memory Safety for Streaming-Coherent Programs and Forest Data-Structures).**
The memory safety verification problem over forest data-structures for streaming-coherent programs is decidable and is -complete.
Given a reachability specification, we present an algorithm for checking memory safety of such programs. This will establish the decidability and the membership of the problem in . The hardness follows from the -hardness result in (Mathur et al., 2019a).
Our algorithm is automata theoretic— we construct a finite state automaton such that its language includes all streaming-coherent executions that are memory safe and excludes all streaming-coherent executions that are not memory safe. Finally, in order to determine if a given streaming-coherent program is memory safe, we simply check if ; since is also a regular language, this reduces to checking intersection of regular languages.
Let us first discuss some intuition for our construction. The state of the automaton has several components which can be categorized into two groups:
(1) those that track feasibility of executions, and
(2) those that are used to check for violations of memory safety.
The first category comprises of three components which keep track of equalities, disequalities, and functional relationships between variables at a given point in the execution. These components are inspired by the work of (Mathur et al., 2019a) in the context of coherent uninterpreted programs, where this information was used to answer questions of assertion checking/feasibility. Our extended and refined notion of streaming-coherent programs also has this property, and these three components serve the same roles in our setting. Similar to the work in (Mathur et al., 2019a), when an execution is streaming-coherent, these three components can be accurately maintained at every step in the execution by performing local congruence closure.
We now describe the remaining components of the state, namely those that help check for memory safety. We know from our earlier discussion that forest data-structures are implicitly alias aware. This is because such heap structures consist of many disjoint reachability specifications each of which describes a tree. Therefore, nodes obtained by different pointer traversals from the initial locations are different, and each tree in the forest is closed under a given set of pointers. We keep track of locations (for the purpose of memory safety) primarily using three ordered collections which are disjoint:
(a) a collection of ‘yes’ sets,
(b) a collection of ‘maybe’ sets and
(c) a ‘no’ set .
Each and above correspond to the reachability specification . In the following, we describe each of these.
is a set of variables that point to locations which, when dereferenced, lead to a memory safety violation. These locations include the stopping locations of any of the reachability specifications, or those locations in the allocated set that have been freed during the execution. Consider the reachability specification . At the beginning of any execution, the set for this specification is . Further, when the execution encounters the statement , we update the component of the state by adding y.
The sets hold a subset of program variables that point to locations obtained by traversing the reachability specification . In particular, a variable z belongs to when we know that the corresponding location can be dereferenced without causing a memory safety violation in any heap structure. This can happen when the execution establishes that the location pointed to by z is not the stopping location of (i.e., ). Consider and the single step execution . At the end of we know that in all heap structures in which is feasible, it must be that the location pointed to by x can be dereferenced, and therefore we would add x to the corresponding ‘yes’ set.
Lastly, the ‘maybe’ sets hold variables that can be obtained by traversing the tree of , but are neither known to be stopping locations/freed locations, nor known to be in the allocated set in all feasible heap structures. Consider again and the execution . Recall that x belongs to the ‘yes’ set after the first statement (i.e., the execution ) and we can therefore dereference it. However, at the end of the execution we would include y in the ‘maybe’ set because the location is not in the allocated set in all feasible heap structures. In particular, a heap structure defining a list of length one is a feasible structure for this execution but would be , which is not in the allocated set. Now consider the execution . After the last statement in we would now shift y from the ‘maybe’ set to the ‘yes’ set.
This is precisely what our automaton does, shifting variables between components and flagging memory safety violations when the execution dereferences (the location pointed to by) a variable that is not in any . However, we do this modulo equalities between variables and these sets described above are in fact sets of equivalence classes. Lastly, we also keep track of variables that point to locations allocated by using a set of equivalence classes of variables. Of course, we would transfer an equivalence class from to if the execution frees this memory, similar to what happens to members of the ‘yes’ sets.
6.1. Formal Description of Construction
We shall now proceed with the formal construction of the automaton , which accepts a streaming-coherent execution iff it is memory safe. Recall that our reachability specification is an indexed set of tuples , where . To simplify presentation, we assume that the set of variables in our programs is such that for every constant appearing in the reachability specification , there is a variable corresponding to . Further, we assume that these variables are never overwritten. We will, therefore, often interchangeably refer to these constants in by their corresponding variables, and vice versa. These assumptions can be relaxed with a more involved construction.
The automaton is a tuple , where is the set of states, is the initial state and is the transition function. Recall that executions are strings over , which is also the alphabet of the automaton . We describe each of these components below.
States. The automaton has two distinguished states and . All other states are tuples of the form , where each component is as follows.
- •
is an equivalence relation over that respects sorts. We will use to denote the set
- •
is a symmetric set of pairs of the form , where are equivalence classes.
- •
associates partial mappings to function symbols in and pointer field in . More formally, for every of arity , and classes , we have (if defined). Similarly, for every , and every , , and for every , and every , . We will say when the function/pointer symbol is not defined on the arguments .
- •
and are indexed collections of sets such that for every , we have .
- •
The sets , and are sets of equivalence classes of location variables, i.e., .
Initial State. The initial state is the tuple such that
- •
is the identity relation on the set of variables,
- •
is such that for all functions and pointer fields , the range of is empty,
- •
each of and are (empty set),
- •
for each , ,
- •
.
- •
.
Transitions. The states and are absorbing states. That is, for every , and . In the following, we describe the transition function for every other state in . Let , and let . Then, is or , or is of the form .
- (1)
Case , .
In this case, we add the variable into the class of and appropriately update each of the components. That is, {\equiv}^{\prime}=\big{(}{\equiv}\setminus\{(u,u^{\prime}),(u^{\prime},u)\,\mid\,u\neq u^{\prime},u^{\prime}\in[u]_{{\equiv}}\}\big{)}\cup\{(u,v^{\prime}),(v^{\prime},u)\,\mid\,v^{\prime}\in[v]_{{\equiv}}\}. The other components of the state are the same as in . 2. (2)
Case , and .
In this case, we need to check if the variable corresponds to a location that can be dereferenced. If not, we have a memory safety violation; otherwise, we establish the relationship in the next state. Formally, if there is no such that and if , then . Otherwise we have that or there is a such that . In this case we define the tuple below. Here, we need to consider the following cases.
- •
Case is defined and equals . In this case, is defined in the same manner as though .
- •
Case . Here, for streaming-coherent programs it must be the case that for some . Here, we create a new singleton equivalence class containing and set the value of the map, on to be this new class. We also assert that is not equal to any class in any of the s or in . That is,
- –
.
- –
- –
. For all other combinations of functions/pointers and arguments, behaves same as .
- –
One of the sets and are updated depending upon the pointer . If , then . Otherwise, .
- –
All other components are the same as in . 3. (3)
Case , and .
As in the previous case, if . Otherwise, similar to the previous case, we have two cases to consider. As before, if there is a variable such that , then we treat this case as that of . Otherwise, the new equivalence relation is such that , while the other components are the same as in 4. (4)
Case , .
We uniformly handle the case of being either a pointer field () or a data field (). Here we have if . Otherwise, we simply change as follows (while keeping other components the same as in ):
- •
for
- •
- •
For such that , if for some variable , and if no such variable exists. 5. (5)
Case , .
Here, we have two cases to consider again. If there is a variable such that , then we treat this case as that of . Otherwise, we add a singleton equivalence class containing and update , while keeping all other components the same (modulo the new equivalence relation). Formally,
- •
.
- •
is same as in if . The evaluation of on is described as follows.
[TABLE]
- •
All other components are the same as in . 6. (6)
Case .
In this case, we create a new singleton class containing , and add this class to the component . We also assert that this new class is not equal to any other class. Formally,
- •
.
- •
- •
.
- •
The component is updated as follows. For every function symbol , is the same as . Further, for every pointer field and for every variable , we have . Finally, the mapping on is defined as for every location pointer and for every data pointer .
- •
All other components are updated as usual. 7. (7)
Case .
In this case, if , then . Otherwise, we remove the class from the sets or and add it to the set . That is,
- •
- •
- •
for every ,
- •
- •
Other components remain the same. 8. (8)
Case , . In this case, if then . Otherwise we have several cases to consider.
In each of these cases, we construct a new tuple . Finally, we set if ; otherwise we have .
- •
The first case to consider is when . In this case, we merge and . More formally, is the smallest equivalence relation such that \big{(}{\equiv}\cup\{(x,y)\}\big{)}\subseteq{\equiv}^{\prime\prime}. Further, for every component with the corresponding component in being , and for every such that , we have iff . For a variable , we have iff . The other components of are the same as in (modulo the new equivalence classes).
- •
Otherwise, consider the case when for some . In this case, in addition to adding we also add the pair . Similarly if for some we add . Construct the state with being the smallest equivalence relation including these new pairs (and other components remaining the same). 9. (9)
Case , . Similarly as above, in this case when we have . If then . Otherwise, we have the following cases:
- •
and for some (or vice versa). In this case, we simply put the equivalence class of into and assert that is unequal to all other classes. More formally, , . The other components remain the same.
- •
Otherwise, we simply update ; all other components are the same. 10. (10)
Case , . In this case, we merge equivalence classes repeatedly and perform a ‘local congruence closure’. We construct a state to determine if the transition must be to . More formally, we define the state with the component as the smallest equivalence relation such that:
(a) (b) If for and then .
The other components remain the same. In particular, it is correct to retain the component since the above construction is a congruence relation. Finally, if there exist such that and then . Otherwise, . 11. (11)
Case , . Similarly as above, if then . Otherwise, we update and all other components remain the same.
The following theorem states the correctness of the automaton .
Lemma 2.
Let be a streaming-coherent execution and let be a reachability specification and let be the state of the automaton after reading . Then, iff there is a forest data-structure (with respect to ) such that violates memory safety on .
The problem of checking if a streaming-coherent program is memory safe against a given specification is decided as follows. Recall that the set of executions of a given program constitutes a regular language . Let be the set of executions that do not go to the state . Then, the problem of checking if is memory safe reduces to checking if .
Next, we show that the problem of checking streaming-coherence is also decidable. To address the problem of checking streaming-coherence, we construct an automaton similar to (similar to the automaton for checking coherence in (Mathur et al., 2019a)) that keeps track of the following information. For every function/pointer of arity , and for every tuple of variables (of appropriate sorts), each state of the automaton maintains a boolean predicate denoting whether or not has been computed in any execution that reaches the state. This gives us our next result.
Theorem 3**.**
The problem of checking whether a given program is streaming-coherent with respect to a given reach specification defining a forest data-structure is decidable in .
7. Implementation and Evaluation
We implemented a tool (Mathur et al., 2019c) for deciding memory safety of forest data-structures based on the construction from Section 6. The tool is ~2000 lines of Ocaml 4.07.0 code. It takes as input a program from the grammar presented in Section 2.1.1 annotated with a reachability specification, as in Section 5.1. The tool does not explicitly construct the automaton from Section 6. Instead, it implements a fixpoint algorithm, described below, which determines the set of states associated with every program point, and uses this set to check for memory safety.
7.1. Fixpoint Algorithm
Here we give a high-level description of the implementation. First, observe that has exponentially many states in the number of program variables.We manage this by implementing the transition from Section 6 and only building automaton states as they are encountered. For straight-line programs, each transition results in a single new state. For complex programs that use if-then-else and while statements, we need to keep track of a bag of states. To see this, suppose we are checking the program . To begin our bag of states contains only the initial state . In order to process the if-then-else, the initial state needs to make two separate transitions, one for each of the two executions generated by the two branches. The bag of states thus grows to include and . The branches can then take transitions starting from and . The union of the reachable states from each branch gives us a new bag of states from which to process the remaining program. The intuition for if-then-else carries over to while. From any state in our bag at the beginning of a while, we collect the bag of states that results from any number of executions of the loop guard and body, starting from that state. The number of states is finite, and thus a fixed point is guaranteed. For the benchmarks considered here, the number of states explored by the tool is significantly smaller than the worst case.
If at any point the tool detects a memory safety violation it halts and reports the error. In addition to memory safety, it also monitors the streaming-coherence property as it processes the input program. To do so, it keeps track of terms that were computed using existing equivalence classes, but which were subsequently dropped. If the program attempts to compute the same term using the same classes, the implementation flags a failure of streaming-coherence and halts. For example, to process , the tool checks that was not previously applied to and later dropped. In general, this information can be maintained by remembering the equivalence classes on which any function (of arity ) has been computed.
The algorithm implemented by the tool is more general than we have described thus far. It can output the set of states reached at the head of a loop. By interpreting individual states as conjunctions of equality and disequality, and the set of states as a disjunction of such conjuncts, we can obtain an inductive invariant that proves memory safety (when the program is memory safe). Any assertion in the form of a Boolean combination of equality statements on program variables can also be checked. This can be accomplished by appending the negated assertion to the end of the program and checking that all reachable states are infeasible.
7.2. Benchmarks
We seek to answer the following basic questions about streaming-coherent programs and our algorithm. First, is it the case that the most natural way to write heap-manipulating single-pass programs on lists and trees results in streaming-coherence? Second, for streaming-coherent programs with and without memory safety violations, is the algorithm able to verify memory safety or find violations of it in the corresponding uninterpreted program? Third, how fast is the algorithm? Note that since we do abstract the primitive types and functions/relations on real programs, it is not clear that the tool will be able to prove memory safe programs as so.
To answer the first and second questions, we wrote natural heap-manipulating programs over singly-linked lists (sorted and unsorted) and tree data-structures (binary search trees, AVL trees, rotations of trees) in our input language, and evaluated the tool on them to determine if they were streaming-coherentand to test for memory safety.
The first column of Table 1 gives the set of programs in our benchmark. These are typically single-pass algorithms over an input data-structure. For example, finding a key in a binary search tree or in-place reversal of a linked list are single pass algorithms.
The names of the programs indicate whether or not the program truly contains an unsafe memory access (i.e., the ground truth). Programs whose names end in ‘unsafe’ were obtained by introducing one of two possible memory safety errors into their ‘safe’ counterparts: (i) attempts to read or write to a location that is unallocated, and (ii) freeing unallocated memory locations.
One example of the first kind is illustrated in sll-copy-all, which copies the contents of a linked list into a freshly allocated list. In this example, the program steps through the input list in a loop until it reaches . In each iteration, a new node is allocated, initialized with the contents of the current node, and connected to the end of the new list. The program relies on the invariant that the new list has a next node to step to whenever the old list does. Thus, it does not perform a check when advancing along the next pointer for the new list. The sll-copy-all-unsafe fails to maintain the invariant by incorrectly adding the freshly allocated node to the new list. An example for errors of the second kind (freeing memory locations that may not be allocated) can be found in sll-deletebetween-unsafe. In this example, the task is to delete all nodes in a linked list that have key values in a certain range. The mistake in this example happens when the program has found a node to delete, but, instead of saving the next node and deleting the current node, it instead frees the next node, which may be unallocated.
7.3. Discussion of Results
Table 1 shows the results for the evaluation, which was performed on a machine running Ubuntu 18.04 with an Intel i7 processor at 2.6 GHz. Columns 3-6 pertain to the operation of the algorithm on the benchmarks. Column 3 indicates whether or not the benchmark fails the streaming-coherence condition. Our tool terminates and identifies memory safety and violation of memory safety on all streaming-coherent programs. Column 4 depicts whether or not an unsafe memory access was detected. Column 5 gives the total number of states that are reachable at the end of the program. Note that non-streaming-coherence and violation of memory safety preclude each other in the table. Upon detecting either, the algorithm halts (and we do not report the number of reachable states). Column 6 gives the total running time of the tool on each benchmark, which is negligible in all cases. Note that the number of reachable states for each example is also quite small relative to the total number of possible states, which grows faster than the Bell numbers. That our algorithm only examines a small fraction of the total state space is encouraging, and suggests that it may scale well for much larger and more complex programs.
8. Related Work
Memory safety errors have attracted a lot of attention because they are serious security vulnerabilities that have been exploited to attack systems (Nagarakatte et al., 2015; Szekeres et al., 2013); they are one of the most common causes of security bugs (Microsoft, 2019). Memory safety concerns have even led to new programming languages, such as Rust (The Rust Team, 2019), that statically assure memory safety (while being efficient). Memory safety vulnerabilities of programs written in C/C++ are still of great concern, and, consequently, identifying fundamental techniques that establish decidability of the problem even for restricted classes of programs is interesting.
There is a rich literature of preventing memory safety errors at runtime by instrumenting code with runtime checks, or at compile time (see (Nagarakatte et al., 2015) and references therein, SafeC (Safe-C, 2019), CCured (Necula et al., 2002; Austin et al., 1994; Necula et al., 2005; Condit et al., 2003), Cyclone (Jim et al., 2002), SVA (Criswell et al., 2007), etc.). Static checking for memory safety is certainly possible when it is part of language design (for instance, using type systems as in Rust (Matsakis and Klock, 2014)). Dynamic analysis techniques as in (Nethercote and Seward, 2007; Rosu et al., 2009; Serebryany et al., 2012) instrument program executables and report errors as they occur during program execution. Recently, there has also been emerging interest in enforcing runtime memory safety using hardware and software support for tagged memory (Oleksenko et al., 2018; Joannou et al., 2017; lowRISC, 2019; Watson et al., 2015; Serebryany et al., 2018).
This work stems from the recent decidability result on uninterpreted coherent programs (Mathur et al., 2019a), which has also been extended to incorporate reasoning modulo theories including associativity and commutativity of functions over the data domain and ordering relations on the data domain (Mathur et al., 2019b). In our work we use automata-theoretic techniques that, over the data domain, reason about equality and function computation over the data elements. Further, for checking memory safety, our procedure tracks a subset of the allocated nodes, namely, the frontier nodes. While a considerable portion of the literature on assertion checking and memory safety for heap-manipulating programs is devoted to techniques that compromise on either soundness, completeness, or decidability, there has been some work that aims at decidable reasoning, while still preserving some form of soundness and completeness.
The work in (Alur and Černý, 2011) reduces assertion checking of single-pass list-processing programs to questions on data string transducers that work over sequences of tagged data values. The decidablity is mainly a consequence of the fact that there is a single variable that advances in a single-pass fashion. In contrast, our work defines a more general notion of single-pass programs using the streaming-coherence restriction that still allows for multiple variables to support pointer updates. Further, the work in (Alur and Černý, 2011) does not directly apply to verifying memory safety of programs as it does not explicitly handle freeing of memory locations, and the operations of the transducer cannot effect changes to the shape of the heap. Another key difference is that streaming data string transducers only allow reasoning about ordering and equality over data but cannot support more complex reasoning such as the congruence arising from function computation.
The work in (Bouajjani et al., 2006) proposes a class of list programs for which verification is decidable, and crucially relies on the idea of representing fragments of the allocated heap by a bounded number of segments and summaries about them, which is one among many other works (Sagiv et al., 1999; Bardin et al., 2004; Balaban et al., 2005; Berdine et al., 2004; Manevich et al., 2005; Dor et al., 2000) that employ a similar approach. These works can handle limited reasoning with data such as total orders on the data domain, but again, do not support predicates like equality on data or function congruences resulting from equalities, and further, often address questions specific to lists. Our work, on the other hand, tackles the more general problem of the verification of uninterpreted heap programs and instantiates the alias-aware condition to the class of forest data-structures. The key idea of tracking the frontier heap locations, which we use for obtaining decidability of streaming-coherent programs, appears orthogonal to this line of work.
The work in (Bozga and Iosif, 2007) differs fundamentally from ours in that pointer updates are forbidden, which is a salient feature of our work. Pointer updates are at the heart of the difficulty in building a theory of uninterpreted programs working over heaps. Additionally, that work forbids nesting of loops as well as conditionals within loops, a restriction also used to obtain decidability in earlier work (Godoy and Tiwari, 2009) on uninterpreted programs; there is no such syntactic restriction on the class of programs we introduce in this paper.
The work in (Bouajjani et al., 2005) over-approximates the set of heap configurations associated with a given program location as a regular language over a finite alphabet. The program transformations are represented by finite state transducers, and the work employs an abstraction refinement approach for verifying heap-manipulating programs. Since such abstraction refinement loops may not always terminate for arbitrary programs, the proposed approach is only a semi-decision procedure. Further, this work does not support reasoning over the data sort. Other notable static analyses that employ abstraction refinement for verifying heap programs include the notable work on shape analysis (Sagiv et al., 1999; Yahav, 2001; Lev-Ami and Sagiv, 2000) and automatic predicate abstraction (Ball et al., 2001). Statically verifying memory safety using such incomplete techniques can of course, and commonly does, result in false positives.
There is a rich literature on program logics for heap verification; in particular separation logics (O’Hearn et al., 2001; Reynolds, 2002) and FO logics based on the principles of separation logic (Løding et al., 2019). Decidable fragments of such logics have also been studied (Berdine et al., 2004; Piskac et al., 2014b, a, 2013; Navarro Pérez and Rybalchenko, 2013, 2011; Berdine et al., 2006; Cook et al., 2011; Møller and Schwartzbach, 2001). However, typically, these decision procedures are for checking validity of Hoare triples, and the problem of generating loop invariants is often undecidable, as is the problem of completely automatic verification of programs against specifications expressed in these logics. Some invariant generation techniques have been discovered for problems in this domain (Calcagno et al., 2011; Neider et al., 2018), but are, of course, inherently incomplete.
9. Conclusions and Future Work
This paper establishes a foundational result for decidably verifying assertions in programs that update maps for a subclass that is alias-aware and coherent. We have used this general result to develop decidable verification of memory safety for a class of programs, called streaming-coherent programs, working on forest data-structures. We also proved membership of programs in this class is decidable. We showed through a prototype implementation of our decision procedure, and its evaluation on a set of single-pass algorithms, that forest data-structures typically fall in our decidable class, and that we can verify memory safety accurately for them.
The most compelling future direction is to adapt the technique in this paper to provide a memory safety analysis tool for a standard programming language (such as C/C++), handling the rest of the programming language using abstractions (e.g., arrays, allocation of varying blocks of memory, etc.). We believe that our automata-based algorithm will scale well. Realizing the techniques presented herein in a full-fledged memory safety analysis tool would be interesting.
On the theoretical front, there are several interesting directions. First, we could ask how to generalize our results to go beyond streaming-coherent programs on forest data-structures. Although forest data-structures are fairly common as initial heaps for many programs, finding a natural class of heap structures beyond forest data-structures where alias-awareness can be easily established seems an interesting, challenging problem. We believe that data-structures such as doubly-linked lists and trees with parent pointers, and more generally, data-structures that have bounded tree-width with uniform tree decompositions may be amenable to our technique. Tackling multi-pass algorithms on forest data-structures is also an interesting open direction. We believe the best way to look at our work in a larger verification context is that single-pass streaming-coherent programs are the new basic blocks that can be completely automatically analyzed, despite the fact that they contain loops. Putting these blocks together to handle programs with multiple passes over data-structures, even in an incomplete fashion, is an interesting future direction.
Another dimension for exploration is to consider more general post-conditions that can be proved automatically and that go beyond simple assertions. One of the limitations of our work is that, though we have an implicit precondition that demands that data-structures are forests, we do not establish that the data-structure in the post state is also a forest. The ability to establish such a property will allow us to maintain the forest property of data-structures as an invariant across multiple calls from clients that manipulate a data-structure using a library of methods, by showing that each of the methods provably maintains this invariant.
Acknowledgements.
We thank the anonymous reviewers of POPL for several comments that helped improve the paper. Umang Mathur is partially supported by a Google PhD Fellowship. This material is based upon work supported by the National Science Foundation under Grants NSF CCF 1901069 and NSF CCF 1527395.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Alur and Černý (2011) Rajeev Alur and Pavol Černý. 2011. Streaming Transducers for Algorithmic Verification of Single-pass List-processing Programs. SIGPLAN Not. 46, 1 (Jan. 2011), 599–610. https://doi.org/10.1145/1925844.1926454 · doi ↗
- 3Austin et al . (1994) Todd M. Austin, Scott E. Breach, and Gurindar S. Sohi. 1994. Efficient Detection of All Pointer and Array Access Errors. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (PLDI ’94) . ACM, New York, NY, USA, 290–301. https://doi.org/10.1145/178243.178446 · doi ↗
- 4Balaban et al . (2005) Ittai Balaban, Amir Pnueli, and Lenore D. Zuck. 2005. Shape Analysis by Predicate Abstraction. In Proceedings of the 6th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI’05) . Springer-Verlag, Berlin, Heidelberg, 164–180. https://doi.org/10.1007/978-3-540-30579-8_12 · doi ↗
- 5Ball et al . (2001) Thomas Ball, Rupak Majumdar, Todd Millstein, and Sriram K. Rajamani. 2001. Automatic Predicate Abstraction of C Programs. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (PLDI ’01) . ACM, New York, NY, USA, 203–213. https://doi.org/10.1145/378795.378846 · doi ↗
- 6Bardin et al . (2004) Sébastien Bardin, Alain Finkel, and David Nowak. 2004. Toward symbolic verification of programs handling pointers. (2004).
- 7Berdine et al . (2004) Josh Berdine, Cristiano Calcagno, and Peter W. O’Hearn. 2004. A Decidable Fragment of Separation Logic. In Proceedings of the 24th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’04) . Springer-Verlag, Berlin, Heidelberg, 97–109. https://doi.org/10.1007/978-3-540-30538-5_9 · doi ↗
- 8Berdine et al . (2006) Josh Berdine, Cristiano Calcagno, and Peter W. O’Hearn. 2006. Smallfoot: Modular Automatic Assertion Checking with Separation Logic. In Formal Methods for Components and Objects , Frank S. de Boer, Marcello M. Bonsangue, Susanne Graf, and Willem-Paul de Roever (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 115–137.
