Differentially Testing Soundness and Precision of Program Analyzers

Christian Klinger; Maria Christakis; Valentin W\"ustholz

arXiv:1812.05033·cs.SE·December 18, 2018

Differentially Testing Soundness and Precision of Program Analyzers

Christian Klinger, Maria Christakis, Valentin W\"ustholz

PDF

1 Repo

TL;DR

This paper introduces an automated method for systematically comparing the soundness and precision of various program analyzers, revealing issues in many existing tools and aiding their improvement.

Contribution

It presents the first automated differential testing technique for program analyzers, enabling systematic evaluation of their soundness and precision.

Findings

01

Most analyzers had soundness or precision issues

02

The technique detected issues across six state-of-the-art analyzers

03

Implications for analyzer design and usage were analyzed

Abstract

In the last decades, numerous program analyzers have been developed both by academia and industry. Despite their abundance however, there is currently no systematic way of comparing the effectiveness of different analyzers on arbitrary code. In this paper, we present the first automated technique for differentially testing soundness and precision of program analyzers. We used our technique to compare six mature, state-of-the art analyzers on tens of thousands of automatically generated benchmarks. Our technique detected soundness and precision issues in most analyzers, and we evaluated the implications of these issues to both designers and users of program analyzers.

Tables7

Table 1. Table 1 : Relative precision for the tested program analyzers.

	CBMC	CPAchecker	Crab	SeaHorn	SMACK	UAutomatizer
CBMC	$1.00$	$0.01$	$0.01$	$0.01$	$0.01$	$0.01$
CPAchecker	$0.71$	$1.00$	$0.99$	$0.95$	$0.97$	$0.99$
Crab	$0.37$	$0.63$	$1.00$	$0.62$	$0.88$	$0.68$
SeaHorn	$0.77$	$0.98$	$< 1.00$	$1.00$	$0.99$	$< 1.00$
SMACK	$0.72$	$0.53$	$0.76$	$0.53$	$1.00$	$0.61$
UAutomatizer	$0.57$	$0.82$	$0.89$	$0.81$	$0.93$	$1.00$

Table 2. Table 2 : Comparison of results for SMACK and Crab.

	unsafe	unknown	safe
unsafe	$14$ $%$	$22$ $%$	$> 0$ $%$
unknown	$30$ $%$	$0$ $%$	$7$ $%$
safe	$3$ $%$	$0$ $%$	$23$ $%$

Table 3. Table 3 : Comparison of Crab Octagons and Polyhedra.

	unsafe	unknown	safe
unsafe	$43$ $%$	$1$ $%$	$> 0$ $%$
unknown	$> 0$ $%$	$10$ $%$	$> 0$ $%$
safe	$> 0$ $%$	$> 0$ $%$	$45$ $%$

Table 4. Table 4 : Relative precision for abstract domains of Crab.

	int	oct	pk	rtz	zones
int	$1.00$	$0.97$	$0.97$	$0.97$	$0.89$
oct	$0.99$	$1.00$	$0.99$	$0.99$	$0.99$
pk	$0.99$	$0.99$	$1.00$	$0.98$	$0.99$
rtz	$< 1.00$	$< 1.00$	$0.99$	$1.00$	$< 1.00$
zones	$1.00$	$< 1.00$	$0.99$	$0.99$	$1.00$

Table 5. Table 5 : Survey responses from professional C developers.

Task		Survey Response
Identifier		unsafe	unknown	safe
$1$	(unsafe)	$11$	$0$	$1$
$2$	(unsafe)	$12$	$0$	$0$
$3$	(unsafe)	$11$	$0$	$1$
$4$	(safe)	$2$	$0$	$10$
$5$	(safe)	$0$	$0$	$12$
$6$	(safe)	$0$	$0$	$12$
$7$	(unsafe)	$10$	$1$	$1$
$8$	(unsafe)	$10$	$1$	$1$
$9$	(unsafe)	$10$	$1$	$1$

Table 6. Table 6 : Effect of search strategies on the number of detected issues.

Search Strategy	Number of Detected Issues
	must-unsound	$\geq 𝟑$ -unsound	$\geq 𝟑$ -imprecise
Uniform-Random	$2$	$2$	$71$
Breadth-Biased	$1$	$2$	$70$
Depth-Biased	$1$	$1$	$77$
Random-Walk	$0$	$0$	$59$
Guided-Walk	$1$	$2$	$39$

Table 7. Table 7 : Effect of batch-check size on the number of detected issues.

Batch-Check Size	Number of Detected Issues
	must-unsound	$\geq 𝟑$ -unsound	$\geq 𝟑$ -imprecise
$1$	$2$	$2$	$71$
$2$	$6$	$6$	$110$
$4$	$3$	$5$	$158$
$8$	$10$	$10$	$193$
$16$	$9$	$10$	$165$
$32$	$11$	$22$	$201$

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Practical-Formal-Methods/adiff
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Differentially Testing

Soundness and Precision of Program Analyzers

Christian Klinger

*Saarland University, Germany

*[email protected]

Maria Christakis

*MPI-SWS, Germany

*[email protected]

Valentin Wüstholz

*ConsenSys, Germany

*[email protected]

Abstract

In the last decades, numerous program analyzers have been developed both by academia and industry. Despite their abundance however, there is currently no systematic way of comparing the effectiveness of different analyzers on arbitrary code. In this paper, we present the first automated technique for differentially testing soundness and precision of program analyzers. We used our technique to compare six mature, state-of-the art analyzers on tens of thousands of automatically generated benchmarks. Our technique detected soundness and precision issues in most analyzers, and we evaluated the implications of these issues to both designers and users of program analyzers.

I Introduction

Program analyzers are very effective in detecting code issues and, especially in recent years, they are increasingly applied in industry to detect defects in real-world software. Notable examples of such analyzers are Facebook’s Infer [1, 2], which is used to find resource leaks and null-pointer exceptions in Android and Java applications, Microsoft’s SAGE [3], which is credited to have detected about one third of all security bugs that were discovered with fuzzers during the development of Windows 7 [4], and AbsInt’s Astrée [5, 6], which can prove the absence of runtime errors and invalid concurrent behavior in safety-critical software, such as that of Airbus.

Despite the abundance of program analyzers that have been produced both by academia and industry, there is currently no systematic way of comparing their effectiveness on arbitrary code. To compare the soundness and precision of a set of analyzers, one could try them on a number of programs to get a feel for their false positive or false negative rates. However, just classifying the generated warnings as false or true positives would require considerable human effort, let alone determining whether any bugs are missed.

Alternatively, one could rely on the outcome of software verification competitions such as SV-COMP [7], which compares program analyzers based on an established collection of verification tasks. Although verification competitions are extremely valuable in increasing the visibility of new analyzers, providing a discussion forum for state-of-the-art techniques, and maintaining a set of programs with explicit properties to be checked, the verification tasks are rather stable. As a consequence, program analyzers can be designed to perform well in such competitions by specifically tailoring their techniques to the given benchmarks. In addition, the explicit checks might not be representative of properties in real-world software.

Our approach. To address these issues, we present the first automated technique for differentially testing program analyzers on arbitrary code. Given a set of seed programs, our approach automatically generates program-analysis benchmarks and compares the soundness and precision of the analyzers on these benchmarks. As a result, the effectiveness of the different program analyzers is evaluated in a systematic and automated way, the benchmarks are hardly predictable, and the explicit checks can be parameterized to express several types of properties, for instance, numerical, non-nullness, or points-to properties.

However, as for existing differential testing techniques, it is challenging to automatically derive the ground truth, for example, which warnings are indeed true positives or which errors are missed. We address this challenge by leveraging Engler et al.’s “bugs-as-deviant-behavior” strategy [8]. Specifically, when most program analyzers agree that a certain property does not hold, our approach detects a potential soundness issue in the deviant analyzers, which find that the property does hold. Conversely, we detect a potential precision issue when a few analyzers claim that a property does not hold, while the majority of analyzers verify the property.

The work most closely related to ours is by Kapus and Cadar, who use random program generation and differential testing to find bugs in symbolic execution engines [9]. In contrast to this work, our approach focuses on detecting soundness and precision issues in any program analyzer, potentially including a test generator based on symbolic execution. Moreover, our technique automatically generates program-analysis benchmarks from a given set of seed programs, possibly containing code that is difficult to handle by program analysis. In general, it is very challenging to randomly generate programs from scratch such that they reveal soundness and precision issues in mature analyzers, which is why our technique leverages the seed programs.

Overall, we expect our approach to guide users in making informed choices when selecting a program analyzer. However, this is not to say that the best analyzer is the most sound; users have varying needs depending on how critical the correctness of their code is and where in the software development cycle they are (e.g., before a code review or product release) [10]. We also expect our technique to assist designers of analyzers in detecting soundness and precision issues of their implementation, and to help enrich the collection of tasks used in verification competitions by automatically generating challenging, yet less predictable, benchmarks.

Contributions. We make the following contributions:

–

We present the first automated technique for differentially testing soundness and precision of program analyzers on arbitrary code.

–

We implement this technique in a tool architecture that compares analyzers on C programs and may be instantiated with any program analyzer for C.

–

We used our technique to compare six state-of-the-art program analyzers on about 26,000 programs. We found soundness and precision issues in four (out of six) analyzers, and we evaluated their significance to both program-analysis designers and users.

Terminology. Since terminology varies across different program-analysis techniques, we introduce the following terms that we use throughout the paper. Sound program analysis over-approximates the set of possible executions in a given program in order to prove the absence of errors in the program. Due to its over-approximation, sound analysis may generate false positives, that is, spurious warnings about executions that are not erroneous or even possible in the program. In contrast, a true positive is a warning about an actual error in the program. Unsound program analysis may over-approximate certain program executions and under-approximate others in order to find bugs, rather than prove their absence. In case an unsound analysis fails to detect an error in the program, we say that it generates a false negative.

An imprecise program analysis abstracts certain program executions such that it considers more executions than those feasible in the program. Although an imprecise analysis might generate false positives, it is not necessarily sound. On the other hand, a precise program analysis uses abstractions that closely describe executions in the program to generate as few false positives as possible.

Outline. The next section gives an overview of our approach through a running example. Sect. III explains the technical details of our approach, while Sect. IV describes our implementation. We present our experimental evaluation and threats to the validity of our results in Sect. V. We discuss related work in Sect. VI and conclude in Sect. VII.

II Overview

In this section, we illustrate the workflow and tool architecture of our differential testing technique for program analyzers, shown in Fig. 1.

Workflow. Our technique, which is implemented in a tool called $\alpha$ -Diff (pronounced “alpha-diff”), takes as input one or more correct seed programs that do not contain any (explicit or implicit) assertions. Because these programs are both correct and assertion-free, program analyzers (even if sound and imprecise) do not generate any warnings for them.

Next, $\alpha$ -Diff parses a seed program, and based on one of its search strategies (Sect. III-A), a program location is selected. At this location, $\alpha$ -Diff synthesizes and introduces a check, in the form of an assertion, expressing a property of interest (e.g., a numerical property) and involving variables that are in scope at the program location (Sect. III-A). We call the resulting program a variant of the seed program.

On this program variant, which now contains a single assertion, $\alpha$ -Diff runs a set of program analyzers. The results of the analyzers, that is, the presence or absence of any generated warnings for the assertion, are recorded.

Subsequently, $\alpha$ -Diff selects a new program location in the same seed program and repeats the process until a given budget (i.e., number of synthesized checks for a particular seed program) is depleted. The tool then continues to parse another seed program, if any.

When all seed programs have been instrumented and analyzed until the budget, $\alpha$ -Diff compares the recorded results and reports any detected soundness and precision issues in the analyzers (Sect. III-B).

Example. Fig. II shows a simplified version of an SV-COMP benchmark that we used in our experimental evaluation as a seed program to test six program analyzers. (Lines 4 and 7 should be ignored for now.) Note that this program is correct and does not contain any assertions. Therefore, when running all analyzers on this code, no warnings are generated.

Bibliography65

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] C. Calcagno, D. Distefano, J. Dubreil, D. Gabi, P. Hooimeijer, M. Luca, P. W. O’Hearn, I. Papakonstantinou, J. Purbrick, and D. Rodriguez, ‘‘Moving fast with software verification,’’ in NFM , ser. LNCS, vol. 9058. Springer, 2015, pp. 3--11.
2[2] C. Calcagno and D. Distefano, ‘‘Infer: An automatic program verifier for memory safety of C programs,’’ in NFM , ser. LNCS, vol. 6617. Springer, 2011, pp. 459--465.
3[3] P. Godefroid, M. Y. Levin, and D. A. Molnar, ‘‘Automated whitebox fuzz testing,’’ in NDSS . The Internet Society, 2008, pp. 151--166.
4[4] E. Bounimova, P. Godefroid, and D. A. Molnar, ‘‘Billions and billions of constraints: Whitebox fuzz testing in production,’’ in ICSE . IEEE Computer Society/ACM, 2013, pp. 122--131.
5[5] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival, ‘‘A static analyzer for large safety-critical software,’’ in PLDI . ACM, 2003, pp. 196--207.
6[6] P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival, ‘‘The ASTRÉE analyzer,’’ in ESOP , ser. LNCS, vol. 3444. Springer, 2005, pp. 21--30.
7[7] D. Beyer, ‘‘Competition on software verification (SV-COMP),’’ 2017, https://sv-comp.sosy-lab.org .
8[8] D. R. Engler, D. Y. Chen, and A. Chou, ‘‘Bugs as deviant behavior: A general approach to inferring errors in systems code,’’ in SOSP . ACM, 2001, pp. 57--72.