Darwinian Data Structure Selection

Michail Basios; Lingbo Li; Fan Wu; Leslie Kanthan; Earl Barr

arXiv:1706.03232·cs.SE·August 2, 2018

TL;DR

The paper introduces ARTEMIS, a cloud-based optimization framework that automatically selects and tunes data structures to significantly enhance application performance and resource efficiency across various Java projects.

Contribution

It presents a novel multi-objective search-based approach for automatic data structure selection and tuning, demonstrating substantial performance gains in real-world Java applications.

Findings

01

At least one solution improves all measures in 86% of projects.

02

Median improvements are 4.8% in runtime, 10.1% in memory, 5.1% in CPU.

03

Significant improvements in popular libraries like gson and xalan.

Abstract

Data structure selection and tuning is laborious but can vastly improve an application's performance and memory footprint. Some data structures share a common interface and enjoy multiple implementations. We call them Darwinian Data Structures (DDS), since we can subject their implementations to survival of the fittest. We introduce ARTEMIS, a multi-objective, cloud-based search-based optimisation framework that automatically finds optimal, tuned DDS modulo a test suite, then changes an application to use that DDS. ARTEMIS achieves substantial performance improvements for every project in $5$ Java projects from the DaCapo benchmark, $8$ popular projects and $30$ uniformly sampled projects from GitHub. For execution time, CPU usage, and memory consumption, ARTEMIS finds at least one solution that improves all measures for $86\%$ ( $37/43$ ) of the projects. The median improvement…

Tables

Table 1. Data structure groups.

Abstract Data Type	Implementation
List	ArrayList, LinkedList
Map	HashMap, LinkedHashMap
Set	HashSet, LinkedHashSet
Concurrent List	Vector, CopyOnWriteArrayList
Concurrent Deque	ConcurrentLinkedDeque, LinkedBlockingDeque
Thread Safe Queue	ArrayBlockingQueue, SynchronousQueue, LinkedBlockingQueue, DelayQueue, ConcurrentLinkedQueue, LinkedTransferQueue

Table 2. DDS changes for optimal solutions across all measures.

Transformation	Time	Memory	CPU
HashMap -> LinkedHashMap	60	53	57
LinkedList -> ArrayList	16	13	18
HashSet -> LinkedHashSet	22	21	21
LinkedBlockingQueue -> LinkedTransferQueue	1	2	2
ArrayList -> LinkedList	91	86	87
LinkedHashSet -> HashSet	7	8	5
Vector -> CopyOnWriteArrayList	1	0	2
LinkedHashMap -> HashMap	17	23	19

Equations

$\begin{cases}P[(n.c(x_{i}))^{k}/(n.c(x_{j}))^{k}], & \text{if } \exists d_{i},d_{j} \text{ s.t. } \mathit{adte}(d_{i})\neq\mathit{adte}(d_{j})\\ P[(n,d_{i})^{k}/(n,d_{j})^{k}][(n.c(x_{i}))^{k}/(n.c(x_{j}))^{k}], & \text{otherwise}\end{cases}$

$\underset{(n_{i},d_{i})^{k}\in\Gamma_{D}^{k},\ d_{j}^{k}\in\mathit{adte}(d_{i},C)^{k},\ x_{j}\in\tau}{\arg\max}\ f(\phi(P,(n_{i},d_{i}),d_{j},x_{j}))$

$\mathit{DDSS}=\bigcup_{a\in A}\mathit{dse}(a,C).$

$\prod_{d\in D}I(d)*|\mathit{dom}(d.c)|$, where $d.c$ is $d$’s constructor.

Full text

Darwinian Data Structure Selection

Michail Basios, Lingbo Li, Fan Wu, Leslie Kanthan, Earl T. Barr

University College London, UK

michail.basios,lingbo.li,fan.wu,l.kanthan,[email protected]

(2018)

Abstract.

Data structure selection and tuning is laborious but can vastly improve an application’s performance and memory footprint. Some data structures share a common interface and enjoy multiple implementations. We call them Darwinian Data Structures (DDS), since we can subject their implementations to survival of the fittest. We introduce artemis, a multi-objective, cloud-based search-based optimisation framework that automatically finds optimal, tuned DDS modulo a test suite, then changes an application to use that DDS. artemis achieves substantial performance improvements for every project in $5$ Java projects from the DaCapo benchmark, $8$ popular projects and $30$ uniformly sampled projects from GitHub. For execution time, CPU usage, and memory consumption, artemis finds at least one solution that improves all measures for $86\%$ ( $37/43$ ) of the projects. The median improvement across the best solutions is $4.8\%$, $10.1\%$, $5.1\%$ for runtime, memory and CPU usage.

These aggregate results understate artemis’s potential impact. Some of the benchmarks it improves are libraries or utility functions. Two examples are gson, a ubiquitous Java serialization framework, and xalan, Apache’s XML transformation tool. artemis improves gson by $16.5\%$, $1\%$ and $2.2\%$ for memory, runtime, and CPU; artemis improves xalan’s memory consumption by $23.5\%$. Every client of these projects will benefit from these performance improvements.

Search-based Software Engineering, Genetic Improvement, Software Analysis and Optimisation, Data Structure Optimisation

Published in: Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’18), November 4–9, 2018, Lake Buena Vista, FL, USA. Copyright: ACM licensed. Price: 15.00. DOI: 10.1145/3236024.3236043. ISBN: 978-1-4503-5573-5/18/11. CCS concepts: Software and its engineering, Software evolution.

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

— Donald E. Knuth (Knuth:1974:SPG:356635.356640, )

1. Introduction

Under the immense time pressures of industrial software development, developers are heeding one part of Knuth’s advice: they are avoiding premature optimisation. Indeed, developers appear to be avoiding optimisation altogether and neglecting the “critical $3\%$”. When selecting data structures from libraries, in particular, they tend to rely on defaults and neglect potential optimisations that alternative implementations or tuning parameters can offer. This is despite the impact that data structure selection and tuning can have on application performance and defects. Consider three examples: selecting an implementation that creates unnecessary temporary objects for the program’s workload (xu2009go, ); selecting a combination of Scala data structures that scaled better, reducing execution time from $45$ to $1.5$ minutes (story2, ); and avoiding the use of poor implementations, such as those in the Oracle bug database that leak memory (xu2008precise, ).

Optimisation is time-consuming, especially on large code bases. It is also brittle. An optimisation for one version of a program can break or become a de-optimisation in the next release. Another reason developers may avoid optimisation is the prevalence of development fads that focus on fast solutions, like “Premature Optimisation is the horror of all Evil” and “Hack until it works” (story3, ). In short, optimisation is expensive and its benefits unclear for many projects. Developers need automated help.

Data structures are a particularly attractive optimisation target because they have a well-defined interface; many are tunable; and different implementations of a data structure usually represent a particular trade-off between time and storage, making some operations faster but more space-consuming or slower but more space-efficient. For instance, an ordered list makes retrieving the entire dataset in sorted order fast, but inserting new elements slow, whilst a hash table allows for quick insertions and retrievals of specific items, but listing the entire set in order is slow. We introduce Darwinian data structures, distinct data structures that are interchangeable because they share an abstract data type and can be tuned. The Darwinian data structure optimisation problem is the problem of finding an optimal implementation and tuning for a Darwinian data structure used in an input program.

We aim to help developers perform optimisation cheaply, focusing on solving the data structure optimisation problem. We present artemis, a cloud-based optimisation framework that identifies Darwinian data structures and, given a test suite, automatically searches for optimal combinations of implementations and parameters for them. artemis is language-agnostic; we have instantiated it for Java and C++, and present optimisation results for both languages (Section 5). artemis’ search is multi-objective, seeking to simultaneously improve a program’s execution time, memory usage, and CPU usage while passing all the test suites. artemis scales to large code bases because it uses a genetic algorithm on those regions of its search space with the most solutions (Section 4.4). artemis is the first technique to apply multi-objective optimisation to the Darwinian data structure selection and tuning problem.

artemis promises to change the economics of data structure optimisation. Given a set of Darwinian data structures, artemis can search for optimal solutions in the background on the cloud, freeing developers to focus on new features. artemis makes small optimisations economical, such as gains of a few percent that would not pay for the developer time spent realising them. And sometimes, of course, artemis, by virtue of being used, will find unexpectedly big performance gains.

artemis is a source-to-source transformer. When artemis finds a solution, the program variants it produces differ from the original program only at constructor calls and relevant type annotations. Thus, artemis’ variants are amenable, by design, to programmer inspection and do not increase technical debt (brown2010managing, ). To ease inspection, artemis generates a diff for each change it makes. Developers can inspect these diffs and decide which to keep and which to reject.

We report results on $8$ popular, diverse GitHub projects, on the DaCapo benchmark, which was constructed to be representative, and on a corpus of $30$ GitHub projects, filtered to meet artemis’s constraints and sampled uniformly at random. In this study, artemis achieves substantial performance improvements for all $43$ projects in its corpus. In terms of execution time, CPU usage, and memory consumption, artemis finds at least one solution for $37$ out of $43$ projects that improves all measures. Across all produced optimal solutions, the median improvement for execution time is $4.8\%$, memory consumption $10.1\%$ and CPU usage $5.1\%$. This result is for various corpora, but it is highly likely to generalise to arbitrary programs because of the care we took to build a representative corpus (Section 5.1).

These aggregate results understate artemis’s potential impact. Some of our benchmarks are libraries or utilities. All of their clients will enjoy any improvements artemis finds for them. Three examples are the Apache project’s powerful XSLT processor xalan, Google-http-java-client, the ubiquitous Java library for accessing web resources, and Google’s in-memory file system Jimfs. Section 5 shows that artemis improved xalan’s memory consumption by $23.5\%$, while leaving its execution time unchanged; artemis improved Google-http-java-client’s execution time by $46\%$ and its CPU usage by $39.6\%$; finally, artemis improved Jimfs’s execution time by $14.2\%$ and its CPU usage by $10.7\%$, while leaving its memory consumption unchanged.

Our principal contributions follow:

• We formalise the Darwinian data structure selection and optimisation problem DS2 (Section 3).

• We implement artemis, a multi-language optimisation framework that automatically discovers and optimises sub-optimal data structures and their parameters.

• We show the generalizability and effectiveness of artemis by conducting a large empirical study on a corpus comprising $8$ popular GitHub projects, $5$ projects from the DaCapo benchmark, and $30$ GitHub projects, filtered then sampled uniformly. For all $43$ subjects, artemis finds variants that outperform the original for all three objectives.

• We provide artemis as a service, along with its code and evaluation artifacts at http://darwinianoptimiser.com.

2. Motivating example

Listing 1 contains a code snippet from google-http-java-client (https://github.com/google/google-http-java-client), a popular Java library for efficiently accessing resources on the web. In Listing 1, getAsList packages HTTP headers and is invoked frequently from other methods because they use it every time they construct an HTTP request. Thus, its functionality is important for the performance of google-http-java-client.

Listing 1 uses ArrayList to instantiate the result variable. However, other List implementations share the same functionality but have different non-functional properties. Thus, replacing ArrayList with other List implementations may affect the performance of the program. Consider the variant created by replacing ArrayList (Listing 1, line 4) with LinkedList: when we compare it with the original program against the same test set for $30$ runs (Section 4), we see that google-http-java-client achieves a median $46\%$ improvement in execution time, with 95% confidence interval [ $45.6\%$ , $46.3\%$ ] (Section 5).

artemis, our optimisation framework, automatically discovers underperforming data structures and replaces them with better choices using search-based techniques (Section 4.4). First, it automatically creates a store of data structures from the language’s Collection API library (Section 4.1). Then, artemis traverses the program’s AST to identify which of those data structures are used and exposes them as parameters to artemis’s optimiser (Section 4.4) by transforming line 4 into

List<T> result = new D<T>();

where $D$ is the tag that refers to the exposed parameter associated with the defined data structure type (Section 4).

Listing 1 does not specify the initial capacity of the ArrayList, so the default capacity of $10$ is used. If the instantiated List object contains fewer than $10$ items, the default capacity can result in memory bloat. If the List object contains more than $10$ items, the default capacity can slow execution, because more memory reallocation operations will happen. Therefore, an appropriate value must be chosen to find a good tradeoff between memory and execution time.

artemis automatically exposes such arguments as tunable parameters, then adjusts them to improve the runtime performance of the program. For instance, artemis changes line 4 to the code below:

List<T> l = new ArrayList<>(S);

where $S$ refers to the exposed parameter associated with the amount of pre-allocated memory.
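The two rewrites above can be mimicked by hand. The following is a minimal sketch, not artemis's generated code: it treats the tag $D$ and the size $S$ as ordinary parameters of a small factory. The names `ExposedListParams`, `ListChoice` and `newList` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration: the tag D and the size S from the text become
// explicit parameters that a search procedure could vary.
class ExposedListParams {
    enum ListChoice { ARRAY_LIST, LINKED_LIST }

    // d plays the role of D (which implementation to instantiate);
    // s plays the role of S (initial capacity, which only ArrayList accepts).
    static <T> List<T> newList(ListChoice d, int s) {
        switch (d) {
            case ARRAY_LIST:
                return new ArrayList<>(s);
            case LINKED_LIST:
                return new LinkedList<>();
            default:
                throw new IllegalArgumentException("unknown choice: " + d);
        }
    }

    public static void main(String[] args) {
        for (ListChoice d : ListChoice.values()) {
            List<String> result = newList(d, 16);
            result.add("Accept");
            result.add("Content-Type");
            System.out.println(d + " -> " + result);
        }
    }
}
```

Because callers only see `List<T>`, switching `d` or `s` changes non-functional behaviour without touching any other code.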

3. Darwinian Data Structure Selection and Tuning

This section defines the Darwinian data structure and parameter optimisation problem we solve in this paper.

Definition 1 (Abstract Data Type).

An Abstract Data Type (ADT) is a class of objects whose logical behavior is defined by a set of values and a set of operations (dale1996abstract, ).

A data structure concretely implements an ADT. For the corpus of programs $C$ and the ADT $a$ , the data structure extraction function $\mathit{dse}(a,C)$ returns all data structures that implement $a$ in $C$ . This function is integral to the definition that follows.

Definition 2 (Darwinian Data Structure).

When $\exists d_{0},d_{1}\in\mathit{dse}(a,C)\wedge d_{0}\neq d_{1}\wedge d_{0}$ and $d_{1}$ are observationally equivalent modulo $a$ , $d_{0}$ and $d_{1}$ are Darwinian data structures.

In words, Darwinian data structures are Darwinian in the sense that they can be replaced to produce program mutants whose fitness we can evaluate. The ADT $a$ has Darwinian data structures when more than one of its data structures are equivalent over the operations the ADT defines. In Java, List is an ADT and ArrayList, which implements it, is a data structure. LinkedList also implements List, so both ArrayList and LinkedList are Darwinian. For the ADT $a$ and the corpus $C$, Darwinian data structures are interchangeable. Thus, we can search the variants of $P\in C$ formed by substituting one Darwinian data structure for another to improve $P$’s nonfunctional properties, like execution time, memory consumption or CPU usage.
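To make "observationally equivalent modulo $a$" concrete, the sketch below runs one fixed sequence of List operations against two implementations and compares everything a client can observe. It is only an illustration under a single hand-written workload, not a proof of equivalence and not part of artemis; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of equivalence modulo the List ADT.
class ObservationalEquivalence {
    // Runs a fixed sequence of List operations and records every value
    // that a client of the ADT could observe.
    static List<Object> observe(Supplier<List<Integer>> impl) {
        List<Integer> xs = impl.get();
        List<Object> trace = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            trace.add(xs.add(i * i));              // result of each append
        }
        trace.add(xs.get(2));                      // positional read
        trace.add(xs.remove(Integer.valueOf(9)));  // removal by value
        trace.add(xs.indexOf(16));                 // search
        trace.add(xs.size());
        trace.add(new ArrayList<>(xs));            // final contents, in order
        return trace;
    }

    public static void main(String[] args) {
        List<Object> viaArrayList = observe(ArrayList::new);
        List<Object> viaLinkedList = observe(LinkedList::new);
        System.out.println("same observations: " + viaArrayList.equals(viaLinkedList));
    }
}
```

A regression test suite plays the same role at program scale: it fixes the observations that every variant must reproduce.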

Just as we needed a function to extract an ADT’s data structures from a corpus for Definition 2, we need a function that returns the ADT that a data structure implements: when $d\in\mathit{dse}(a,C)$, let $\mathit{adte}(d,C)=a$. Let $\Gamma_{D}$ bind fully scope-qualified declarations of names to Darwinian data structures in $C$. We use $\Gamma_{D}$ when creating variants of a program via substitution. We are interested not just in searching the space of Darwinian data structures, but also in tuning them via their constructor parameters. To this end, we assume without loss of generality that $a$ defines a single constructor $c$ and we let $n.c(x)$ denote calling identifier $n$’s constructor $c$ with parameters $x:\tau$.

To create a variant of $P\in C$ that differs from $P$ only in its $k$ bindings of names to Darwinian data structures or their constructor initialization parameter, we define

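$\phi(P,(n,d_{i})^{k},d_{j}^{k},x_{j})=\begin{cases}P[(n.c(x_{i}))^{k}/(n.c(x_{j}))^{k}], & \text{if } \exists d_{i},d_{j} \text{ s.t. } \mathit{adte}(d_{i})\neq\mathit{adte}(d_{j})\\ P[(n,d_{i})^{k}/(n,d_{j})^{k}][(n.c(x_{i}))^{k}/(n.c(x_{j}))^{k}], & \text{otherwise}\end{cases}$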

Definition 3 (Darwinian Data Structure Selection and Tuning).

For the real-valued fitness function $f$ over the corpus $C$ , the Darwinian data structure and tuning problem is

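$\underset{(n_{i},d_{i})^{k}\in\Gamma_{D}^{k},\ d_{j}^{k}\in\mathit{adte}(d_{i},C)^{k},\ x_{j}\in\tau}{\arg\max}\ f(\phi(P,(n_{i},d_{i}),d_{j},x_{j}))$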

This vector-based definition simultaneously considers all possible rebindings of names to Darwinian data structures in $P$; it is also cumbersome, compared to its point-substitution analogue. We could not, however, simply present a scalar definition and then quantify over all potential DDSS substitutions, as doing so would not surface synergies and antagonisms among the substitutions.
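The point about synergies can be illustrated with a toy exhaustive search over whole assignment vectors. This is a sketch under strong assumptions: the fitness function is synthetic, the space is tiny, and artemis itself uses a genetic algorithm rather than enumeration (Section 4.4). All names are hypothetical.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration: maximise a fitness f over all k-vectors of choices.
class ExhaustiveArgMax {
    // choices[i] is the number of options at site i; the result is the
    // assignment (one option index per site) with the highest fitness.
    static int[] argMax(int[] choices, ToDoubleFunction<int[]> f) {
        int[] current = new int[choices.length];
        int[] best = current.clone();
        double bestScore = f.applyAsDouble(current);
        while (next(current, choices)) {
            double score = f.applyAsDouble(current);
            if (score > bestScore) {
                bestScore = score;
                best = current.clone();
            }
        }
        return best;
    }

    // Odometer-style increment; returns false once every vector was visited.
    private static boolean next(int[] v, int[] choices) {
        for (int i = 0; i < v.length; i++) {
            if (++v[i] < choices[i]) {
                return true;
            }
            v[i] = 0;
        }
        return false;
    }

    public static void main(String[] args) {
        // Synthetic fitness: changing either site alone hurts, changing both helps.
        ToDoubleFunction<int[]> f =
            v -> (v[0] == 1 && v[1] == 1) ? 10.0 : (v[0] + v[1] == 1 ? -1.0 : 0.0);
        System.out.println(Arrays.toString(argMax(new int[] {2, 2}, f)));
    }
}
```

With this fitness, evaluating each substitution in isolation would reject both, whereas searching over vectors finds the joint improvement.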

4. Artemis

The artemis optimisation framework solves the Darwinian Data Structure Selection problem. Figure 1 illustrates the architecture with its three main components: the Darwinian data structures store generator (ddssg), the extractor, and the optimiser. artemis takes the language’s Collection API library, the user’s application source code and a test suite as input to generate an optimised version of the code with a new combination of data structures. The ddssg builds a store that contains data structure transformation rules. The extractor uses this store to discover potential data structure transformations and exposes them as tunable parameters to the optimiser (see Section 4.2). The optimiser uses a multi-objective genetic search algorithm (NSGA-II (nsgaii, )) to tune the parameters (fan2015, ; langdon2014improving, ; fan3, ; lingbo3, ; lingbo4, ) and to provide optimised solutions (see Section 4.4). A regression test suite is used to maintain the correctness of the transformations and to evaluate the non-functional properties of interest. artemis uses a built-in profiler that measures execution time, memory and CPU usage.

artemis relies on testing to define a performance search space and to preserve semantics. artemis therefore can only be applied to programs with a test suite. Ideally, this test suite would comprise both a regression test suite with high code coverage for maintaining the correctness of the program and a performance test suite to simulate the real-life behaviour of the program and ensure that all of the common features are covered (binder2000testing, ). Even though performance test suites are a more appropriate and logical choice for evaluating the non-functional properties of the program, most real-world programs on GitHub do not provide such a performance test suite. For this reason, we use the regression test suites to evaluate the non-functional properties of the GitHub projects of this study whenever a performance test suite is not available.

4.1. Darwinian Data Structure Store

artemis needs a Darwinian data structure store (ddss) from which to choose when creating variants. Let $A$ be a set of ADTs known to be Darwinian. A developer can augment this set; Figure 2 shows those that artemis knows by default. For our corpus $C$ of Java benchmarks augmented with JDK libraries over $A$ ,

$\mathit{DDSS}=\bigcup_{a\in A}\mathit{dse}(a,C).$

To build the default ddss for Java, artemis extracts and traverses each project’s class hierarchy, similar to the one illustrated in Figure 2. This hierarchy shows potential Darwinian data structures of a specific interface. When this traversal finishes, artemis extracts all the implementations of a particular Darwinian data structure; e.g., List, ArrayList, LinkedList. artemis considers these implementations mutually replaceable. For Java, a default ddss is provided by artemis, which the developer can edit. For other languages, the ddss can be provided manually by the user and this step can be skipped. The optimiser, described next, uses the store during its search.

The developer can also extend the store with custom user-supplied implementations or with implementations from other third-party libraries such as Google Guava Collections (https://github.com/google/guava), fastutil (https://github.com/vigna/fastutil) and Apache Commons Collections (https://github.com/apache/commons-collections).
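As a rough picture of what such a store holds, the sketch below hard-codes three of the groups from Table 1 as a map from an ADT to its mutually replaceable implementations. The representation and the names `DdsStoreSketch`, `defaultStore`, `alternatives` and `wellFormed` are assumptions made for illustration; the paper does not specify artemis's actual store format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a Darwinian data structure store.
class DdsStoreSketch {
    // Three groups taken from Table 1: ADT -> interchangeable implementations.
    static Map<Class<?>, List<Class<?>>> defaultStore() {
        Map<Class<?>, List<Class<?>>> store = new LinkedHashMap<>();
        store.put(List.class, Arrays.<Class<?>>asList(ArrayList.class, LinkedList.class));
        store.put(Map.class, Arrays.<Class<?>>asList(HashMap.class, LinkedHashMap.class));
        store.put(Set.class, Arrays.<Class<?>>asList(HashSet.class, LinkedHashSet.class));
        return store;
    }

    // The other members of the group that contains impl, i.e. its replacements.
    static List<Class<?>> alternatives(Map<Class<?>, List<Class<?>>> store, Class<?> impl) {
        for (List<Class<?>> group : store.values()) {
            if (group.contains(impl)) {
                List<Class<?>> others = new ArrayList<>(group);
                others.remove(impl);
                return others;
            }
        }
        return Collections.emptyList();
    }

    // Sanity check: every listed implementation really implements its ADT.
    static boolean wellFormed(Map<Class<?>, List<Class<?>>> store) {
        for (Map.Entry<Class<?>, List<Class<?>>> e : store.entrySet()) {
            for (Class<?> impl : e.getValue()) {
                if (!e.getKey().isAssignableFrom(impl)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Class<?>, List<Class<?>>> store = defaultStore();
        System.out.println(wellFormed(store));
        System.out.println(alternatives(store, LinkedList.class));
    }
}
```

Extending the store with a third-party implementation would amount to adding one more class to the relevant group.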

4.2. Discovering Darwinian Data Structures

The extractor takes as input the program $P$’s source code, identifies Darwinian data structures in $P$ modulo its store (Section 4.1), and outputs a scope-qualified list of names of Darwinian data structures and their constructor parameters (Extracted Data Structures and Parameters in Figure 1). For all $a\in\text{DDSS}$, the extractor’s output realises $\mathit{dse}(a,P)$ (Section 3). To mark potential substitutions for the transformer, the extractor outputs a templated version of the code which replaces the data structure with data structure type identifiers (Templated Source Code in Figure 1).

To find Darwinian data structures, the extractor builds an Abstract Syntax Tree (AST) from its input source code. It then traverses the AST to discover potential data structure transformations based on a store of data structures as shown in Table 1. For example, when an expression node of the AST contains a LinkedList expression, the extractor marks this expression as a potential Darwinian data structure that can take values from the available List implementations: LinkedList or ArrayList. The extractor maintains a copy of the AST, referred to as the rewriter, where it applies transformations without changing the initial AST. When the AST transformation finishes, the rewriter produces the final source code, which is saved as a new file.

4.3. Code Transformations

When implementing artemis, we encountered coding practices that vastly increase the search space. Many turn out to be known bad practices (designpattern1, ). Consider Listing 2. In lines 2 and 8, we see two LinkedList variables that the extractor marks as Darwinian and as candidates for replacement by their equivalent ArrayList implementation. In these lines, the user violates the precept to "program to the interface", here List, by declaring the variable with the concrete data structure type LinkedList rather than the ADT. This bad practice (designpattern1, ) adds dependencies to the code, limiting code reuse. Such declarations are especially problematic for artemis, because they force artemis to apply multiple transformations to replace and optimise the data structure. Further, func3 takes a LinkedList as a parameter, not List, despite the fact that it only calls the get method defined by List on this parameter. This instance of violating the "program to the interface" precept triggers a compilation error if artemis naïvely changes func1’s type. artemis transforms the code to reduce the optimiser’s search space and handle these bad practices. artemis supports three transformations: parserless, supertype, and profiler.

The parserless mode greedily changes each appearance of a Darwinian implementation. When optimising List, it exhaustively tries every implementation of List for every List variable. It is parserless, since it needs only a regular expression to identify rewritings. This makes it simple, easily portable to other languages, and fast, so it is artemis’ default. However, it can generate invalid programs and a large search space.
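A regular-expression rewrite in the spirit of the parserless mode might look like the sketch below. The pattern, the class name `ParserlessRewrite` and the restriction to two List implementations are assumptions made for this example; the paper does not give artemis's actual expression. The pattern deliberately ignores nested generic arguments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a parser-free, regex-only constructor rewrite.
class ParserlessRewrite {
    // Matches calls such as "new LinkedList<String>(" or "new ArrayList<>(".
    private static final Pattern CTOR =
        Pattern.compile("\\bnew\\s+(ArrayList|LinkedList)\\s*(<[^>()]*>)?\\s*\\(");

    // Replaces the class name in every matched constructor call.
    static String rewrite(String source, String replacement) {
        Matcher m = CTOR.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String generics = m.group(2) == null ? "" : m.group(2);
            String call = "new " + replacement + generics + "(";
            m.appendReplacement(out, Matcher.quoteReplacement(call));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Declared against the interface: the rewritten line still compiles.
        System.out.println(rewrite("List<String> a = new LinkedList<>();", "ArrayList"));
        // Declared against the concrete type: the rewritten line no longer compiles.
        System.out.println(rewrite("LinkedList<String> b = new LinkedList<String>();", "ArrayList"));
    }
}
```

The second call shows how a purely textual rewrite yields an invalid program when the declared type is the concrete class; the test suite run would discard such a variant at compile time.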

artemis’ supertype transformation converts the type of a Darwinian implementation to that of its Darwinian ADT, for example LinkedList<T> $\rightarrow$ List<T> on lines $2$, $7$, $8$ and $11$. For Listing 2, this transformation exposes only two DDS to the optimiser and produces only syntactically valid code. To implement this transformation, artemis invokes Eclipse’s refactoring functionality via its API, then validates the result. artemis aims to be language-agnostic without any additional dependencies on language-specific tools. For this case, artemis automatically performs this transformation by adding the supertype as an equivalent parameter in the store of data structures. Whenever the AST visitor traverses a variable or parameter declaration expression, it may replace the Darwinian data structure with its supertype.

"All data structures are equal, but some data structures are more equal than others" (adapted from "Animal Farm" by George Orwell): some DDS affect a program’s performance more than others, as when one stores only a few, rarely accessed items. To rank DDS, artemis profiles its input program to identify costly methods. The extractor uses this information to identify the subset of a program’s DDS worth considering for optimisation. artemis’ instrumentation is particularly important for large programs.

4.4. Search Based Parameter Tuning

The optimiser searches for a combination of data structures that improves the performance of the initial program while keeping the original functionality. Practically, we can represent all those data structures as parameters that can be tuned using Search Based Software Engineering approaches (harman2007current, ). Because the performance objectives conflict, the problem requires a multi-objective optimisation approach to search for (near) optimal solutions.

An array of integers is used to represent the tuning parameters. Each parameter refers either to a Darwinian data structure or to the initial size of that data structure. If the parameter refers to a data structure, its value represents the index in the list of Darwinian data structures. The optimiser keeps additional mapping information to distinguish the types of the parameters. For each generation, the NSGA-II applies tournament selection, followed by a uniform crossover and a uniform mutation operation. In our experiments, we designed fitness functions to capture execution time, memory consumption, and CPU usage. After fitness evaluation, artemis applies standard non-dominated selection to form the next generation. artemis repeats this process until the solutions in a generation converge. At this point, artemis returns all non-dominated solutions in the final population.
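The integer-array encoding and the two variation operators named above can be sketched as follows, together with the dominance test that underlies non-dominated selection. This is not a full NSGA-II: tournament selection, non-dominated sorting and crowding distance are omitted, and the gene bounds, mutation rate and all names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of the genome encoding and variation operators.
class GenomeSketch {
    // Uniform crossover: each gene is copied from one parent chosen at random.
    static int[] uniformCrossover(int[] p1, int[] p2, Random rnd) {
        int[] child = new int[p1.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = rnd.nextBoolean() ? p1[i] : p2[i];
        }
        return child;
    }

    // Uniform mutation: with probability rate, a gene is resampled uniformly
    // from [0, upper[i]). For an implementation gene, upper[i] is the number of
    // implementations; for a capacity gene, it is the number of allowed sizes.
    static int[] uniformMutation(int[] genome, int[] upper, double rate, Random rnd) {
        int[] out = genome.clone();
        for (int i = 0; i < out.length; i++) {
            if (rnd.nextDouble() < rate) {
                out[i] = rnd.nextInt(upper[i]);
            }
        }
        return out;
    }

    // Pareto dominance for minimised objectives (e.g. time, memory, CPU):
    // a dominates b if it is no worse everywhere and strictly better somewhere.
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) {
                return false;
            }
            if (a[i] < b[i]) {
                strictlyBetter = true;
            }
        }
        return strictlyBetter;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Genes: [implementation index, capacity index] for two definitions.
        int[] upper = {2, 64, 2, 64};
        int[] p1 = {0, 10, 0, 10};
        int[] p2 = {1, 32, 1, 4};
        int[] child = uniformMutation(uniformCrossover(p1, p2, rnd), upper, 0.25, rnd);
        System.out.println(Arrays.toString(child));
        System.out.println(dominates(new double[] {1.0, 2.0, 3.0}, new double[] {1.0, 3.0, 3.0}));
    }
}
```

Solutions that no other member of the population dominates form the non-dominated set that the final generation returns.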

Search Space size: We use a GA because the search space is huge. Let $D$ be the definitions of Darwinian data structures in program $P$. Let $I(d)$ be the number of implementations for a particular $d\in D$. The size of the search space is:

$\prod_{d\in D}I(d)*|\mathit{dom}(d.c)|$, where $d.c$ is $d$’s constructor.
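To get a feel for how fast this space grows, the sketch below multiplies, for each definition, the number of implementations by the number of admissible constructor argument values. The figures in `main` are made up for illustration and do not come from the paper's corpus; the class name is hypothetical.

```java
import java.math.BigInteger;

// Hypothetical illustration: product over definitions of
// (number of implementations) * (number of constructor argument values).
class SearchSpaceSize {
    static BigInteger size(int[] implementations, int[] constructorValues) {
        if (implementations.length != constructorValues.length) {
            throw new IllegalArgumentException("one entry per definition is required");
        }
        BigInteger total = BigInteger.ONE;
        for (int i = 0; i < implementations.length; i++) {
            total = total
                .multiply(BigInteger.valueOf(implementations[i]))
                .multiply(BigInteger.valueOf(constructorValues[i]));
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed example: 3 definitions, 2 implementations each,
        // 100 candidate initial capacities each.
        System.out.println(size(new int[] {2, 2, 2}, new int[] {100, 100, 100}));
    }
}
```

Even this small assumed example yields $200^{3}=8{,}000{,}000$ variants, each of which would need a full test-suite run to evaluate, which is why exhaustive search is impractical.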

4.5. Deployability

artemis provides optimisation as a cloud service. To use the service, developers only need to provide the source code of their project in a Maven build format and a performance test suite invoked by mvn test. artemis returns the optimised source code and a performance report. artemis exposes a RESTful API that developers can use to edit the default store of Darwinian data structures. The API also allows developers to select other Search Based algorithms; the optimiser uses NSGA-II by default. To use our tool from the command line, a simple command is used:

./artemis input-program-src

where this command defaults to artemis’s built-in ddssg. artemis writes the source of an optimised variant of its input for each measure. artemis also supports optional parameters to customise its processing.

5. Evaluation

To demonstrate the performance improvements that artemis automatically achieves and its broad applicability, we applied it to three corpora: $8$ popular GitHub projects, $5$ projects from the DaCapo benchmark, and $30$ projects, filtered to meet artemis’s requirements, then sampled uniformly at random from GitHub. To also show that artemis is language-agnostic, we applied it to optimise Guetzli (https://github.com/google/guetzli), a JPEG encoder written in C++ (Section 5.3).

5.1. Corpus

artemis requires projects with capable build systems and extensive test suites. These two requirements entail that artemis be able to build and run the project against its test suite. artemis is language-agnostic but is currently only instantiated for Java and C++, so it requires Java or C++ programs.

Our first corpus comprises eight popular GitHub projects. We selected these eight to have good test suites and be diverse. We defined popular to be projects that received at least $200$ stars on GitHub. We deemed a test suite to be good if its line coverage met or exceeded $70$ %. This corpus contains projects, usually well-written, optimised and peer code-reviewed by experienced developers. We applied artemis on those projects to investigate whether it can provide a better combination of data structures than those selected by experienced human developers.

This first corpus might not be representative, precisely because of the popularity of its benchmarks. To address this threat to validity, we turned to the DaCapo benchmarks (blackburn2006dacapo, ). The authors of DaCapo built it, from the ground up, to be representative. The goal was to provide the research community with realistic, large-scale Java benchmarks that contain a good methodology for Java evaluation. DaCapo (version $9.12$) contains $14$ open source, client-side Java benchmarks that come with extensive built-in evaluation. Each benchmark provides accurate measurements for execution time and memory consumption. DaCapo first appeared in 2006 to work with Java v.1.5 and has not been further updated to work with newer versions of Java. For this reason, we faced difficulties in compiling all the benchmarks, and the total number of benchmarks was reduced to $5$ out of $14$. In this corpus we use the following five: fop, avrora, xalan, pmd and sunflow (Figure 4).

Because of its age and the fact that we are only using a subset of it, our DaCapo benchmark may not be representative. To counter this threat, we uniformly sampled projects from GitHub. We discarded those that did not meet artemis’s constraints, like being equipped with a build system, until we collected $30$ projects. Those projects are diverse, both in domain and size. The selected projects include static analysers, testing frameworks, web clients, and graph processing applications. Their sizes vary from $576$ to $94K$ lines of code with a median of $14881$. Their popularity varies from [math] to $5642$ stars with a median of $52$ stars per project. The median number of tests is $170$ and the median line coverage ratio is $72\%$.

Collectively, we systematically built these corpora to be representative in order to demonstrate the general applicability of artemis’s optimisation framework. The full list of the programs used in this experimental study is available online on the project’s website (https://darwinianoptimiser.com/corpus).

5.2. Experimental Setup

Experiments were conducted on Microsoft Azure™ D4-v2 machines, each with one Intel E5-2673v3 CPU featuring 8 cores and 14GB of DRAM, running Ubuntu 16.04.4 LTS with Oracle JDK 1.8.0.

Performance measurements may lead to incorrect results if not handled carefully (arnold2002online, ). Thus, a statistically rigorous performance evaluation is required (georges2007statistically, ; kalibera2013rigorous, ; lingbo2, ). To mitigate instability and incorrect results, we differentiate VM start-up and steady-state. We ran our experiments in a fresh Azure VM that contained only the JVM and the subject. We use JUnit, which runs an entire test suite in a single JVM. We manually identified and dropped startup runs, then we spot-checked the results to confirm that the rest of the runs achieved a steady state and exhibited low variance. All of the means and medians we report fall within the computed interval with $95\%$ confidence. To ensure accuracy and reduce measurement bias, the program profiling period was set to $0.1$ seconds, and each generated solution was run for more than $30$ runs. We also use the Mann-Whitney U test (fay2010wilcoxon, ) to examine whether the improvement is statistically significant.
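The Mann-Whitney U test mentioned above ranks two samples of measurements against each other. The paper does not show how it computed the test, so the following is only a minimal sketch of the U statistic itself; the class name, method name and timings are hypothetical, and the conversion of U into a p-value is omitted.

```java
final class MannWhitneySketch {
    /**
     * U statistic of sample a against sample b: the number of pairs (x, y),
     * with x taken from a and y from b, in which x < y. Ties count one half.
     */
    static double uStatistic(double[] a, double[] b) {
        double u = 0.0;
        for (double x : a) {
            for (double y : b) {
                if (x < y) {
                    u += 1.0;
                } else if (x == y) {
                    u += 0.5;
                }
            }
        }
        return u;
    }

    public static void main(String[] args) {
        // Hypothetical steady-state timings in seconds, not data from the paper.
        double[] optimised = {9.1, 9.3, 9.0, 9.2};
        double[] original = {10.2, 10.0, 10.4, 10.1};
        double u = uStatistic(optimised, original);
        int pairs = optimised.length * original.length;
        System.out.println("U = " + u + " out of " + pairs + " pairs");
    }
}
```

When U equals the number of pairs, every run of the first sample is lower than every run of the second, which is the strongest evidence of a shift that the two samples can give.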

To measure the memory consumption and CPU usage of a subject program, we use the popular JConsole profiler (http://openjdk.java.net/tools/svc/jconsole/) because it directly handles JDK statistics and provides an elegant API. We extended JConsole to monitor only those system processes belonging to the test suite. We use the Maven Surefire plugin (http://maven.apache.org/components/surefire/maven-surefire-plugin/) to measure the test suite’s execution time because it reports only the execution time of each individual test, excluding the measurement overhead that other Maven plugins may introduce.

For the optimiser, we chose an initial population size of $30$ and a maximum number of $900$ function evaluations. We used tournament selection (based on ranking and crowding distance), simulated binary crossover (with crossover probability $0.8$ ) and polynomial mutation (with mutation probability $0.1$ ). We determined these settings from calibration trials to ensure the maturity of the results. Since NSGA-II is stochastic, we ran each experiment $30$ times to obtain statistically significant results.
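Tournament selection based on ranking and crowding distance can be sketched as follows. This is the usual NSGA-II crowded comparison, not code from artemis: the `Candidate` type, its field names and the tournament size of two are assumptions made for illustration.

```java
import java.util.Random;

final class CrowdedTournament {
    /** Hypothetical individual carrying only the two values the selection needs. */
    static final class Candidate {
        final int rank;                // non-domination rank, 0 is the best front
        final double crowdingDistance; // larger means a less crowded region

        Candidate(int rank, double crowdingDistance) {
            this.rank = rank;
            this.crowdingDistance = crowdingDistance;
        }
    }

    /** Prefers the lower rank; on equal rank, prefers the larger crowding distance. */
    static Candidate better(Candidate a, Candidate b) {
        if (a.rank != b.rank) {
            return a.rank < b.rank ? a : b;
        }
        return a.crowdingDistance >= b.crowdingDistance ? a : b;
    }

    /** Binary tournament: draw two candidates at random and keep the better one. */
    static Candidate select(Candidate[] population, Random rng) {
        Candidate a = population[rng.nextInt(population.length)];
        Candidate b = population[rng.nextInt(population.length)];
        return better(a, b);
    }

    public static void main(String[] args) {
        Candidate[] population = {
            new Candidate(0, 0.4), new Candidate(0, 1.2), new Candidate(1, 3.0)
        };
        Candidate winner = select(population, new Random(42));
        System.out.println("rank " + winner.rank + ", distance " + winner.crowdingDistance);
    }
}
```

Preferring the larger crowding distance among equally ranked candidates keeps the population spread along the Pareto front instead of clustering around one trade-off.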

5.3. Research Questions and Results Analysis

artemis aims to improve all objectives at the same time. Therefore the first research question we would like to answer is:

RQ1: What proportion of programs does artemis improve?

To answer RQ1, we applied artemis to our corpus. We inspected the generated optimal solutions from $30$ runs of each subject by examining the dominance relation between the optimal and initial solutions with respect to the optimisation objectives. We use the terms strictly dominates and non-dominated to describe this relation. As defined by Zitzler et al. (1197687, ), a solution strictly dominates another solution if it outperforms the latter in all measures. Two solutions are non-dominated with respect to each other if each outperforms the other in at least one of the measures.
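The two relations can be written down directly. The sketch below assumes that each solution is a vector of measures (for example execution time, memory and CPU usage) in which lower is better; the class and method names are illustrative and not part of artemis.

```java
final class Dominance {
    /** True if a is strictly lower (better) than b on every measure. */
    static boolean strictlyDominates(double[] a, double[] b) {
        requireComparable(a, b);
        for (int i = 0; i < a.length; i++) {
            if (a[i] >= b[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if each solution is better than the other on at least one measure. */
    static boolean nonDominated(double[] a, double[] b) {
        requireComparable(a, b);
        return betterOnSomeMeasure(a, b) && betterOnSomeMeasure(b, a);
    }

    private static boolean betterOnSomeMeasure(double[] a, double[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) {
                return true;
            }
        }
        return false;
    }

    private static void requireComparable(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("need equally sized, non-empty measure vectors");
        }
    }

    public static void main(String[] args) {
        // Hypothetical measures: {time, memory, CPU}, original normalised to 100.
        double[] original = {100.0, 100.0, 100.0};
        double[] allBetter = {90.0, 95.0, 99.0};
        double[] tradeOff = {90.0, 105.0, 99.0};
        System.out.println(strictlyDominates(allBetter, original));
        System.out.println(nonDominated(tradeOff, original));
    }
}
```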

For DaCapo, artemis found at least one strictly dominant solution for $4$ out of $5$ projects; it found no such solution for sunflow. It found $1072$ solutions, of which $3\%$ are strictly dominant (median is $5.5$ solutions per project) and $64\%$ are non-dominated (median is $18$ solutions per project).

For the popular GitHub projects, artemis found at least one strictly dominant solution for all $8$ projects. It found $10218$ solutions in total, of which $16\%$ are strictly dominant (median is $50$ solutions per project) and $59\%$ are non-dominated (median is $749.5$ solutions per project).

For the sampled GitHub projects, artemis found a strictly dominant solution for $25$ out of $30$ projects, but found no such solution for the projects rubix-verifier, epubcheck, d-worker, telegrambots and fqueue. It found $27503$ solutions, of which $10\%$ are strictly dominant (median is $24$ solutions per project) and $66\%$ are non-dominated (median is $125$ solutions per project). With these results, we answer $RQ1$ affirmatively:

Finding1: artemis finds optimised variants that outperform the original program in at least one measure for all programs in our representative corpus.

This finding understates artemis’s impact. Not only did it improve at least one measure for all programs, artemis found solutions that improve all measures for $86\%$ ( $37/43$ ) of the programs.

Having found that artemis finds performance improvements, we ask "How good are these improvements" with:

RQ2: What is the average improvement that artemis provides for each program?

Though artemis aims to improve all candidate’s measures, it cannot achieve that if improvements are antagonistic. In some domains, it is more important to significantly improve one of the measures than to improve slightly all measures; e.g., a high frequency trading application may want to pay the cost of additional memory overhead in order to improve the execution time. Our intuition is that the optimiser will find many solutions on the Pareto-front and at least one of them will improve each measure significantly.

We answer $RQ2$ quantitatively. We report the maximum improvement (median value with $95\%$ confidence interval) for execution time, memory and CPU usage for each subject of the three corpora. We use bar charts with error bars to plot the three measures for each program. On the Y axis, we show each measure as a percentage of its value in the input program. A value less than $100\%$ represents an improvement and a value greater than $100\%$ means degradation; e.g., $70\%$ memory consumption implies that the solution consumes $70\%$ of the memory used in the input program.
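The relation between the plotted percentage and the reported improvement is simple arithmetic. A minimal sketch, with hypothetical method names and made-up measurements:

```java
final class RelativeMeasure {
    /** The optimised variant's measure as a percentage of the original's. */
    static double percentOfOriginal(double optimised, double original) {
        if (original <= 0.0) {
            throw new IllegalArgumentException("original must be positive");
        }
        return 100.0 * optimised / original;
    }

    /** Positive values are improvements, negative values are degradations. */
    static double improvementPercent(double optimised, double original) {
        return 100.0 - percentOfOriginal(optimised, original);
    }

    public static void main(String[] args) {
        // Hypothetical memory readings in MB, not data from the paper.
        double original = 100.0;
        double optimised = 70.0;
        System.out.println(percentOfOriginal(optimised, original) + "% of the original");
        System.out.println(improvementPercent(optimised, original) + "% improvement");
    }
}
```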

Selected popular GitHub programs. Figure 3a presents the three measures of the solutions when the execution time is minimised, for each program from the popular GitHub programs. We observe that artemis improves the execution time of every program. google-http-java-client’s execution time was improved the most; its execution time was reduced by M= $46$ %, $95$ % CI [ $45.6$ %, $46.3$ %]. We also notice that this execution time improvement did not negatively affect the other measures; instead, the CPU usage was reduced by M= $41.6$ %, $95$ % CI [ $39.6$ %, $43.6$ %] and memory consumption remained almost the same. The other interesting program to notice from this graph is solo, a blogging system written in Java; its execution time improved slightly by $2$ % but its memory consumption increased by $20.2$ %. Finally, for this set of solutions, the median execution time improvement is $14.13$ %, whilst memory consumption slightly increased by $1.99$ % and CPU usage decreased by $3.79$ %. For those programs, artemis extracted a median of $12$ data structures and the optimal solutions had a median of $4$ data structure changes from the original versions of the program.

Figure 3b shows the solutions optimised for memory consumption. We notice that artemis improves the memory consumption for all programs, with a median value of $14$ %. The execution time was improved by a median value of $2.8$ % for these solutions, while the median value of CPU usage increased slightly by $0.4$ %. We notice that solo has the best improvement by M= $31.1$ %, $95$ % CI [ $29.3$ %, $33$ %], but with an increase of M= $8.7$ %, $95$ % CI [ $8.5$ %, $8.9$ %] in execution time and M= $21.3$ %, $95$ % CI [ $20.6$ %, $22$ %] in CPU usage. Graphjet, a real-time graph processing library, has the minimum improvement of M= $0.9$ %, $95$ % CI [ $0.6$ %, $1.1$ %]. The optimal solutions had a median of $4$ data structure changes per solution.

Figure 3c presents solutions optimised for CPU usage. The median CPU usage improvement is $9.7$ %. The median value of execution time improved by $5.2$ % and the median value of memory consumption improved by $2.3$ %. The program with the most significant improvement in CPU is http-java-client with M= $49.7$ %, $95$ % CI [ $48$ %, $51.4$ %], but with a decrease in memory of M= $9.8$ %, $95$ % CI [ $7.5$ %, $12.9$ %]. The optimal solutions make a median of $5$ data structure changes to the original versions of the program.

DaCapo. Figure 4 presents all solutions optimised for execution time and memory consumption for the DaCapo benchmark. We used only two measures for the DaCapo benchmark as those were the ones built in the benchmark suite. We chose not to extend or edit the profiling method of DaCapo, to avoid the risk of affecting the validity of its existing, well tested profiling process.

artemis found solutions that improve the execution time for every program without significantly affecting the memory consumption, except for project xalan, which had an improvement (M= $4.8$ %, 95% CI [ $4.6$ %, $5.7$ %]) in execution time but with an increase ( $5.8$ %, 95% CI [ $3.5$ %, $7$ %]) in memory consumption. None of the solutions optimised for memory consumption affected execution time, except for a slight increase for program fop. Finally, for this set of solutions, the median percentage of execution time improvement is $4.8$ %, and $4.6$ % for memory consumption. For this set of programs, artemis extracted a median of $18$ data structures per program, and the optimal solutions had a median of $5$ data structure changes for the execution time optimised solutions and $4$ for the memory optimised solutions.

Sampled GitHub programs. Figure 5 presents all solutions optimised for execution time, memory consumption and CPU usage for the sampled GitHub programs. As with the previous corpora, artemis found solutions that improved each measure significantly. artemis improves the median value of execution time across all projects by $4.6$ %, memory consumption by $11.4$ % and CPU usage by $4.6$ %.

artemis found solutions with antagonistic improvement for projects jafka and documents4j. artemis found a solution that improves the execution time of jafka, a distributed publish-subscribe messaging system, by M= $12$ %, $95$ % CI [ $11.2$ %, $13.6$ %], but also increases its memory consumption by M= $23.6$ %, $95$ % CI [ $21.4$ %, $25.7$ %]. It also found a solution that improves the memory consumption of documents4j (M= $38$ %, $95$ % CI [ $38$ %, $41$ %]) but introduces extra CPU usage (M= $26.1$ %, $95$ % CI [ $24.2$ %, $28$ %]). A median of $9.5$ data structures were extracted and the optimal solutions had a median of $5$ data structure changes from the original versions of the program.

Looking again at the numbers across the three corpora, we can say that they are quite consistent, showing that artemis finds optimal solutions that significantly improve the different optimisation measures. We also see that the number of Darwinian data structures extracted (between $9.5$ and $18$ ) and the number of DDS changes in the optimal solutions (between $4$ and $5$ ) are quite similar for the three corpora.

Analysing all results from the $3$ corpora we conclude the discussion of RQ2 with:

Finding2: artemis improves the median across all programs in our corpus by $4.8\%$ execution time, $10.2\%$ memory consumption, and $5.1\%$ CPU usage.

RQ3: Which Darwinian data structures does artemis find and tune?

We ask this question to understand which changes artemis makes to a program. Table 2 contains the transformations artemis applied across all optimal solutions. We see that the most common transformation for all measures is replacing ArrayList with LinkedList; it appears $91$ , $86$ and $87$ times respectively across the three measures. This transformation indicates that most developers prefer to use ArrayList in their code, which in general is considered to be faster, neglecting use cases in which LinkedList performs better; e.g., when the program has many list insertion or removal operations. Except for HashMap to LinkedHashMap, the other transformations occur relatively rarely in the optimal solutions. Last, the median number of lines artemis changes is $5$ .
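The ArrayList to LinkedList transformation is possible because both classes implement the same `java.util.List` interface, so only the constructor call changes. The sketch below is a hypothetical workload, not one of the transformations artemis produced: it drains a list from the front, an access pattern in which each `remove(0)` shifts all remaining elements in an ArrayList but only relinks the head in a LinkedList.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

final class ListSwapSketch {
    /** Fills a list with 0..n-1, then removes elements from the front and sums them. */
    static long drainFromFront(Supplier<List<Integer>> factory, int n) {
        List<Integer> items = factory.get(); // the only place the implementation is named
        for (int i = 0; i < n; i++) {
            items.add(i);
        }
        long sum = 0;
        while (!items.isEmpty()) {
            sum += items.remove(0); // remove(int index), not remove(Object)
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000;
        long withArrayList = drainFromFront(ArrayList::new, n);
        long withLinkedList = drainFromFront(LinkedList::new, n);
        // Both variants compute the same result; only their cost profile differs.
        System.out.println(withArrayList == withLinkedList);
    }
}
```

Because the behaviour is identical under both implementations, a test suite can check that the swap preserves functionality while the profiler decides which variant is cheaper for the program's actual workload.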

Finding3: artemis extracted a median of $12$ Darwinian data structures from each program and the optimal solutions had a median of $5$ data structure changes from the original versions of the program.

RQ4: What is the cost of using artemis?

In order for artemis to be practical and useful in real-world situations, it is important to understand the cost of using it. The aforementioned experimental studies reveal that, even for the popular programs, the existing selection of the data structure and the setting of its parameters may be sub-optimal. Therefore, optimising the data structures and their parameters can still provide significant improvement on non-functional properties. To answer this research question, the cost of artemis for optimising a program is measured by the cost of computational resources it uses. In this study, we used a Microsoft Azure™ D4-v2 machine, which costs £ $0.41$ per hour at a standard Pay-As-You-Go rate (https://azure.microsoft.com/en-gb/pricing/), to conduct all experiments.

The experiments show that an optimisation process takes $3.05$ hours on average for all studied subjects. The programs GraphJet and jimfs are the most and the least time-consuming programs respectively, with $19.16$ hours and $3.12$ minutes optimisation time. Accordingly, the average cost of applying artemis for the subjects studied is £ $1.25$ , with a range from £ $0.02$ to £ $7.86$ . The experimental results show that the overall cost of using artemis is negligible compared to a human software engineer, under the assumption that a competent software engineer can find those optimisations in a reasonable time.
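These costs follow from multiplying the optimisation time by the hourly rate of £ $0.41$ stated above. A minimal sketch of the arithmetic, in which the class and method names are hypothetical and rounding to whole pence is an assumption:

```java
final class OptimisationCost {
    /** Hourly Pay-As-You-Go rate of the Azure D4-v2 machine, as stated in the text. */
    static final double POUNDS_PER_HOUR = 0.41;

    /** Cost of the given machine time, rounded to whole pence. */
    static double costInPounds(double machineHours) {
        double raw = machineHours * POUNDS_PER_HOUR;
        return Math.round(raw * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("average (3.05 h): " + costInPounds(3.05));
        System.out.println("GraphJet (19.16 h): " + costInPounds(19.16));
        System.out.println("jimfs (3.12 min): " + costInPounds(3.12 / 60.0));
    }
}
```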

artemis transforms the selection of data structure and sets parameters by rewriting source code, thereby allowing human developers to easily investigate its changes and gain insight about the usage of data structures and the characteristics of the program.

Finding4: The cost of using artemis is negligible, with an average of £ $1.25$ per project, providing engineers with insights about the optimal variants of the project under optimisation.

To show the versatility of the artemis framework, we ask RQ2, RQ3 and RQ4 over Google guetzli, a very popular JPEG encoder written in C++. We used the STL containers and their operations as Darwinian data structures. More specifically, we considered push_back and emplace_back as equivalent implementations of the same functionality and exposed those as tunable parameters to artemis’s optimiser. We collected a random sample of images (available online at http://darwinianoptimiser.com/corpus) and used it to construct a performance suite that evaluates the execution time of guetzli.

We answer RQ2 by showing that artemis found an optimal solution that improves execution time by $7\%$ . We answer RQ3 by showing that artemis extracted and tuned $25$ parameters and found an optimal solution with $11$ parameter changes. artemis spent $1.5$ hours (costing £ $0.62$ ) to find optimal solutions, which is within the limits reported in RQ4. Last, we spent approximately $4$ days extending artemis to support C++, using the parserless mode.

6. Threats to Validity

Section 5.1 discusses the steps we took to address the threats to the external validity of the results we present here. In short, we built three subcorpora, each more representative than the last, for a total of $43$ programs, diverse in size and domain. The biggest threat to the internal validity of our work is the difficulty of taking accurate performance measurements of applications running on a VM, like the JVM. Section 5.2 details the steps, drawn from best practice, we took to address this threat. In essence, we conducted calibration experiments to adjust the parameters such that the algorithm converges quickly and stops after the results become stable. For measuring the non-functional properties, we carefully chose the JConsole profiler, which directly gathers runtime information from the JDK, such that the measurement error is minimised. Moreover, we carefully tuned JConsole to further improve the precision of the measurements by maximising its sampling frequency such that it does not miss any measurements while minimising the CPU overhead. To cater for the stochastic nature of artemis and to provide statistical power for the results, we ran each experiment $30$ times and manually checked that experiments had a steady state and exhibited low variance.

7. Related Work

Multi-objective Darwinian Data Structure selection and optimisation stands between two areas: search-based software engineering and data structure performance optimisation.

7.1. Search-based software engineering

Previous work applies Genetic Programming (poli2008field, ; lingbo6, ; petke2017genetic, ; lingbo1, ; fan1, ; fan2, ) to either improve the functionality (bug fixing) (6227211, ; lingbo5, ) or non-functional properties of a program (Bruce:2015:REC:2739480.2754752, ; petke2014using, ; langdon2014improving, ; fan1, ; fan2, ; fan3, ). Their approaches use existing code as the code base and replace some of the source code in the program under optimisation with the code from the code base. However, many of these approaches rely on the Plastic Surgery Hypothesis (Barr:2014:PSH:2635868.2635898, ), which assumes that the solutions exist in the code base. artemis, on the other hand, does not rely on this hypothesis. artemis can harvest Darwinian data structures not only from the program but also from external code repositories; further, artemis relies on a set of transformation rules that it can automatically and exhaustively extract from library code or documentation.

Wu et al. (fan2015, ) introduced a mutation-based method to expose “deep” parameters, similar to those we optimise in this paper, from the program under optimisation, and tuned these parameters along with “shallow” parameters to improve the time and memory performance of the program. Though the idea of exposing additional tunable parameters is similar to artemis, their approach did not optimise data structure selection, which can sometimes be more rewarding than just tuning the parameters. Moreover, they applied their approach to a memory management library to benefit that library’s clients. The extent of improvement usually depends on how much a program relies on that library. In contrast, artemis directly applies to the source code of the program, making no assumptions about which libraries the program uses, affording artemis much wider applicability.

7.2. Data structure optimisation and bloat

A body of work (bloat1, ; bloat2, ; bloat3, ; bloat4, ; bloat5, ; nagel2017self, ; basios2017optimising, ) has attempted to identify bloat arising from data structures. In 2009, Shacham et al. (Shacham:2009:CAS:1542476.1542522, ; Shacham:2009:CAS:1543135.1542522, ) introduced a semantic profiler that provides online collection-usage semantics for Java programs. They instrumented the Java Virtual Machine (JVM) to gather the usage statistics of collection data structures. Using heuristics, they suggest a potentially better choice of data structure for a program.

Though developers can add heuristics, if they lack sufficient knowledge about the data structures, they may bias the heuristics and jeopardise the effectiveness of the approach. artemis directly uses the performance of a data structure profiled against a set of performance tests to determine the optimal choices of data structures. Therefore, artemis does not depend on expert human knowledge about the internal implementation and performance differences of data structures to formulate heuristics. Instead, artemis relies on carefully-chosen performance tests to minimise bias. Furthermore, artemis directly modifies the program instead of providing hints, thus users can use the fine-tuned program artemis generates without any additional manual adjustment.

Other frameworks provide users with manually or automatically generated selection heuristics to improve the data structure selection process. JitDS (jitds, ) exploits declarative specifications embedded by experts in data structures to adapt them. CollectionSwitch (costa2018collectionswitch, ) uses data and user-defined performance rules to select other data structure variants. Brainy (jung2011brainy, ) provides users with machine learning cost models that guide the selection of data structures. artemis does not require expert annotations, user-defined rules or any machine learning knowledge. Storage strategies (storagestrategies, ) change VMs to optimise their performance on collections that contain a single primitive type; artemis rewrites source code, handles user-defined types and does not make VM modifications.

In 2014, Manotas et al. (Manotas:2014:SSE:2568225.2568297, ) introduced a collection data structure replacement and optimisation framework named SEEDS. Their framework exhaustively replaces the collection data structures in Java applications with other data structures and automatically selects the most energy efficient one to improve the overall energy performance of the application. Conceptually, artemis extends this approach to optimise both the data structures and their initialisation parameters. artemis also extends the optimisation objectives from a single objective to three objectives and uses Pareto non-dominated solutions to show the trade-offs between these objectives. Due to the much larger search space in our problem, the exhaustive search used by SEEDS is not practical; therefore, we adopted meta-heuristic search.

Furthermore, artemis directly transforms the source code of the programs whilst SEEDS transforms the bytecode, so artemis provides developers more intuitive information about what was changed and teaches them to use more efficient data structures. Moreover, artemis can be more easily applied to other languages as it does not depend on language specific static analysers and refactoring tools such as WALA (wala, ) and Eclipse IDE’s refactoring tools. In order to support another language we just need the grammar of that language and to implement a visitor that extracts a program’s Darwinian data structures. We note that antlr, which artemis uses, currently provides grammars for many languages (https://github.com/antlr/grammars-v4/).

Apart from the novelties mentioned above, this is, to our knowledge, the largest empirical study among similar work. In the studies mentioned above, only $4$ to $7$ subjects were included in the experiments. Our study included the DaCapo benchmark, $30$ sampled GitHub subjects and $8$ well-written popular subjects to show the effectiveness of artemis, therefore our results are statistically more meaningful.

8. Conclusion

Developers frequently use underperforming data structures and forget to optimise them with respect to some critical non-functional properties once the functionalities are fulfilled. In this paper, we introduced artemis, a novel multi-objective multi-language search-based framework that automatically selects and optimises Darwinian data structures and their arguments in a given program. artemis is language agnostic, meaning it can be easily adapted to any programming language; extending artemis to support C++ took approximately $4$ days. Given as input a data structure store with Darwinian implementations, it can automatically detect and optimise them along with any additional parameters to improve the non-functional properties of the given program. In a large empirical study on $5$ DaCapo benchmarks, $30$ randomly sampled projects and $8$ well-written popular GitHub projects, artemis found strong improvement for all of them. In extreme cases, artemis found $46\%$ improvement on execution time, $44.9\%$ improvement on memory consumption, and $49.7\%$ improvement on CPU usage. artemis found such improvements by making small changes in the source code; the median number of lines artemis changes is $5$ . Thus, artemis is practical and can be easily used on other projects. Lastly, we estimated the cost of optimising a program in machine hours. With a price of £ $0.41$ per machine hour, the cost of optimising any subject in this study is less than £ $8$ , with an average of £ $1.25$ . Therefore, we conclude that artemis is a practical tool for optimising the data structures in large real-world programs.

Acknowledgements

We would like to thank Graham Barrett, David Martinez, Kenji Takeda and Nick Page for their invaluable assistance with respect to developing artemis. Lastly, we are grateful to Microsoft Azure and Microsoft Research for the resources and commercial support.

Bibliography

[1] M. Arnold, M. Hind, and B. G. Ryder. Online feedback-directed optimization of java. In ACM SIGPLAN Notices, volume 37, pages 111–129. ACM, 2002.
[2] E. T. Barr, Y. Brun, P. Devanbu, M. Harman, and F. Sarro. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 306–317, New York, NY, USA, 2014. ACM.
[3] M. Basios, L. Li, F. Wu, L. Kanthan, and E. T. Barr. Optimising darwinian data structures on google guava. In International Symposium on Search Based Software Engineering, pages 161–167. Springer, 2017.
[4] R. V. Binder. Testing object-oriented systems: models, patterns, and tools. Addison-Wesley Professional, 2000.
[5] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, et al. The dacapo benchmarks: Java benchmarking development and analysis. In ACM SIGPLAN Notices, volume 41, pages 169–190. ACM, 2006.
[6] C. F. Bolz, L. Diekmann, and L. Tratt. Storage strategies for collections in dynamically typed languages. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 167–182, New York, NY, USA, 2013. ACM.
[7] N. Brown, Y. Cai, Y. Guo, R. Kazman, M. Kim, P. Kruchten, E. Lim, A. MacCormack, R. Nord, I. Ozkaya, et al. Managing technical debt in software-reliant systems. In Proceedings of the FSE/SDP workshop on Future of software engineering research, pages 47–52. ACM, 2010.
[8] B. R. Bruce, J. Petke, and M. Harman. Reducing energy consumption using genetic improvement. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO ’15, pages 1327–1334, New York, NY, USA, 2015. ACM.