How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

Domenico Cotroneo; Luigi De Simone; Pietro Liguori; Roberto Natella; Nematollah Bidokhti

arXiv:1907.04055·cs.SE·September 4, 2019

TL;DR

This paper empirically investigates the severity and propagation of software failures in OpenStack, revealing many failures go undetected and can silently spread, highlighting the need for improved runtime checks.

Contribution

It provides an empirical analysis of failure impacts in OpenStack, emphasizing failure detection issues and propagation, and suggests improvements for fault containment.

Findings

01 Most failures are not detected promptly

02 Failures can silently propagate across components

03 Run-time checks need enhancement

Abstract

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and notified in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.

Tables (3)

Table 1. Assertion check failures.

Name	Description
FAILURE IMAGE ACTIVE	The created image does not transit into the ACTIVE state
FAILURE INSTANCE ACTIVE	The created instance does not transit into the ACTIVE state
FAILURE SSH	It is impossible to establish an SSH session to the created instance
FAILURE KEYPAIR	The creation of a keypair fails
FAILURE SECURITY GROUP	The creation of a security group and rules fails
FAILURE VOLUME CREATED	The creation of a volume fails
FAILURE VOLUME ATTACHED	Attaching a volume to an instance fails
FAILURE FLOATING IP CREATED	The creation of a floating IP fails
FAILURE FLOATING IP ADDED	Adding a floating IP to an instance fails
FAILURE PRIVATE NETWORK ACTIVE	The created network resource does not transit into the ACTIVE state
FAILURE PRIVATE SUBNET CREATED	The creation of a subnet fails
FAILURE ROUTER ACTIVE	The created router resource does not transit into the ACTIVE state
FAILURE ROUTER INTERFACE CREATED	The creation of a router interface fails

Table 2. Statistics on API Error latency.

	Subsys.	Avg [s]	50th %ile [s]	90th %ile [s]
API Errors after an Assertion failure	Nova	152.25	168.34	191.60
	Cinder	74.52	93.00	110.00
	Neutron	144.72	166.00	263.60
API Errors only	Nova	3.73	0.21	0.55
	Cinder	0.30	0.01	1.00
	Neutron	0.30	0.01	1.00

Table 3. Logging coverage of high-severity log messages.

	Logging coverage
Subsystem	API Errors only	Assertion failure only	Assertion failure and API Errors
Nova	90.32%	80.77%	82.56%
Cinder	100%	95.65%	100%
Neutron	98.67%	66.67%	95%

Code & Models

Repositories

dessertlab/OpenStack-Fault-Injection-Environment
Official

Full text

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, and Roberto Natella (Federico II University of Naples, Italy); Nematollah Bidokhti (Futurewei Technologies, Inc., USA)

(2019)

Keywords: Bug analysis; Fault injection; OpenStack

Published in: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), August 26–30, 2019, Tallinn, Estonia. DOI: 10.1145/3338906.3338916. ISBN: 978-1-4503-5572-8/19/08.

CCS concepts: Software and its engineering → Software fault tolerance; Software and its engineering → Software testing and debugging; Software and its engineering → Software reliability; Computer systems organization → Cloud computing

1. Introduction

Cloud management systems, such as OpenStack (OpenStack, 2018), are a fundamental element of cloud computing infrastructures. They provide abstractions and APIs for programmatically creating, destroying and snapshotting virtual machine instances; attaching and detaching volumes and IP addresses; configuring security, network, topology, and load balancing settings; and many other services to cloud infrastructure consumers. It is very difficult to avoid software bugs when implementing such a rich set of services: at the time of writing, the OpenStack project codebase consists of more than 9 million lines of code (LoC) (Black Duck Software, Inc., 2018; OpenStack project, 2018c), which implies thousands of residual software bugs even under the most optimistic assumptions on the bugs-per-LoC density (McConnell, 2004; Tanenbaum et al., 2006). As a result of these bugs, many high-severity failures have been occurring in cloud infrastructures of popular providers, causing outages of several hours and the unrecoverable loss of user data (Li et al., 2018; Musavi et al., 2016; Gunawi et al., 2014; Gunawi et al., 2016).

In order to prevent severe failures, software developers invest efforts in mitigating the consequences of residual bugs. Examples are defensive programming practices, such as assertion checking and logging, to detect an incorrect state of the system in a timely manner (Lyu, 2007; Florio and Blondia, 2008) and to provide system operators with useful information for quick troubleshooting (Yuan et al., 2012b; Yuan et al., 2012a; Farshchi et al., 2018). Another important approach to mitigate failures is to implement fault containment strategies. Examples are i) interrupting a service as soon as a failure occurs (i.e., a fail-stop behavior), thereby turning high-severity failures, such as data losses, into lower-severity API exceptions that can be gracefully handled (Candea and Fox, 2003; Swift et al., 2006; Oppenheimer et al., 2003); ii) notifying the cloud management system and operators about the failures through error logs, so that they can diagnose issues and undertake recovery actions, such as restoring a previous state checkpoint or backup (Weber et al., 2012; Fu et al., 2016); iii) separating system components across different domains to prevent cascading failures across components (Lee and Iyer, 1993; Arlat et al., 2002; Herder et al., 2009).

In this paper, we aim to empirically analyze the impact of high-severity failures in the context of a large-scale, industry-applied case study, to pave the way for failure mitigation strategies in cloud management systems. In particular, we analyze the OpenStack project, which is the basis for many commercial cloud management products (OpenStack project, 2018b) and is widespread among public cloud providers and private users (OpenStack project, 2018d). Moreover, OpenStack is a representative real-world large software system, which includes several sub-systems for managing instances (Nova), volumes (Cinder), virtual networks (Neutron), etc., and orchestrates them to deliver rich cloud computing services.

We adopt software fault injection to accelerate the occurrence of failures caused by software bugs (Christmansson and Chillarege, 1996a; Voas et al., 1997; Natella et al., 2016): our approach deliberately injects bugs in one of the system components and analyzes the reaction of the cloud system in terms of fail-stop behavior, failure reporting through error logs, and failure propagation across components. We based fault injection on information on software bugs reported by OpenStack developers and users (OpenStack project, 2018a), in order to characterize frequent bug patterns occurring in this project. Then, we performed a large fault injection campaign on the three major subsystems of OpenStack (i.e., Nova, Cinder, and Neutron), for a total of 911 experiments. The analysis of fault injections pointed out the impact of the injected bugs on the end-users (e.g., service unavailability and resource inconsistencies) and on the ability of the system to recover and to report about the failure (e.g., the contents of log files, and the error notifications raised by the OpenStack service API). Results of the experimental campaign revealed the following findings:

• In the majority of the experiments (55.8%), OpenStack failures were not mitigated by a fail-stop behavior, leaving resources in an inconsistent state (e.g., instances were not active, volumes were not attached) unbeknownst to the user. In 31.3% of these failures, the problem was never notified to the user through exceptions; the others were only notified after a long delay (longer than 2 minutes on average). This behavior threatens data integrity during the period between the occurrence of the failure and its notification (if any) and hinders failure recovery actions.

• In a small fraction of the experiments (8.5%), there was no indication of the failure in the logs. These cases represent a high risk for system operators since they lack clues for understanding the failure and restoring the availability of services and resources.

• In many of the failures (37.5%), the injected bugs propagated across several OpenStack components. Indeed, 68.3% of these failures were notified by a different component from the injected one. Moreover, there were relevant cases of failures that caused subtle residual effects on OpenStack (7.5%): even after removing the injected bug from OpenStack, cleaning up all virtual resources, and restarting the workload on a set of new resources, the OpenStack services were still experiencing a failure that could only be recovered by fully restarting the OpenStack platform and restoring its internal database from a backup.

These results point out the risk that failures are not detected and notified in a timely manner, and that they can silently propagate through the system. Based on this analysis, we identify a set of directions towards more reliable cloud management systems. To support future research in this field, we share an artifact for configuring our fault injection environment inside a virtual machine, and our dataset of failures, which includes the injected faults, the workload, the effects of the failures (both the user-side impact and our own in-depth correctness checks), and the error logs produced by OpenStack.

In the following, Section 2 elaborates on the research problem; Section 3 describes our methodology; Section 4 presents experimental results; Section 5 discusses related work; Section 6 includes links to the artifacts to support future research; Section 7 concludes the paper.

2. Overview on the research problem

Mitigating the severity of software failures caused by residual bugs is a relevant issue for high-reliability systems (Cotroneo et al., 2013), yet it still represents an open research challenge. Ideally, when a fault occurs, a service should be able to mask the fault or recover from it in a way that is transparent to the user, for example by leveraging redundancy. However, this is often not possible in the case of software bugs. Since software bugs are human mistakes in the source code, the traditional fault-tolerance strategies for hardware and network faults often do not apply. For example, if a service is broken because of a regression bug, then retrying the service API with the same inputs would again result in a failure; a retry would only succeed if the software bug is triggered by a transient condition, such as a race condition (Gray, 1986; Grottke and Trivedi, 2007; Carrozza et al., 2013). If recovery is not possible, the failed operation must necessarily be aborted and the user should be notified (Netflix Inc., 2017; Microsoft Corp., 2017b), so that the failure can be handled at a higher level of the business logic. For example, the end-user can skip the failed operation, or put the workflow on hold until the bug is fixed. If the failure does not immediately generate an exception from the OS or from the programming language run-time, the service may continue its faulty execution until it subtly corrupts the results or the state of resources. Such cases need to be mitigated by architecting the software into small, de-coupled components for fault containment, in order to limit the scope of failure (e.g., the bulkhead pattern (Netflix Inc., 2017; Microsoft Corp., 2017a)); and by applying defensive programming practices to perform redundant checks on the correctness of a service (e.g., pre- and post-conditions to check that a resource has indeed been allocated or updated). In this way, the system can enforce a fail-stop behavior of the service (e.g., interrupting an API call that experiences a failure, and generating an exception), so that it can avoid data corruption and limit the outage to a small part of the system (e.g., an individual service call).
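As an illustration of the pre- and post-condition checks mentioned above, the following is a minimal sketch, not code from OpenStack. The names `create_volume`, `ResourceStateError`, and `FakeBackend` are hypothetical and exist only for this example. The service call re-checks the resource state after the allocation and raises an exception instead of returning a result that looks successful.

```python
class ResourceStateError(Exception):
    """Raised when a post-condition on a resource does not hold."""


def create_volume(backend, name, size_gb):
    """Create a volume and verify the result before returning (fail-stop)."""
    # Pre-condition: reject obviously invalid requests up front.
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")

    volume_id = backend.allocate(name, size_gb)

    # Post-condition: redundant check that the resource really exists and
    # is in the expected state, instead of trusting the call above.
    status = backend.status(volume_id)
    if status != "available":
        raise ResourceStateError(
            f"volume {volume_id!r} is {status!r}, expected 'available'"
        )
    return volume_id


class FakeBackend:
    """Hypothetical in-memory backend; buggy=True emulates a silent failure."""

    def __init__(self, buggy=False):
        self.buggy = buggy
        self.volumes = {}

    def allocate(self, name, size_gb):
        volume_id = f"vol-{len(self.volumes) + 1}"
        # A buggy backend records the volume but never makes it available.
        self.volumes[volume_id] = "error" if self.buggy else "available"
        return volume_id

    def status(self, volume_id):
        return self.volumes.get(volume_id, "missing")
```

With `FakeBackend(buggy=True)`, the call raises `ResourceStateError` at the point of failure, so the caller is notified immediately rather than discovering the inconsistent volume in a later operation.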

In this work, we study the extent of this problem in the context of a cloud management system. Applying software fault tolerance principles in such a large distributed system is difficult since its design and implementation is a trade-off between several objectives, including performance, backward compatibility, programming convenience, etc., which opens up the possibility of failure propagation beyond fault containment limits. We investigate this problem from three perspectives, by addressing the following three questions.

$\rhd$ In the case that a service experiences a failure, is it able to exhibit a fail-stop behavior? If a service request could not be completed because of a failure, the service API should return an exception to inform about the issue. Therefore, we experimentally evaluate whether the service indeed halts on failure and whether the failure is explicitly notified to the user. In the worst case, the service API neither halts nor raises an exception, and the state of resources is inconsistent with respect to what the user is expecting (e.g., a VM instance was not actually created, or is indefinitely in the “building” state).

$\rhd$ Are error reporting mechanisms able to point out the occurrence of a failure? Error logs are a valuable source of information for automated recovery mechanisms and system operators to detect failures and restore service availability; and for developers to investigate the root cause of the failure. However, there can be gaps between failures and log messages. We analyze the cases in which the logs do not record any anomalous event related to a failure, since the software may lack checks to detect the anomalous events.

$\rhd$ Are failures propagated across the services of the cloud management system? To mitigate the severity of failures, it is desirable that a failure is limited to the specific service API that is affected by a software bug. If the failure impacts other services beyond the buggy one (e.g., the incorrect initialization of a VM instance also causes the failure of subsequent operations on the instance), it is more difficult to identify the root cause of the problem and to recover from the failure. Similarly, the failure may cause lasting effects on the cloud infrastructure (e.g., the virtual resources allocated for a failed instance cannot be reclaimed, or interfere with other resource allocations) that are difficult to debug and to recover from. Therefore, we analyze whether failures can spread across different components of the system, and across several service calls.

3. Methodology

Our approach is to inject software bugs (§ 3.1, § 3.2) in order to obtain failure data from OpenStack (§ 3.3). Then, we analyze whether the system could gracefully mitigate the impact of the failures (§ 3.4).

3.1. Bug analysis

A key aspect of software fault injection experiments is to inject representative software bugs (Christmansson and Chillarege, 1996a; Duraes and Madeira, 2006). Since the body of knowledge on bugs in Python software (Rodríguez-Pérez et al., 2018; Orrú et al., 2015), the programming language of OpenStack, is smaller than for other languages, we sought more insights about bugs in the OpenStack project. Therefore, we analyzed the OpenStack issue tracker on the Launchpad portal (OpenStack, 2018a), by looking for bug-fixes at the source code level, in order to identify bug patterns (Duraes and Madeira, 2006; Pan et al., 2009; Martinez et al., 2013; Zhong and Meng, 2018; Tufano et al., 2018) for this project. From these patterns, we defined a set of bug types to be injected.

We went through the problem reports and inspected the related source code. We looked for reports where: (i) the root cause of the problem was a software bug, excluding build, packaging and installation issues; (ii) the problem had been marked with the highest severity level (i.e., the problem has a strong impact on OpenStack services); (iii) the problem was fixed, and the bug-fix was linked to the discussion. We manually analyzed a sample of 179 problem reports from the Launchpad, focusing on entries with importance set to “Critical”, and with status set to “Fix Committed” or “Fix Released” (such that the problem report also includes a final solution shipped in OpenStack). Of these problem reports, we identified 113 reports that met all three criteria. We shared the full set of bug reports (see Section 6).

The bugs encompass several areas of OpenStack, including: bugs that affected the service APIs exposed to users (e.g., nova-api); bugs that affected dictionaries and arrays, such as a wrong key used in image['imageId']; bugs that affected SQL queries (e.g., database queries for information about instances in Nova); bugs that affected RPC calls between OpenStack subsystems (e.g., rpc.cast was omitted, or had a wrong topic or contents); bugs that affected calls to external system software, such as iptables and dnsmasq; bugs that affected pluggable modules in OpenStack, such as network protocol plugins and agents in Neutron. Figure 1 shows statistics about the bug types that we identified from the problem reports and their bug-fixes. The five most frequent bug types are the following.

$\blacksquare$ Wrong parameters value: The bug was an incorrect method call inside OpenStack, where a wrong variable was passed to the method call. For example, this was the case of the Nova bug #1130718 (https://bugs.launchpad.net/nova/+bug/1130718, which was fixed in https://review.openstack.org/#/c/22431/ by changing the exit codes passed through the parameter check_exit_code).

$\blacksquare$ Missing parameters: A method call was invoked with omitted parameters (e.g., the method used a default parameter instead of the correct one). For example, this was the case of the Nova bug #1061166 (https://bugs.launchpad.net/nova/+bug/1061166, which was fixed in https://review.openstack.org/#/c/14240/ by adding the parameter read_deleted='yes' when calling the SQLAlchemy APIs).

$\blacksquare$ Missing function call: A method call was entirely omitted. For example, this was the case of the Nova bug #1039400 (https://bugs.launchpad.net/nova/+bug/1039400, which was fixed in https://review.openstack.org/#/c/12173/ by adding and calling the new method trigger_security_group_members_refresh).

$\blacksquare$ Wrong return value: A method returned an incorrect value (e.g., None instead of a Python object). For example, this was the case of the Nova bug #855030 (https://bugs.launchpad.net/nova/+bug/855030, which was fixed in https://review.openstack.org/#/c/1930/ by returning an object allocated through allocate_fixed_ip).

$\blacksquare$ Missing exception handlers: A method call lacks exception handling. For example, this was the case of the Nova bug #1096722 (https://bugs.launchpad.net/nova/+bug/1096722, which was fixed in https://review.openstack.org/#/c/19069/ by adding an exception handler for exception.InstanceNotFound).

3.2. Fault injection

In this study, we perform software fault injection to analyze the impact of software bugs (Voas et al., 1997; Christmansson and Chillarege, 1996a; Natella et al., 2016). This approach deliberately introduces programming mistakes in the source code, by replacing parts of the original source code with faulty code. For example, in Figure 2, the injected bug emulates a missing optional parameter (a port number) to a function call, which may cause a failure under certain conditions (e.g., a VM instance may not be reachable through an intended port). This approach is based on previous empirical studies, which observed that the injection of code changes can realistically emulate software faults (Daran and Thévenod-Fosse, 1996; Christmansson and Chillarege, 1996a; Andrews et al., 2005), in the sense that code changes produce run-time errors that are similar to the ones produced by real software faults. This approach is motivated by the high effort that would be needed for experimenting with hand-crafted bugs or with real past bugs: in these cases, every bug would require carefully crafting the specific conditions that trigger it (i.e., the topology of the infrastructure, the software configuration, and the hardware devices under which the bug surfaces). To achieve a match between injected and real bugs, we focus the injection on the five most frequent types found by the bug analysis. These bug types allow us to cover all of the main areas of OpenStack (API, SQL, etc.), and suffice to generate a large and diverse set of faults over the codebase of OpenStack.

We emulate the bug types by mutating the existing code of OpenStack. Figure 2 shows the steps of a fault injection experiment. We developed a tool to automate the bug injection process in Python code. The tool uses the ast Python module to generate an abstract syntax tree (AST) representation of the source code; then, it scans the AST by looking for relevant elements (function calls, expressions, etc.) where the bug types could be injected; it modifies the AST, by removing or replacing the nodes to introduce the bug; finally, it rewrites the modified AST into Python code, using the astunparse Python module. To inject the bug types of Section 3.1, we modify or remove method calls and their parameters. We targeted method calls related to the bugs that we analyzed, by targeting calls to internal APIs for managing instances, volumes, and networks (which are denoted by specific keywords, e.g., instance and nova for the methods of the Nova subsystem). Wrong input and parameters are injected by wrapping the target expression into a function call, which returns at run-time a corrupted version of the expression based on its data type (e.g., a null reference in place of an object reference, or a negative value in place of an integer). Exceptions are raised on method calls according to a pre-defined list of exception types.
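The parse, scan, mutate, and rewrite steps described above can be sketched with the standard library alone. This is an illustrative sketch of the "missing parameters" bug type, not the authors' tool. It assumes Python 3.9 or later and uses the built-in `ast.unparse` in place of the `astunparse` module named in the text. The class `DropKeyword`, the function `inject_missing_parameter`, and the target call `utils.open_port` are hypothetical names chosen for the example.

```python
import ast


class DropKeyword(ast.NodeTransformer):
    """Remove one keyword argument from every call to a target function."""

    def __init__(self, func_name, keyword):
        self.func_name = func_name
        self.keyword = keyword
        self.injected = 0  # number of arguments removed

    def visit_Call(self, node):
        self.generic_visit(node)  # also visit calls nested in the arguments
        callee = node.func
        # Match both plain calls `f(...)` and attribute calls `obj.f(...)`.
        if isinstance(callee, ast.Name):
            name = callee.id
        else:
            name = getattr(callee, "attr", None)
        if name == self.func_name:
            kept = [kw for kw in node.keywords if kw.arg != self.keyword]
            self.injected += len(node.keywords) - len(kept)
            node.keywords = kept
        return node


def inject_missing_parameter(source, func_name, keyword):
    """Return (mutated_source, number_of_injections)."""
    tree = ast.parse(source)                      # 1. build the AST
    mutator = DropKeyword(func_name, keyword)
    tree = mutator.visit(tree)                    # 2-3. scan and mutate
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), mutator.injected    # 4. rewrite to source


# Hypothetical target statement, loosely modeled on the port example.
original = "rule = utils.open_port(instance, protocol='tcp', port=22)\n"
mutated, count = inject_missing_parameter(original, "open_port", "port")
print(mutated)
```

The other bug types fit the same transformer pattern: a "missing function call" would return `None` from a visitor for the matching expression statement, and a "wrong parameter value" would wrap the argument node in a call to a corrupting helper.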

The tool inserts fault-injected statements into an if block, together with the original version of the same statements but in a different branch (as in step 2 in Figure 2). The execution of the fault-injected code is controlled by a trigger variable, which is stored in a shared memory area that is writable from an external program. This approach has been adopted for controlling the occurrence of failures during the tests. In the first phase (round 1), we enable the fault-injected code, and we run a workload that exercises the service APIs of the cloud management system. During this phase, the fault-injected code could generate run-time errors inside the system, which will potentially lead to user-perceived failures. Afterward, in a second phase (round 2), we disable the injected bug, and we execute the workload for a second time. This fault-free execution points out whether the scope of run-time errors (generated by the first phase) is limited to the service API invocations that triggered the buggy code (e.g., the bug only impacts local session data). If failures still occur during the second phase, then the system has not been able to handle the run-time errors of the first phase. Such failures point out the propagation of effects across the cloud management system (see § 2).
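The trigger mechanism can be illustrated with a small sketch. The text does not specify how the shared memory area is implemented, so this example assumes a one-byte memory-mapped file as a stand-in. `FaultTrigger`, `open_port`, and `configure_access` are hypothetical names, not part of the authors' tool or of OpenStack.

```python
import mmap
import os
import tempfile


class FaultTrigger:
    """One-byte flag in a memory-mapped file, writable by other processes."""

    def __init__(self, path):
        # Create the backing file on first use, with the fault disabled.
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            with open(path, "wb") as handle:
                handle.write(b"\x00")
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 1)

    def enabled(self):
        return self._map[0] == 1

    def set(self, on):
        self._map[0] = 1 if on else 0
        self._map.flush()

    def close(self):
        self._map.close()
        self._file.close()


def open_port(instance, protocol, port=None):
    """Toy target function; returns a description of the rule it would add."""
    return f"{instance}:{protocol}:{port if port is not None else 'any'}"


def configure_access(instance, trigger):
    if trigger.enabled():
        # Fault-injected branch: the optional `port` argument is omitted.
        return open_port(instance, "tcp")
    else:
        # Original statement, kept in the other branch.
        return open_port(instance, "tcp", port=22)


path = os.path.join(tempfile.mkdtemp(), "fault_trigger.bin")
trigger = FaultTrigger(path)

trigger.set(True)                                # round 1: fault enabled
round_1 = configure_access("vm-1", trigger)
trigger.set(False)                               # round 2: fault disabled
round_2 = configure_access("vm-1", trigger)
trigger.close()
```

Because the flag lives outside the process under test, an external controller can switch between the two rounds without redeploying or restarting the mutated service, which is what allows residual effects of round 1 to remain observable in round 2.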

We implemented a workload generator to automatically exercise the service APIs of the main OpenStack sub-systems. The workload has been designed to cover several sub-systems of OpenStack and several types of virtual resources, in a similar way to integration test cases from the OpenStack project (OpenStack, 2018b). The workload creates VM instances, along with key pairs and a security group; attaches the instances to volumes; creates a virtual network, with virtual routers; and assigns floating IPs to connect the instances to the virtual network. Having a comprehensive workload allows us to point out propagation effects across sub-systems caused by bugs.

The experimental workflow is repeated several times. Every experiment injects a different fault, and only one fault is injected per experiment. Before a new experiment, we clean up any potential residual effect from the previous experiment, in order to be able to relate a failure to the specific bug that caused it. The clean-up re-deploys OpenStack, removes all temporary files and processes, and restores the database to its initial state. However, we do not perform these clean-up operations between the two workload rounds (i.e., no clean-up between the steps 6 and 8 of Figure 2), since we want to assess the impact of residual side effects caused by the bug.

3.3. Failure data collection

During the execution of the workload, we record inputs and outputs of service API calls of OpenStack. Any exception generated from the call (API Errors) is also recorded. In-between calls to service APIs, the workload also performs assertion checks on the status of the virtual resources, in order to point out failures of the cloud management system. In the context of our methodology, assertion checks serve as ground truth about the occurrence of failures during the experiments. These checks are valuable to point out the cases in which a fault causes an error, but the system does not generate an API error (i.e., the system is unaware of the failure state). Our assertion checks are similar to the ones performed by the integration tests as test oracles (Ju et al., 2013a; OpenStack, 2018c): they assess the connectivity of the instances through SSH and query the OpenStack API to check that the status of the instances, volumes and network is consistent with the expectation of the test cases. The assertion checks are performed by our workload generator. For example, after invoking the API for creating a volume, the workload queries the volume status to check if it is available (VOLUME CREATED assertion). These checks are useful to find failures not notified through the API errors. Table 1 describes the assertion checks.

If an API call generates an error, the workload is aborted, as no further operation is possible on the resources affected by the failure (e.g., no volume could be attached if the instance could not be created). In the case that the system fails without raising an exception (i.e., an assertion check highlights a failure, but the system does not generate an API error), the workload continues the execution (as a hypothetical end-user, being unaware of the failure, would do), regardless of failed assertion check(s). The workload generator records the outcomes of both the API calls and of the assertion checks. Moreover, we collect all the log files generated by the cloud management system. This data is later analyzed for understanding the behavior of the system under failure.
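The control flow of the workload (abort on an API error, continue after a failed assertion check, record both) can be sketched as follows. This is a simplified illustration rather than the authors' workload generator. `ApiError`, `run_workload`, and the three toy steps are hypothetical, and the shared `state` dictionary stands in for the real cloud resources.

```python
class ApiError(Exception):
    """Stand-in for an exception raised by a service API call."""


def run_workload(steps):
    """Run (name, api_call, assertion_check) steps and record the outcomes.

    An API error aborts the workload; a failed assertion check is recorded
    but does not stop it, as an unaware end-user would keep going.
    """
    events = []
    for name, api_call, check in steps:
        try:
            api_call()
        except ApiError as exc:
            events.append(("api_error", name, str(exc)))
            break
        events.append(("api_ok", name, ""))
        if check is not None and not check():
            events.append(("assertion_failure", name, ""))
    return events


# Toy scenario: instance creation "succeeds" but leaves the instance in ERROR.
state = {"instance": "ERROR", "volume": None}


def create_instance():
    pass  # returns normally although the instance never becomes ACTIVE


def create_volume():
    state["volume"] = "available"


def attach_volume():
    if state["instance"] != "ACTIVE":
        raise ApiError("instance is not active")


steps = [
    ("instance_create", create_instance, lambda: state["instance"] == "ACTIVE"),
    ("volume_create", create_volume, lambda: state["volume"] == "available"),
    ("volume_attach", attach_volume, None),
]
events = run_workload(steps)
```

In this toy run the failure is first visible only to the assertion check after `instance_create`, and the API reports an error two calls later, which mirrors the delayed notification discussed in the text.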

3.4. Failure analysis

We analyze fault injection experiments according to the three perspectives discussed in Section 2. The first perspective classifies the experiments with respect to the type of failure that the system experiences. The possible cases are the following ones.

$\blacksquare$ API Error: In these cases, the workload was not able to correctly execute, due to an exception raised by a service API call. In these cases, the cloud management system has been able to handle the failure in a fail-stop way, since the user is informed by the exception that the virtual resources could not be used, and it can perform recovery actions to address the failure. In our experiments, the workload stops on the occurrence of an exception, as discussed before.

$\blacksquare$ Assertion failure: In these cases, the failure was not pointed out by an exception raised by a service API. The failure was detected by the assertion checks made by the workload in-between API calls, which found an incorrect state of virtual resources. In these cases, the execution of the workload was not interrupted, as no exception was raised by the service APIs during the whole experiment, and the service API did (apparently) work from the perspective of the user. These cases point out non-fail-stop behavior.

$\blacksquare$ Assertion failure(s), followed by an API Error: In these cases, the failure was initially detected by assertion checks, which found an incorrect state of virtual resources in-between API calls. After the assertion check detected the failure, the workload continued the execution, by performing further service API calls, until an API error occurred in a later API call. These cases also point out issues in handling the failure, since the user is unaware of the failure state and cannot perform recovery actions.

$\blacksquare$ No failure: The injected bug did not cause a failure that could be perceived by the user (neither by API exceptions nor by assertion checks). It is possible that the effects of the bug were tolerated by the system (e.g., the system switched to an alternative execution path to provide the service); or, the injected source code was harmless (e.g., an uninitialized variable is later assigned before use). Since no failure occurred, these experiments are not further analyzed, as they do not allow us to draw conclusions on the failure behavior of the system.
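The four cases above amount to a small decision rule over the ordered outcomes recorded during a round. A minimal sketch follows, assuming each experiment is summarized as a list of event labels. The labels `"api_error"` and `"assertion_failure"` and the function `classify_failure` are hypothetical names introduced for illustration.

```python
def classify_failure(events):
    """Map the ordered event labels of one workload run to a failure type."""
    kinds = [e for e in events if e in ("assertion_failure", "api_error")]
    if not kinds:
        return "No failure"
    if "api_error" not in kinds:
        # The workload ran to completion, but some resource state was wrong.
        return "Assertion failure"
    if kinds[0] == "api_error":
        # The first symptom was an exception: fail-stop behavior.
        return "API Error"
    return "Assertion failure(s), followed by an API Error"


examples = {
    "clean run": ["api_ok", "api_ok"],
    "fail-stop": ["api_ok", "api_error"],
    "silent": ["api_ok", "assertion_failure", "api_ok"],
    "late error": ["api_ok", "assertion_failure", "api_ok", "api_error"],
}
for label, run in examples.items():
    print(f"{label}: {classify_failure(run)}")
```

Only the "API Error" outcome corresponds to fail-stop behavior; the two assertion-failure outcomes are the non-fail-stop cases.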

Failed executions are further classified according to a second perspective, with respect to the execution round in which the system experienced a failure. The possible cases are the following ones.

$\rhd$ Failure in the faulty round only: In these cases, a failure occurred in the first (faulty) execution round (Figure 2), in which a bug has been injected; and no failure is observed during the second (fault-free) execution round, in which the injected bug is disabled, and in which the workload operates on a new set of resources. This behavior is the likely outcome of an experiment since we are deliberately forcing a service failure only in the first round through the injected bug.

$\rhd$ Failure in the fault-free round (despite the faulty round): These cases are a concern for fault containment, since the system is still experiencing failures even though the bug is disabled after the first round and the workload operates on a new set of resources. This behavior is due to residual effects of the bug that propagated through session state, persistent data, or other shared resources.

Finally, the experiments with failures are classified from the perspective of whether they generated logs able to indicate the failure. In order to make a system more resilient, we are interested in whether it produces information for detecting failures and for triggering recovery actions. In practice, developers are conservative in logging information for post-mortem analysis: they record high volumes of low-quality log messages that bury the truly important information among many trivial logs of similar severity and contents, making it difficult to locate issues (Zhu et al., 2015; Li et al., 2017; Yuan et al., 2012b). Therefore, we cannot simply rely on the presence of logs to conclude that a failure was detected.

To clarify the issue, Figure 3 shows the distribution of the number of log messages in OpenStack across severity levels, TRACE to CRITICAL, during the execution of our workload generator, and without any failure. We can notice that all OpenStack components generate a large number of messages with severity WARNING, INFO, and DEBUG even when there is no failure. Instead, there are no messages of severity ERROR or CRITICAL. Therefore, even if a failure is logged with severity WARNING or lower, such log messages cannot be adopted for automated detection and recovery of the failure, as it is difficult to distinguish between “informative” messages and actual issues. Consequently, to evaluate the ability of the system to support recovery and troubleshooting through logs, we classify failures according to the presence of one or more high-severity messages (i.e., CRITICAL or ERROR) recorded in the log files (logged failures), or the absence of any such message (non-logged failures).
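The logged / non-logged distinction can be sketched as a scan of the log files for high-severity entries. The example below is an illustration under assumptions, not the analysis scripts used in the study: it assumes each log line starts with date, time, process ID, and severity level (the usual default layout of OpenStack's oslo.log), and the name `is_logged_failure` is hypothetical.

```python
import re

# Assumed line layout: "<date> <time> <pid> <LEVEL> <logger> ...".
# Adjust the pattern if the deployment uses a different log format.
LEVEL_RE = re.compile(r"^\S+ \S+ \d+ (TRACE|DEBUG|INFO|WARNING|ERROR|CRITICAL) ")
HIGH_SEVERITY = {"ERROR", "CRITICAL"}


def is_logged_failure(log_lines):
    """Return True if at least one line has severity ERROR or CRITICAL."""
    for line in log_lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in HIGH_SEVERITY:
            return True
    return False
```

Lines that do not match the assumed layout (e.g., continuation lines of a stack trace) are ignored, so a failure counts as logged only when a properly tagged ERROR or CRITICAL entry exists.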

4. Experimental results

In this work, we present the analysis of OpenStack version 3.12.1 (release Pike), which was the latest version of OpenStack when we started this work. We injected bugs into the most fundamental services of OpenStack (Denton, 2015; Solberg, 2017): (i) the Nova subsystem, which provides services for provisioning instances (VMs) and handling their life cycle; (ii) the Cinder subsystem, which provides services for managing block storage for instances; and (iii) the Neutron subsystem, which provides services for provisioning virtual networks for instances, including resources such as floating IPs, ports and subnets. Each subsystem includes several components (e.g., the Nova sub-system includes nova-api, nova-compute, etc.), which interact through message queues internally to OpenStack. The Nova, Cinder, and Neutron sub-systems provide external REST API interfaces to cloud users.

Figure 4 shows the testbed used for the experimental analysis of OpenStack. We adopted an all-in-one virtualized deployment of OpenStack, in which the OpenStack services run on the same VM, for the following reasons: (1) to prevent interferences on the tests from transient issues in the physical network (e.g., sporadic network faults, network delays caused by other user traffic in our local data center, etc.); (2) to parallelize a high number of tests on several physical machines, by using the Packstack installation utility (RDO, 2018) to have a reproducible installation of OpenStack across the VMs; (3) to efficiently revert any persistent effect of a fault injection test on the OpenStack deployment (e.g., file system issues), in order to assure independence among the tests. Moreover, the all-in-one virtualized deployment is a common solution for performing tests on OpenStack (Red Hat, Inc., 2018; Markelov, 2016). The hardware and VM configuration for the testbed includes: 8 virtual Intel Xeon CPUs (E5-2630L v3 @ 1.80GHz); 16GB RAM; 150 GB storage; Linux CentOS v7.0.

In addition to the core services of OpenStack (e.g., Nova, Neutron, Cinder, etc.), the testbed also includes our own components to automate fault injection tests. The Injector Agent is the component that analyzes and instruments the source code of OpenStack. The Injector Agent can: (i) scan the source code to identify injectable locations (i.e., source-code statements where the bug types discussed in § 3.2 can be applied); (ii) instrument the source code by introducing logging statements in every injectable location, in order to get a profile of which locations are covered during the execution of the workload (coverage analysis); (iii) instrument the source code to introduce a bug into an individual injectable location.
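Step (i) can be illustrated for a single bug type with Python's standard `ast` module. This is a hedged sketch rather than the Injector Agent itself: it assumes that a candidate location for a MISSING FUNCTION CALL is a statement consisting solely of a call to one of a given set of target functions, and the names `find_call_statements` and `target_names` are hypothetical.

```python
import ast


def find_call_statements(source, target_names):
    """Return (line, name) pairs for statements that only call a target function.

    Such statements discard the return value, so removing them is one
    plausible way to emulate a MISSING FUNCTION CALL bug.
    """
    locations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            # Handle both plain calls `f(...)` and attribute calls `obj.f(...)`.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in target_names:
                locations.append((node.lineno, name))
    return sorted(locations)
```

Calls whose result is assigned or returned are deliberately skipped here; they would be candidates for other bug types that alter inputs or return values.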

The Controller orchestrates the experimental workflow. It first commands the Injector Agent to perform a preliminary coverage analysis, by instrumenting the source code with logging statements, restarting the OpenStack services, and launching the Workload Generator, but without injecting any fault. The Workload Generator issues a sequence of API calls in order to stimulate OpenStack services. The Controller retrieves the list of injectable locations and their coverage from the Injector Agent. Then, it iterates over the list of injectable locations that are covered, and issues commands for the Injector Agent to perform fault injection tests. For each test, the Injector Agent introduces an individual bug by mutating the source code, restarts the OpenStack services, starts the workload, and triggers the injected bug as discussed in § 3.2. The Injector Agent collects the log files from all OpenStack subsystems and from the Workload Generator, which are sent to the Controller for later analysis (§ 3.4).

We performed a full scan of injectable locations in the source code of Nova, Cinder, and Neutron, for a total of 2,016 analyzed source code files. We identified 911 injectable faults that were covered by the workload. Figure 5 shows the number of faults per sub-system and per type of fault. The number of faults for each type and sub-system depends on the number of calls to the target functions, and on their input and output parameters, as discussed in § 3.2. We executed one test per injectable location, by injecting one fault at a time.

After executing the tests, we found failures respectively in 52.6% (231 out of 439 tests), 46.4% (125 out of 269 tests), and 61% (124 out of 203 tests) of tests in Nova, Cinder, and Neutron, for a total of 480 failed tests. In the remaining 47.3% of the tests (431 out of 911 tests), instead, there was neither an API error nor an assertion failure: in these cases, the fault was not activated (even if the faulty code was covered by the workload), or there was no error propagation to the component interface. The occurrence of tests not causing failures is a typical phenomenon with code mutations, which may not infect the state even when the faulty code is executed (Christmansson and Chillarege, 1996b; Lanzaro et al., 2014). Yet, the injections provided us with a large and diverse set of failures for our analysis.
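The totals in this paragraph follow directly from the per-sub-system counts. The short script below recomputes them from the figures reported in the text; the variable names are illustrative only.

```python
# Failed tests and total tests per sub-system, as reported in the text.
results = {"Nova": (231, 439), "Cinder": (125, 269), "Neutron": (124, 203)}

total_failed = sum(failed for failed, _ in results.values())
total_tests = sum(total for _, total in results.values())
no_failure = total_tests - total_failed

for name, (failed, total) in results.items():
    print(f"{name}: {failed}/{total} = {100 * failed / total:.2f}%")
print(f"Tests with failures: {total_failed}/{total_tests}")
print(f"Tests without failures: {no_failure}/{total_tests} "
      f"= {100 * no_failure / total_tests:.2f}%")
```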

4.1. Does OpenStack show a fail-stop behavior?

We first analyze the impact of failures on the service interface APIs provided by OpenStack. The Workload Generator (which impersonates a user of the cloud management system) invokes these APIs, looks for errors returned by the APIs and performs assertion checks between API calls. A fail-stop behavior occurs when an API returns an error before any failed assertion check. In such cases, the Workload Generator stops on the occurrence of the API error. Instead, it is possible that an API invocation terminates without returning any error, but leaving the internal resources of the infrastructure (instances, volumes, etc.) in a failed state, which is reported by assertion checks. These cases represent violations of the fail-stop hypothesis, and represent a risk for the users as they are unaware of the failure. To investigate this aspect, we initially focus on the faulty round of each test, in which fault injection is enabled (Figure 2).

Figure 6 shows the number of tests that experienced failures, divided into API Error only, Assertion Failure only, and Assertion Failure(s), followed by an API Error. The figure shows the data divided with respect to the subsystem where the bug was injected (respectively in Nova, Cinder, and Neutron); moreover, Figure 6 shows the distribution across all fault injection tests. We can see that the cases in which the system does not exhibit a fail-stop behavior (i.e., the categories Assertion Failure only and Assertion Failure followed by an API Error) represent the majority of the failures.

Figure 7 shows a detailed perspective on the failures of assertion checks. Notice that the number of assertion failures is greater than the number of tests classified in the Assertion failure categories (i.e., Assertion Failure only and Assertion Failure followed by an API Error), since a test can generate multiple assertion failures. The most common case was an instance that was not active because the instance creation failed (i.e., it did not move into the ACTIVE state (OpenStack, 2018c)). In other cases, the instance could not be reached through the network or could not be attached to a volume, even if in the ACTIVE state. A further common case is the failure of the volume creation, but only the faults injected in the Cinder sub-system caused this assertion failure.

These cases point out that OpenStack lacks redundant checks to assure that the virtual resources are in the expected state after a service call (e.g., newly-created instances are active). Such redundant checks would assess the state of the virtual resources before and after a service invocation, and would raise an error if the state does not comply with the expectation (e.g., a new instance could not be activated). However, these redundant checks are seldom adopted, most likely due to the performance penalty they would incur, and because of the additional engineering effort needed to design and implement them. Nevertheless, the cloud management system is exposed to the risk that residual bugs can lead to non-fail-stop behaviors, where failures are notified with a delay or not notified at all. This makes it non-trivial to prevent data losses and to automate recovery actions.
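A redundant check of this kind could wrap a service call and verify the resulting resource state before reporting success. The sketch below is only an illustration of the idea: `create_instance` and `get_status` are hypothetical callables standing in for service operations (they are not OpenStack client functions), and the timeout and polling values are arbitrary.

```python
import time


class PostconditionError(RuntimeError):
    """Raised when a resource is not in the expected state after a call."""


def create_and_verify(create_instance, get_status, timeout=60.0, poll=1.0):
    """Create an instance, then fail loudly unless it becomes ACTIVE in time."""
    instance_id = create_instance()
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(instance_id)
        if status == "ACTIVE":
            return instance_id
        if status == "ERROR" or time.monotonic() >= deadline:
            raise PostconditionError(
                f"instance {instance_id} is {status}, expected ACTIVE")
        time.sleep(poll)
```

The polling loop makes the performance cost mentioned above visible: the call cannot return until the state has been confirmed, which trades latency for an explicit, fail-stop error.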

Figure 8 provides another perspective on API errors. It shows the number of tests in which each API returned an error, focusing on 15 out of 40 APIs that failed at least one time. The API with the highest number of API errors is the one for adding a volume to an instance (openstack server add volume), provided by the Cinder sub-system. This API generated errors even when faults were injected in Nova (instance management) and Neutron (virtual networking). This behavior means that the effects of fault injection propagated from other sub-systems to Cinder (e.g., if an instance is in an incorrect state, other APIs on that resource are also exposed to failures). On the one hand, this behavior is an opportunity for detecting failures, even if in a later stage. On the other hand, it also represents the possibility of a failure to spread across sub-systems, thus defeating fault containment and exacerbating the severity of the failure. We will analyze fault propagation in more detail in Section 4.3.

To understand the extent of non-fail-stop behaviors, we also analyze the period of time (latency) between the execution of the injected bug and the resulting API error. It is desirable that this latency is as low as possible. The longer the latency, the more difficult it is to relate an API error with its root cause (i.e., an API call invoked much earlier, on a different sub-system or virtual resource), and the more difficult it is to perform troubleshooting and recovery actions. To track the execution of the injected bug, we instrumented the injected code with logging statements to record the timestamp of its execution. If the injected code is executed several times before a failure (e.g., in the body of a loop), we conservatively consider the last timestamp. We consider separately the cases where the API error is preceded by assertion check failures (i.e., the injected bug was triggered by an API different from the one affected by the bug) from the cases without any assertion check failure (e.g., the API error arises from the same API affected by the injected bug).
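The latency measure described above can be sketched as follows, assuming the timestamps of the injected code and of the API error have already been extracted from the logs. The function name `api_error_latency` and its arguments are hypothetical.

```python
from datetime import datetime


def api_error_latency(injection_times, api_error_time):
    """Seconds between the last execution of the injected code and the API error.

    Using the last execution before the error yields the shortest,
    i.e. the conservative, latency estimate.
    """
    earlier = [t for t in injection_times if t <= api_error_time]
    if not earlier:
        raise ValueError("the injected code never ran before the API error")
    return (api_error_time - max(earlier)).total_seconds()
```

For example, if the injected code ran at 10:00:00 and 10:00:05 and the API error was raised at 10:02:05, the function reports 120 seconds rather than 125.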

Figure 9 shows the distributions of latency for API errors that occurred after assertion check failures, respectively for the injections in Nova, Cinder, and Neutron. Table 2 summarizes the average, the 50th, and the 90th percentiles of the latency distributions. We note that in the first category (API errors after assertion checks), all sub-systems exhibit a median API error latency longer than 100 seconds, with cases longer than several minutes. This latency should be considered too long for cloud services with high-availability SLAs (e.g., four nines or more (Bauer and Adams, 2012)), which can only afford a few minutes of monthly outage. This behavior points out that the API errors are due to a “reactive” behavior of OpenStack, which does not actively perform any redundant check on the integrity of virtual resources, but only reacts to the inconsistent state of the resources once they are requested in a later service invocation. Thus, OpenStack experiences a long API error latency when a bug leaves a virtual resource in an inconsistent state. This result suggests the need for improved error checking mechanisms inside OpenStack to prevent these failures.
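The downtime budget implied by an availability target can be worked out with simple arithmetic. The snippet below assumes a 30-day month; the function name is illustrative.

```python
def monthly_downtime_minutes(availability, days=30):
    """Maximum downtime per month (in minutes) allowed by an availability target."""
    return (1.0 - availability) * days * 24 * 60


budget = monthly_downtime_minutes(0.9999)  # "four nines"
print(f"{budget:.2f} minutes per 30-day month")
```

Under this assumption, four nines leave about 4.32 minutes of outage per month. If a 100-second detection delay counted entirely as downtime, a single failure would consume more than a third of that budget.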

In the case of failures that are notified by API errors without any preceding assertion check failure (the second category in Table 2), the latency of the API errors was relatively small, less than one second in the majority of cases. Nevertheless, there were a few cases with an API error latency higher than one minute. In particular, these cases happened when bugs were injected in Nova, but the API error was raised by a different sub-system (Cinder). In these cases, the high latency was caused by the propagation of the bug’s effects across different API calls. These cases are further discussed in § 4.3.

4.2. Is OpenStack able to log failures?

Since failures can be notified to the end-user with a long delay, or even not at all, it becomes important for system operators to get additional information to troubleshoot these failures. In particular, we here consider log messages produced by OpenStack sub-systems.

We computed the percentage (logging coverage) of failed tests which produced at least one high-severity log message (see also § 3.4). Table 4.2 provides the logging coverage for different subsets of failures, by dividing them with respect to the injected subsystem and to the type of failure. From these results, we can see that OpenStack logged at least one high-severity message (i.e., with severity level ERROR or CRITICAL) in most of the cases. The Cinder subsystem shows the best results since logging covered almost all of the failures caused by fault injection. However, in the case of Nova and Neutron, logs missed some of the failures. In particular, the failures without API errors (i.e., Assertion Failure only) exhibited the lowest logging coverage. This behavior can be problematic for recovery and troubleshooting since the failures without API errors lack an explicit error notification. These failures are also the ones in need of complementary sources of information, such as logs.
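The logging coverage metric can be computed as sketched below, assuming each failed test has already been labeled with its injected sub-system and with whether it produced a high-severity message. The function name and the input layout are hypothetical.

```python
from collections import defaultdict


def logging_coverage(failed_tests):
    """Percentage of failed tests with a high-severity log, per sub-system.

    `failed_tests` is an iterable of (subsystem, logged) pairs, where `logged`
    is True if the test produced at least one ERROR or CRITICAL message.
    """
    counts = defaultdict(lambda: [0, 0])  # subsystem -> [logged, total]
    for subsystem, logged in failed_tests:
        counts[subsystem][0] += int(logged)
        counts[subsystem][1] += 1
    return {name: 100.0 * logged / total
            for name, (logged, total) in counts.items()}
```

The same function can be applied to any other grouping key, such as the type of failure, by replacing the first element of each pair.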

To identify opportunities to improve logging in OpenStack, we analyzed the failures without any high-severity log, with respect to the bug types injected in these tests. We found that MISSING FUNCTION CALL and WRONG RETURN VALUE represent 70.7% of the bug types that lead to non-logged failures (43.9% and 26.8%, respectively). The WRONG RETURN VALUE faults are the easiest opportunity for improving logging and failure detection, since the callers of a function could perform additional checks on the returned value and record anomalies in the logs. For example, one of the injected bugs introduced a WRONG RETURN VALUE in calls to a database API called by the Nova sub-system to update the information linked to a new instance. The bug forced the function to return a None instance object. The bug caused an assertion check failure, but OpenStack did not log any high-severity message. By manually analyzing the logs, we could only find one suspicious message, with only WARNING severity and with little information about the problem, as this message was not related to database management.
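A caller-side check on the returned value, as suggested above, might look like the following. This is a sketch under assumptions and not Nova code: `db_update` is a hypothetical callable standing in for the database API, and `update_instance_checked` is an illustrative name.

```python
import logging

LOG = logging.getLogger(__name__)


def update_instance_checked(db_update, instance_id, values):
    """Call a database update and log an ERROR if it returns no instance.

    `db_update` is a hypothetical stand-in for the database API; it is
    expected to return the updated instance object.
    """
    instance = db_update(instance_id, values)
    if instance is None:
        LOG.error("database update for instance %s returned None", instance_id)
        raise RuntimeError(f"instance {instance_id} could not be updated")
    return instance
```

With such a check, the None return value in the example would produce both a high-severity log entry and an explicit error, instead of surfacing only through a later assertion failure.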

The non-logged failures caused by a MISSING FUNCTION CALL emphasize the need for redundant end-to-end checks to identify inconsistencies in the state of the virtual resources. For example, in one of these experiments, we injected a MISSING FUNCTION CALL in the LibvirtDriver class in the Nova subsystem, which allows OpenStack to interact with the libvirt virtualization APIs (libvirt, 2018). Because of the injected bug, the Nova driver omits to attach a volume to an instance, but the Nova sub-system does not check that the volume is indeed attached to the instance. End-to-end checks of this kind could be introduced at the service API interface of OpenStack (e.g., in nova-api) to test the availability of the virtual resources at the end of API service invocations (e.g., by pinging them).

Bibliography

Andrews et al. (2005) J.H. Andrews, L.C. Briand, and Y. Labiche. 2005. Is mutation an appropriate tool for testing experiments? In Proc. Intl. Conf. on Software Engineering. 402–411.
Arlat et al. (2002) Jean Arlat, J-C Fabre, and Manuel Rodríguez. 2002. Dependability of COTS microkernel-based systems. IEEE Transactions on Computers 51, 2 (2002), 138–163.
Bauer and Adams (2012) Eric Bauer and Randee Adams. 2012. Reliability and Availability of Cloud Computing (1st ed.). Wiley-IEEE Press.
Black Duck Software, Inc. (2018) Black Duck Software, Inc. 2018. The Open Stack Open Source Project on Open Hub. https://www.openhub.net/p/openstack
Candea and Fox (2003) George Candea and Armando Fox. 2003. Crash-Only Software. In Workshop on Hot Topics in Operating Systems (HotOS), Vol. 3. 67–72.
Carrozza et al. (2013) Gabriella Carrozza, Domenico Cotroneo, Roberto Natella, Roberto Pietrantuono, and Stefano Russo. 2013. Analysis and prediction of mandelbugs in an industrial software system. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation. IEEE, 262–271.
Cerveira et al. (2015) Frederico Cerveira, Raul Barbosa, Henrique Madeira, and Filipe Araujo. 2015. Recovery for Virtualized Environments. In Proc. EDCC. 25–36.