Challenges for Verifying and Validating Scientific Software in Computational Materials Science

Thomas Vogel; Stephan Druskat; Markus Scheidgen; Claudia Draxl; Lars Grunske

arXiv:1906.09179·cs.SE·June 24, 2019

TL;DR

This paper discusses the importance of quality assurance in scientific software for computational materials science, identifying key challenges and proposing future research directions to improve validation and verification processes.

Contribution

It formulates specific challenges in verifying and validating scientific software in CMS based on domain experience and suggests future research directions.

Findings

01. Identified key challenges in software validation for CMS

02. Outlined future research directions for quality assurance

03. Emphasized importance of trust in scientific software results

Abstract

Many fields of science rely on software systems to answer different research questions. For valid results researchers need to trust the results scientific software produces, and consequently quality assurance is of utmost importance. In this paper we are investigating the impact of quality assurance in the domain of computational materials science (CMS). Based on our experience in this domain we formulate challenges for validation and verification of scientific software and their results. Furthermore, we describe directions for future research that can potentially help dealing with these challenges.

Full text

Thomas Vogel1, Stephan Druskat1,3, Markus Scheidgen2, Claudia Draxl2, and Lars Grunske1

1Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany

3German Aerospace Center (DLR), Berlin, Germany

2Physics Department and IRIS Adlershof, Humboldt-Universität zu Berlin, Berlin, Germany

Index Terms:

Verification and Validation, Scientific Software, Computational Materials Science

I Introduction

Software has become an important driver for research in many scientific disciplines such as biology and physics [1]. Scientists often use software in experiments to produce evidence for the validity of their theories, and publish scientific papers based on this evidence [2]. However, in the worst case the validity of such a computational experiment – and thus of the (published) research results – may be jeopardized if the software producing the evidence is not of sufficient quality. Software that has bugs may produce wrong data, leading to erroneous evidence. Accordingly, scientific papers have been retracted in the past due to issues with software [3].

Consequently, software engineering principles are being increasingly adopted [4, 5, 6, 7, 8], and best practices for scientific software development processes have been proposed [9, 10]. At the same time, a clash of cultures between software engineers and domain scientists has been reported [11, 12].

In this context, validation and verification of scientific software are critical, as they establish trust in the software for it to perform the required calculations correctly. In this regard, inadequate behavior of scientific software is a threat to the validity of research results, and has consequently been a main subject of research [13]. To demonstrate the correctness of scientific software, testing is considered essential [14], and has been investigated for scientific software [2, 15, 16, 17, 18, 19], resulting in tools for testing scientific software [20, 21], the beneficial use of reference data for testing [22], and test-driven development methods [23, 24]. Despite these advances in testing scientific software, all approaches suffer from the oracle problem and large variability (i.e., a large configuration space and input domain) of the software under test [19]. Carver et al. [25, p. 554] faced the oracle problem in five case studies of computational science and engineering projects, and concluded: “Validation is problematic because it is often difficult, or even impossible, to establish the correct output or result a priori.” In contrast, testing from a software engineering perspective typically considers accurate oracles, that is, the expected output of the software under test is precisely known. This results in a binary oracle: The calculated output either does or does not match the expected output. This contradicts the nature of scientific software, where oracles are unknown or not precisely known. Moreover, the large variability of scientific software poses a challenge to standard testing tools from software engineering because of the large number of tests that are required to comprehensively test the software. Consequently, tests should be well chosen with the goal of allowing scientists to increase their trust in the software [26].

In this paper, we investigate the validation and verification of scientific software in computational materials science (CMS). CMS is concerned with the design and discovery of new materials using computational methods. Based on our experience in the CMS domain, we discuss corresponding challenges such as (i) the oracle problem, and (ii) large configuration spaces of CMS programs, called codes, taking the specifics of the domain into account. In the context of the development and use of the NOMAD [27] ecosystem of codes and data, we further discuss challenges related to (iii) large-scale, heterogeneous data, and (iv) global software development. Corresponding to these challenges, we proceed to discuss directions for future research on validating and verifying scientific software in CMS.

Throughout the paper, we take the perspective of a CMS scientist who runs calculations to design and analyze materials using a code such as exciting [28], ABINIT [29], or VASP (https://www.vasp.at/), or a data-analysis workflow in NOMAD. As the results of a calculation rely on the validity and correctness of the used code, our goal is to derive trust levels for codes from testing, so that the scientist can increase her trust in the code. This paves the way for trustworthy, reproducible calculations and research results.

II Computational Materials Science

The convergence of theoretical physics and chemistry, materials science and engineering, and computer science into computational materials science (CMS) enables the modeling of materials (both existing materials and those that can be created in the future) at the electronic and atomic level. This allows the accurate prediction of how these materials will behave at the microscopic and macroscopic levels, and of their suitability for specific research and commercial applications. CMS is characterized by a healthy, but heterogeneous ecosystem of many different CMS programs, called codes, developed by different research groups across the globe. These codes are highly domain-specific scientific software packages implementing various theoretical methods. They are executed in high performance computing centers, with millions of CPU hours spent every day, some of them at petascale performance, producing a large stock of equally heterogeneous CMS data.

The NOMAD Center of Excellence (https://nomad-coe.eu/; EU/Horizon 2020) aims to enable the CMS community to provide CMS data along the FAIR (findable, accessible, interoperable, and re-usable [30]) principles of data sharing. The NOMAD platform provides services that allow scientists to upload raw code inputs and outputs and to automatically convert data from all relevant codes into a code-independent normalized format. It further allows scientists at various levels of expertise to search, inspect, analyze, and visualize all data in this code-independent format. Currently, NOMAD supports over 40 codes, and stores more than 50 million results of complex calculations regarding properties of materials, including those of the largest US databases, provided by several hundred individual researchers and research groups. Its code-independent format uses a hierarchical data schema with over 400 common code-independent and almost 2,000 code-specific attributes.

The architecture of the NOMAD platform (see Fig. 1) consists of six major components:

The raw data files Repository where scientists upload, search, and download raw data.
Parsers and normalizers that convert raw data in a code-specific format to so-called Archive data whose format is code-independent.
The Archive data, that is, the normalized data that can be accessed through an API.
The Analytics Toolkit that allows scientists to apply machine learning techniques to CMS data.
The Encyclopaedia that aggregates calculations to provide a comprehensive and consistent collection of data for all materials.
The advanced visualization that uses 3D and virtual-reality techniques to visualize materials at an atomic level.

Following an hour-glass model, the most crucial part of NOMAD is the Archive of normalized data. All supported 40+ codes use a different format to represent input and output data for their individual simulations/calculations. Codes, data, and data formats differ in the following aspects. First, codes implement different methods, with varying computational parameters – and thus, numerical precision – and individual limitations and trade-offs. Second, codes focus on different aspects and produce different physical properties of a simulated material. For instance, a code may specialize in electrical, optical, or thermal properties. Third, data is provided in different unit systems (e.g., the International System of Units (SI) or atomic units). Fourth, although most codes use a text format that adheres to some community standards, quantities are presented in different orders, and matrices and vectors are laid out differently. Quantity values range from strings, dates, and simple numerical values to large vectors, matrices, and tensors of several GB, or even TB, in size. Data formats are not formalized, and documentation is often sparse. Data of individual calculations is often spread over multiple files. Relations between calculations may exist, as typically one calculation is based on another. However, such relations are not formalized and have to be deduced from common practices, for instance, a commonly used layout of directories.

To represent data in a normalized and homogeneous form, NOMAD defines an ontology-like data model that unifies all codes with a common schema. The schema is used to formalize, categorize, and document all codes, as well as code-common and code-specific quantities, in a single evolving model, called meta-info. It uses a proprietary schema language that specializes in describing physical quantities (e.g., with units and vector/matrix dimensions). meta-info is independent of distinct technical data formats, and the Archive data can be represented in different technical file formats. For example, NOMAD stores the archive data in HDF5, but the API supports access to the data via a JSON representation.

To convert raw CMS data to Archive data, NOMAD uses 40+ parsers (one per code) and several normalizers. Each set of code input/output data is parsed and then processed by all normalizers. Parsers reproduce all quantities found in the raw data in their respective meta-info form. Normalizers then compute derived properties, classify simulations, convert units, and relate data with other sources (e.g., external materials databases). In computer language terms, parsers and normalizers only work on a syntactical level; all semantics is added by other NOMAD and potential third-party services.
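To make the unit-conversion step of a normalizer concrete, the following is a minimal sketch and not NOMAD's actual implementation. It assumes that parsed quantities arrive as (value, unit) pairs and that SI is the target system; the table `TO_SI` and the functions `normalize_quantity` and `normalize_record` are hypothetical names introduced for illustration. The conversion factors are CODATA 2018 values.

```python
# Conversion factors to SI units (CODATA 2018 values).
HARTREE_TO_JOULE = 4.3597447222071e-18
EV_TO_JOULE = 1.602176634e-19
BOHR_TO_METER = 5.29177210903e-11
ANGSTROM_TO_METER = 1e-10

# Hypothetical lookup table: unit name -> factor to the corresponding SI unit.
TO_SI = {
    "hartree": HARTREE_TO_JOULE,
    "eV": EV_TO_JOULE,
    "joule": 1.0,
    "bohr": BOHR_TO_METER,
    "angstrom": ANGSTROM_TO_METER,
    "meter": 1.0,
}


def normalize_quantity(value, unit):
    """Convert a single value to SI; unknown units are rejected, not guessed."""
    if unit not in TO_SI:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_SI[unit]


def normalize_record(record):
    """Convert every (value, unit) pair of a parsed record to SI."""
    return {name: normalize_quantity(v, u) for name, (v, u) in record.items()}


if __name__ == "__main__":
    # Made-up record, as a parser might emit it for a code using atomic units.
    parsed = {"total_energy": (-1.5, "hartree"), "lattice_constant": (10.26, "bohr")}
    print(normalize_record(parsed))
```

Rejecting unknown units instead of passing values through unchanged is a deliberate choice in this sketch: a silently unconverted value is exactly the kind of pre-processing fault that would later be mistaken for a wrong calculation.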

One of the 40+ codes used in the context of NOMAD is exciting, a software package implementing density-functional theory (DFT) and many-body perturbation theory [28]. As suggested by its name, exciting has a major focus on calculating excited-state properties of materials.

III Problem Statement and Challenges

In this section, we first discuss the problem statement, including its relevance to scientific software with a focus on codes in computational materials science (CMS). We then proceed to detail challenges in verifying and validating such software.

III-A Problem Statement

To design and discover new materials, CMS scientists conduct computational experiments, in which large-scale, heterogeneous data is processed by data-analysis workflows including codes. While certain steps of a workflow are concerned with preparing data (e.g., parsers), the codes perform scientific calculations. For an experiment, scientists reuse existing codes and data, as well as develop new codes that produce new data. Subsequently, they combine all elements in a workflow for execution. The results of such experiments often provide the evidence that the scientists’ theories work, which is consequently the basis for scientific papers. A recent example from CMS is the work by Rodrigues Pela et al. [31] who use exciting to perform all required calculations.

In this context, there is a multitude of codes in CMS implementing a range of theoretical methods. Each of these methods relies on a set of computational parameters that govern the numerical precision of the respective implementation of the method. Choosing the best code, and optimal parameters that guarantee high precision is a non-trivial issue, which greatly influences the calculation results. Similar observations of configuration choices impacting research results are reported in bioinformatics [32]. This large configuration space results in high variability, which challenges CMS scientists in predicting how configuration choices impact the results.

Consider, for example, the basic input file for exciting in Listing 1. This input file is used for DFT-1/2 calculations (specifically, LDA-1/2) for silicon, in order to compute single-particle band gaps. It first defines the title (line 2) and the material structure, in this case silicon (lines 3–11). It also features several parameters, such as the presence of dfthalf (see line 19), which triggers the DFT-1/2 calculation rather than the default standard DFT. The DFT-1/2 method is configured by the parameters in lines 12–14. Certain parameters have to be determined variationally to obtain optimal results (e.g., cut) while others are constant (e.g., exponent is usually set to 8). For further details, see http://exciting-code.org/nitrogen-dft05.

This (extremely simple) example illustrates only a very small fraction of the complexity of configuring codes, as there exist configuration parameters that are mutable or constant, as well as dependencies between parameters (e.g., only if the dfthalf method is selected, it can be further configured by the parameters in dfthalfparam).
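The dependency between parameters mentioned above can be checked mechanically before a calculation is started. The following is a minimal sketch, assuming the configuration has already been read into a plain dictionary; this dictionary layout, the function `validate_config`, and the rule that `cut` must be positive are assumptions made for illustration and are not part of exciting. Only the names dfthalf, dfthalfparam, cut, and exponent are taken from the text.

```python
def validate_config(config):
    """Return a list of problems found in a configuration dictionary."""
    problems = []
    uses_dfthalf = bool(config.get("dfthalf", False))
    params = config.get("dfthalfparam")

    # Dependency from the text: dfthalfparam only applies if dfthalf is selected.
    if params is not None and not uses_dfthalf:
        problems.append("dfthalfparam is set although dfthalf is not selected")

    if uses_dfthalf and params is not None:
        # Assumed sanity rule (not from the source): a cut-off must be positive.
        if "cut" in params and params["cut"] <= 0:
            problems.append("cut must be positive")
        # The text notes that exponent is usually 8; other values are flagged.
        if params.get("exponent", 8) != 8:
            problems.append("exponent differs from the usual value 8")
    return problems


if __name__ == "__main__":
    print(validate_config({"dfthalf": True, "dfthalfparam": {"cut": 3.9, "exponent": 8}}))
    print(validate_config({"dfthalfparam": {"cut": 3.9}}))
```

Such a check cannot decide whether a value of cut is optimal, since that has to be determined variationally, but it can rule out structurally inconsistent configurations.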

Consequently, given the complexity of configuring codes for their use, and of initially developing such codes, scientists may generally ask themselves two questions:

(1) Is the theoretical method I have developed valid? Or, in the case of reusing an existing method: Have I appropriately selected and configured the theoretical method and, therefore, the code implementing this method?

(2) Is the code implementing this method correct?

While the first question concerns the validation of software (“Am I building the right product”), the second one addresses its verification (“Am I building the product right”, cf. [33]). Thus, validation is about the adequate (proper use), and verification about the correct (absence of bugs), functional behavior of software such as the codes in CMS. Hence, both validation and verification of CMS codes are required to obtain trustworthy results from calculations. Otherwise, the codes pose a potential major threat to the validity of the experiments and research results, as any inadequate or incorrect code refutes these results. However, validating and verifying codes in computational materials science poses challenges, which we will discuss in the next section.

III-B Challenges for Validating and Verifying CMS Codes

In the following, we discuss major challenges for the verification and validation of scientific software, which, based on our experience in (research) software engineering, are caused by the described factors: the experimental nature and complexity of codes, and the complexity of the data processed by them.

III-B1 Lack of Precise Oracles

As scientists use computational calculations to explore new ideas and theoretical methods, the outcome of a calculation is generally not known at all, or at least not precisely known a priori [25, 34, 4]. Other reasons for this are the complexity of the calculations, and the fact that the calculations may return a range of different answers, which makes it difficult for scientists to predict the outcome [19].

This causes uncertainty about calculation results, as there is no precise notion of their correctness. Consequently, the same is true for the software used to explore such new ideas and methods, which prevents precise oracles from being defined for quality assurance techniques such as software testing.

Consider, for example, a CMS scientist who performs one of the following calculations. She simulates existing materials with well-known properties on a new implementation (code) of an existing, altered, or completely new method, or she simulates unknown materials on an existing code of established methods, or even does so in bulk to explore the huge space of unknown potential materials. To be more specific, considering the example presented previously (see Listing 1), estimating the impact of varying input parameters such as cut on the results may be difficult. In all cases, the expected result of the calculation is not known and cannot be predicted a priori.

In contrast, the result obtained for a specific property of an input material is likely to be correct if the calculated property value is statistically similar to that of other well-known materials of the same class, assuming we can classify the material. Outliers, in turn, may indicate one of the following cases:

(i) Discovery of a highly interesting material. In this positive case, a material has been discovered, whose properties are different from existing materials of the same class.

(ii) Faulty theoretical method and/or parameters. In this negative case, a scientist either made a mistake when developing a new method, which manifests in its implementation (code), or used a nonsensical combination of method and parameters when using an existing method and code. Here, the code is, or may be, free of bugs.

(iii) A bug in the software (code). This is the other negative case, in which the code contains a bug that caused the faulty results. More specifically, the code is a faulty implementation of a valid theoretical method.

This perspective gives rise to developing statistical oracles, which judge the plausibility of computational results and provide corresponding feedback to scientists, which in turn establishes confidence in these results. Non-plausible results, which are expected to be rare, need to be inspected manually and classified according to the cases (i), (ii), and (iii). The use of such statistical oracles is conceivable in quality assurance techniques such as software testing. However, from a software engineering point of view, testing mainly focuses on precise oracles and assertions, so that state-of-the-art and state-of-the-practice testing approaches, or even test-driven development, cannot be directly applied here. Consequently, quality assurance techniques such as systematic testing known from software engineering [4, 10] are rarely adopted for scientific software [19]. Therefore, increasing the confidence of scientists in computational results requires quality assurance techniques which can be applied to scientific software packages a posteriori and in an automated manner [10].

This results in the following challenges for leveraging statistical oracles in testing of scientific software:

• Which methods and techniques shall be used to provide a statistical oracle?

• How can such methods reliably judge the success and potential failure of a set of executed tests?
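One conceivable starting point for such an oracle is a robust outlier test against known values of the same material class. The sketch below is an assumption for illustration, not a method proposed in the cited work: it uses the modified z-score based on the median absolute deviation (MAD) with the commonly used cut-off of 3.5, and the function name `plausibility_verdict` as well as the reference numbers are made up.

```python
from statistics import median


def plausibility_verdict(value, reference_values, threshold=3.5):
    """Classify a result as 'plausible' or 'inspect' using a robust z-score."""
    med = median(reference_values)
    mad = median(abs(v - med) for v in reference_values)
    if mad == 0:
        # All reference values coincide; anything else needs a manual look.
        return "plausible" if value == med else "inspect"
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    robust_z = 0.6745 * (value - med) / mad
    return "plausible" if abs(robust_z) <= threshold else "inspect"


if __name__ == "__main__":
    # Made-up property values for materials of one class.
    references = [1.10, 1.15, 1.12, 1.17, 1.13]
    print(plausibility_verdict(1.14, references))
    print(plausibility_verdict(3.00, references))
```

The verdict "inspect" deliberately does not say which of the cases (i), (ii), or (iii) applies; that classification remains a manual step.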

III-B2 Large Configuration Space

As discussed above, the experimental nature of scientific software (codes) typically results in a large configuration space. This comprises the selection of algorithms provided by a software, as well as fine-tuning the selected algorithm through parameters, which in turn results in high variability and a large number of options for executing the software [35, 36]. At the same time, the choice of configuration influences the calculation results [32].

An example from CMS is the calculation of single-particle band gaps, for which exciting can be customized to perform a calculation that is further configured by a set of parameters (see Section III-A). In the context of NOMAD, 40+ codes such as exciting are used, which multiplies the variability that CMS scientists have to cope with.

This variability challenges scientists to select and configure appropriate codes for calculations. As the selection and configuration of codes can greatly influence the calculation results, CMS scientists should be supported a priori and in an automated manner during this process. Such support should guide scientists in implementing a method to prevent the introduction of basic faults, before a calculation is conducted. It therefore promotes the validation of the configured methods/codes and of the conducted calculations. In CMS, such support may suggest to scientists the use of trusted codes and methods (including parameters) for specific materials and/or properties that are of interest for a specific calculation. For example, a recommendation may be to use an all-electron code and a self-interaction corrected exchange-correlation functional to properly account for electron-electron interactions for a heavy material like cerium.

Moreover, the large configuration space and the corresponding variability of codes also challenge the validation and verification of these codes through testing, in that it is infeasible to test all possible configurations. The large configuration space impedes manual identification of test cases and thereby of configurations to be tested (cf. [19, 36, 35]). Thus, an automated sampling of the configuration space to identify representative configurations to be tested is required. In general, this constitutes a combinatorial interaction testing problem [37], while a solution for this problem has to be tailored to the CMS domain. Consequently, coping with the large configuration space requires automated support for scientists in using codes, as well as intelligent testing techniques, which account for the following challenges:

• What are appropriate sampling strategies for selecting a subset of scientific computations (i.e., a combination of code, code configuration, and input data in CMS) that are likely to reveal a failure in a scientific software?

• How to exploit results of previous calculations and test runs of codes, to automatically determine the required support for scientists? Particularly, how to exploit statistical information from automated testing, to suggest methods and corresponding codes (including configuration parameters) to scientists for a specific calculation?
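To illustrate the combinatorial interaction testing view of the first question, the following sketch greedily selects configurations until every pair of values of any two parameters occurs at least once (pairwise coverage). It is a toy version under stated assumptions: the function `pairwise_sample` and the parameter names and values are hypothetical, and the sketch enumerates the full Cartesian product as candidate pool, which is only feasible for very small spaces and would have to be replaced by a scalable generator in practice.

```python
from itertools import combinations, product


def pairwise_sample(parameters):
    """Greedily pick configurations until all value pairs are covered."""
    names = sorted(parameters)
    # Every (param_a, value_a, param_b, value_b) combination to be covered.
    uncovered = {
        (a, va, b, vb)
        for a, b in combinations(names, 2)
        for va in parameters[a]
        for vb in parameters[b]
    }
    candidates = [
        dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))
    ]

    def gain(cfg):
        # Number of still-uncovered pairs this configuration would cover.
        return sum(
            (a, cfg[a], b, cfg[b]) in uncovered for a, b in combinations(names, 2)
        )

    chosen = []
    while uncovered:
        best = max(candidates, key=gain)
        chosen.append(best)
        for a, b in combinations(names, 2):
            uncovered.discard((a, best[a], b, best[b]))
    return chosen


if __name__ == "__main__":
    # Hypothetical configuration space with three discrete parameters.
    space = {"code": ["A", "B"], "functional": ["LDA", "PBE"], "kpoints": [4, 8]}
    for cfg in pairwise_sample(space):
        print(cfg)
```

For this small space the selection needs fewer configurations than the eight of the exhaustive product while still exercising every two-way interaction; whether two-way coverage is sufficient to reveal failures in CMS codes is exactly the open question raised above.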

III-B3 Large-Scale, Heterogeneous Data

Scientific software often processes large-scale, heterogeneous data, e.g., in climate research [38], and in CMS [27] where software operates on data up to several TB in size and encoded in different code-specific formats that are mostly neither formalized nor well-documented. Thus, calculations using results of multiple codes in NOMAD require pre- and post-processing steps to transform input/output data between the normalized Archive format (cf. Section II) and the code-specific formats, to integrate machine learning, or for visualization. For instance, parsers (one for each code-specific format) and normalizers are used to translate code-specific input/output data of a code for storage in the Archive and future use. Consequently, codes implementing theoretical methods are embedded in a workflow, together with programs implementing such pre- and post-processing steps. One workflow example is a machine learning approach applied to properties computed by multiple codes over many materials, to find predictors for a specific materials property. Here, the properties have to be computed, parsed, and normalized for the learning.

Consequently, validation and verification have to address the whole workflow. Otherwise, a faulty pre- or post-processing step might introduce faults into the data causing either wrong calculations by the (bug-free) codes, or wrong presentations and interpretations of the results by scientists. Hence, a fault might be located in any part of a data-analysis workflow (cf. NOMAD workflow in Section II).

Thus, the selection of test data (including the pre- and post-processing steps) for testing workflows is crucial. For instance, considering a workflow that classifies materials based on their electrical resistivity and conductivity, tests should cover calculation data from different codes implying different methods, unit systems, respective parser and normalizer chains, as well as representatives from different classes of materials (e.g., super-, semi-, and non-conductors).

This heterogeneity, together with the scale of the data, results in high variability at the data level (in addition to the variability of the codes discussed in the context of large configuration spaces), which challenges the validation and verification of data-analysis workflows:

• How to identify and sample valid/realistic test data for codes and workflows that likely reveal a failure?

• How to improve the quality of the pre-/post-processing steps that handle large-scale, heterogeneous data?

III-B4 Global Software Development

In CMS, scientists across the globe explore theoretical methods and develop codes. An ecosystem of several hundred scientists and research groups has emerged around NOMAD, fostering reuse of data for new calculations and for reproducing existing ones, reuse of codes in workflows, and development of new codes based on existing ones.

However, reuse is often kept implicit, e.g., for lack of common workflow descriptions [39]. For instance, relations between calculations do exist, but such relations often have to be deduced from common practices such as a commonly used layout of directories. For example, to derive elasticity properties for a material with exciting, a series of simulations with varying forces acting on the simulated material has to be performed [40]. Only an analysis of all these simulations allows scientists to derive the desired elastic constants. However, the intent behind the series of simulations is not always formalized. From the perspective of NOMAD, or data reuse in general, the parameter study’s relations between those simulations and their underlying intent have to be deduced. Even originally unrelated simulations from different codes could be used in a parameter study, provided one identifies respective data based on comparable methods and parameters.

Moreover, codes are sometimes not well-documented – with regard to any and all levels of documentation, e.g., requirements, system modeling, architectural design, maintenance guidelines, and user documentation – or no longer maintained, their data format may not be formalized, and the corresponding parser may only produce partial parses of the format. Finally, the quality of a code might be unknown or the quality might differ, depending on the degree to which quality assurance techniques such as testing are adopted. These aspects are caused by general issues of global software development concerning knowledge, project, and process management [41], and they challenge the validation and verification of codes that are (re)used by scientists other than the scientists developing the codes:

• How to validate and verify third-party code that is not well-documented, not sufficiently tested, and whose data format is not formalized? How to achieve trustworthy workflows that use data from different sources in different codes?

• How to extract and mine relations between calculations to leverage integration testing and to generally improve quality/trust levels of codes, for instance, by externalizing assurances obtained for reused codes?
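As a first, deliberately naive illustration of mining relations from a directory layout, the sketch below groups calculations that share a parent directory and reports groups with more than one member as candidate series (e.g., a parameter study). The layout `<study>/<calculation>/<input file>`, the file names, and the function `candidate_series` are assumptions for illustration; real uploads follow many different conventions.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def candidate_series(input_files):
    """Map each study directory to its calculations if it holds several."""
    series = defaultdict(set)
    for name in input_files:
        calc_dir = PurePosixPath(name).parent
        # Assumed layout: <study directory>/<calculation directory>/<input file>
        series[str(calc_dir.parent)].add(calc_dir.name)
    return {
        study: sorted(calcs) for study, calcs in series.items() if len(calcs) > 1
    }


if __name__ == "__main__":
    # Made-up upload with one strain series and one stand-alone calculation.
    files = [
        "si/elastic/strain_01/input.xml",
        "si/elastic/strain_02/input.xml",
        "si/bands/input.xml",
    ]
    print(candidate_series(files))
```

A grouping of this kind only yields candidates; the underlying intent of a series, such as deriving elastic constants, still has to be confirmed from the methods and parameters of the grouped calculations.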

IV Directions for Future Research

IV-1 Lack of Precise Oracles

Currently, the confidence about scientific research results in the CMS domain is addressed by scientific workflow systems using the notion of provenance, since all executions of codes and the corresponding input and output data are documented in NOMAD (https://metainfo.nomad-coe.eu/nomadmetainfo_public/archive.html). A recent community effort [42] has, for the first time, assessed and compared the quality of DFT results computed by several codes for a set of materials. More recently, the effect of computational parameters has been systematically assessed, involving four different codes [43]. The goal here is to automate the collection of workflow metadata to enable reproducibility of scientific results.

Based on the available NOMAD data, the next step will be to define a notion of a statistical oracle, which uses statistical methods for identifying the correctness of a computational calculation [44]. Unlike usual oracles used in software testing such as oracles derived from requirement specifications or models, gold standard oracles, or human oracles, the decision of a statistical oracle as envisioned is, by definition, not always correct. To apply statistical methods, results in the neighborhood of the computational calculation need to be investigated. Chan et al. [45] provide a general algorithm based on mesh specifications and machine learning for this problem. However, defining the neighborhood in CMS requires looking at the used materials, the codes including their parameters, and the computational environment. Furthermore, the selection of appropriate heuristics, which keep the oracle’s failure rate at a minimum, is an open problem that has been little researched in general [44] and needs to be tailored to data-driven CMS. Finally, if the neighborhood of a calculation is not available in NOMAD, specific computational calculations can be provided by mutation sensitivity testing [46] and modeling as well as approximation techniques of the input space [36].

Beyond this, it will be very interesting to apply the concept of metamorphic testing [47] that is specifically designed to test software without an oracle [48, 49]. The idea is to identify and refine a set of metamorphic relations between the software inputs and outputs. Just to give an abstract example, for a square root function sqrt(x) the relation x=sqrt(x)*sqrt(x) should hold under reasonable floating point accuracy assumptions. Identifying such relations is highly domain dependent. However, automatic techniques based on machine learning have been proposed [50, 44, 51, 52] and successfully applied to the bio-medical [53] and particle physics [54] domains. Transferring the concept of metamorphic testing requires domain expertise since the identified relations need to be understood and explained. The explainability of the relations is a primary challenge. However, it is also a significant opportunity for the CMS community since the scientists might learn hidden relations from their codes which were previously unknown. This may strengthen the understanding and help to refine the underlying theories.
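The square-root relation mentioned above can be turned into an executable metamorphic test. The following is a minimal sketch; the helper `check_sqrt_relation`, the sampled input range, and the tolerances are assumptions chosen for illustration. Note that no expected output value is ever needed: only the relation x = sqrt(x)*sqrt(x) is checked on randomly drawn inputs.

```python
import math
import random


def check_sqrt_relation(sqrt, samples=1000, rel_tol=1e-12, seed=0):
    """Return all sampled inputs x for which sqrt(x) * sqrt(x) != x."""
    rng = random.Random(seed)
    violations = []
    for _ in range(samples):
        x = rng.uniform(0.0, 1e6)
        root = sqrt(x)
        # Compare under a floating-point tolerance instead of exact equality.
        if not math.isclose(root * root, x, rel_tol=rel_tol, abs_tol=1e-12):
            violations.append(x)
    return violations


if __name__ == "__main__":
    print(len(check_sqrt_relation(math.sqrt)))       # correct implementation
    print(len(check_sqrt_relation(lambda x: x / 2)))  # deliberately faulty one
```

The same pattern carries over to a code under test once domain-specific relations are known; finding and explaining such relations for CMS codes is the actual research problem.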

IV-2 Large Configuration Space

Codes in CMS are used by selecting a desired method and by fine-tuning this method through a set of parameters. Thus, there is no single perfect implementation; rather, each code is a toolbox that can be instantiated in a huge number of variants. In sum, this leads to a combinatorial explosion of possible computations, and testing all of them is infeasible. Instead, appropriate sampling strategies are required that are both effective and efficient. Possible scenarios for first tests are, for instance, to stay within the same code family and vary the parameters of a given method, or to select the best possible (fully converged) calculations from different codes.

Future research on dealing with the large configuration space when verifying and validating CMS codes and calculations requires effective methods for configuration space sampling and automated test input generation. Concerning the sampling of suitable input data for generated test cases, one idea [35] is to apply combinatorial testing and test case selection techniques, which have been exploited in software product line (SPL) engineering [55, 56]. The goal of these techniques is to select a promising subset of product variants when testing all variants of an SPL is infeasible. However, SPL engineering focuses on testing interactions of features, each of which is either present in a product variant or not. In contrast, the configuration space of CMS software comprises a set of non-Boolean parameters, which demand different coverage metrics and sampling strategies. Furthermore, our hypothesis is that the size of the configuration space exceeds that of very large existing SPLs such as the Linux kernel. As a result, we have to enrich the sampling strategies with modeling techniques for the input space as proposed by Vilkomir et al. [36]. Another direction for future research is to use a recommender system exploiting statistical information obtained from existing computational calculations in the NOMAD Repository. Passing and non-passing test cases will be classified statistically to provide valuable information about the adequacy of code configurations. Our aim is to exploit this information to derive a recommender system that assists scientists in configuring codes for their specific needs.
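As an illustration of combinatorial sampling over non-Boolean parameters, the following sketch greedily selects configurations until every pair of parameter values occurs in at least one selected configuration (pairwise coverage). It is a simplification under stated assumptions: the parameter names and values are hypothetical, continuous parameters are assumed to be discretized beforehand, and the sketch enumerates all candidate configurations, which is only feasible for small spaces. A practical tool for CMS codes would have to avoid this enumeration.

```python
from itertools import combinations, product


def pairwise_sample(parameters):
    """Greedily select configurations covering every pair of parameter values."""
    names = sorted(parameters)
    # All value pairs of two different parameters that still need to be covered.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in parameters[a]
        for vb in parameters[b]
    }
    # Full enumeration of the configuration space (small examples only).
    candidates = [
        dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))
    ]

    def pairs_of(config):
        return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}

    selected = []
    while uncovered:
        # Pick the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        selected.append(best)
        uncovered -= pairs_of(best)
    return selected


# Hypothetical, discretized parameters of a CMS code.
parameters = {
    "functional": ["LDA", "PBE", "PBEsol"],
    "k_points": [4, 8, 12],
    "basis_cutoff": [200, 400, 600],
}
sample = pairwise_sample(parameters)
print(len(sample), "of", 3 * 3 * 3, "configurations selected")
```

Pairwise coverage is only one possible coverage metric. For numeric parameters, metrics that account for value ranges or convergence behavior may be more appropriate, which is part of the open research question described above.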

IV-3 Large-Scale, Heterogeneous Data

There are two aspects of the data diversity problem in CMS. First, we have different representations of the same information, for instance, different file formats, layouts of matrices, units, etc. Second, codes provide different kinds of information, for instance, codes specializing in electronic properties vs. codes specializing in thermal properties. The former problem can be solved by finding the right abstractions, the latter by defining relations between properties (e.g., identifying generalizations, categories of properties, or associations). Both aspects can be tackled by formal data models.

Modeling data has a long history in computer science, with different methods in different technical spaces [57] such as schemas for data exchange (XML, JSON), relational algebra in databases, ontologies in the semantic web, or formal grammars and meta-models in computer languages. Applications often require transforming data from a representation in one space to a representation in another (e.g., reading data from a database organized in tables and sending it over the internet in hierarchically nested JSON format). The scale of the data aggravates the problem since specialized technologies have to be combined. For instance, search engines, distributed computing platforms, and NoSQL databases have to work hand in hand. Each technology potentially requires its own specialized data representation. To cope with this, data must be modeled at a level that is independent of concrete technical spaces.

The CMS domain (or scientific software in general) presents a further challenge, since most existing methods for formally defining data types fall short as they neglect the nature of scientific data and offer no or insufficient support for vectors, matrices, tensors, their dimensions, and units. Therefore, NOMAD defines its own schema language meta-info [58] that is independent of the concrete data representation (e.g., text files, HDF5 files, or databases). In all its representations, data retains its inherent structure and types as defined in meta-info. Furthermore, meta-info categorizes properties into sections, and defines relations between properties and their categories. Some of the meta-info is common and shared by many codes, some definitions are code specific.

This formal model of CMS data can foster the quality of codes and workflows in several ways, e.g., by applying methods from model-based testing. First, a formal model can support generating realistic large-scale test data and asserting test coverage with respect to the input of codes. Second, it is a formal definition of the possible data space. Constraints defined at the meta-info level can be used to automatically assert the plausibility of calculated properties. Finally, it can automate the development of mappings between technical spaces (i.e., parsers and normalizers) by declaratively defining mappings, from which operational transformations are automatically derived. This avoids error-prone manual implementations of parsers and normalizers.
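The plausibility checks mentioned above can be sketched with a small declarative data model. The sketch does not reproduce the actual meta-info schema language of NOMAD. The class `PropertyDefinition`, its fields, the function `check_plausibility`, and the example properties are hypothetical and only illustrate how unit, shape, and value constraints defined at the model level could be asserted automatically.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PropertyDefinition:
    """Hypothetical definition of a property, independent of any file format."""
    name: str
    unit: str
    shape: tuple  # () for scalars, (3, 3) for a 3x3 matrix, and so on
    constraint: Optional[Callable[[float], bool]] = None


def shape_of(value):
    """Return the shape of a nested list of numbers (assumed rectangular)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def flatten(value):
    """Yield all numbers contained in a (possibly nested) list."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def check_plausibility(definition, value, unit):
    """Return a list of violations; an empty list means the value is plausible."""
    problems = []
    if unit != definition.unit:
        problems.append(f"{definition.name}: expected unit {definition.unit}, got {unit}")
    if shape_of(value) != definition.shape:
        problems.append(f"{definition.name}: expected shape {definition.shape}, got {shape_of(value)}")
    if definition.constraint is not None:
        bad = [x for x in flatten(value) if not definition.constraint(x)]
        if bad:
            problems.append(f"{definition.name}: implausible values {bad}")
    return problems


# Hypothetical definitions: a non-negative scalar and a 3x3 matrix.
band_gap = PropertyDefinition("band_gap", "eV", (), lambda x: x >= 0)
lattice_vectors = PropertyDefinition("lattice_vectors", "angstrom", (3, 3))

print(check_plausibility(band_gap, 1.1, "eV"))
print(check_plausibility(band_gap, -0.3, "eV"))
print(check_plausibility(lattice_vectors, [[1, 0, 0], [0, 1, 0]], "angstrom"))
```

Because the definitions are data rather than code, the same model could in principle also drive test data generation or the derivation of parsers, as outlined in the paragraph above.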

IV-4 Global Software Development

The challenges that global software development poses to the verification and validation of codes (re)used in CMS studies pertain mainly to two factors:

(i) the large and diverse development ecosystem, which produces codes that differ in quality (e.g., levels of documentation and testing);

(ii) the lack of explication of intent when combining multiple calculations and codes in workflows.

Efforts to consolidate the diversity of the ecosystem in terms of software quality will have to be implemented as community processes. Code development should adopt best practices of software engineering [10, 59]. These practices must be adapted to the needs of CMS, for instance, in regard to testing (cf. Section IV-1). Similar efforts have been made in astronomy (see https://eas.unige.ch//EWASS2017/session.jsp?id=SS16). Such efforts will ease the integration and testing of codes developed by other scientists in workflows.

Despite the use of workflow systems in CMS (cf. Section IV-1), metadata explicating the intent behind a parametrization and combination of calculations/codes within a single study is often missing. Therefore, any intent can only be deduced from potentially interrelated, non-formalized information such as directory structures. To explicate intent, future efforts should develop and apply requirements for formalized metadata, for instance, by using the Common Workflow Language [39], a specification for portable and scalable workflow descriptions with dedicated metadata. Similarly to efforts regarding the development ecosystem as such, this must be achieved through a standardization process within the CMS community. Additionally, automatic methods for the discovery of intentional process models based on Hidden Markov Models [60, 61] can be adapted to mine implicit relations between calculations/codes. These models can guide the integration testing of data analysis workflows.

V Conclusions

In this paper we discussed challenges for the validation and verification of scientific software in computational materials science (CMS). We conclude that most of the problems are similar to those in other domains [44, 23, 21, 14, 26, 16, 18, 62, 63] and that solution principles derived for the CMS domain might be generalizable to other domains. However, the effort of the CMS community to provide results of their computational experiments in the NOMAD Repository [27] based on the FAIR principles [30] provides a significant opportunity for fundamental research on validation and verification of scientific software. For instance, based on the NOMAD data, novel strategies to tackle the oracle problem can be developed. With this research, we envision trust levels for codes so that scientists can increase their trust in codes and obtain trustworthy, reproducible calculations and research results.

Bibliography

[1] J. C. Carver, N. P. C. Hong, and G. K. Thiruvathukal, Software Engineering for Science. CRC Press, 2016.
[2] R. Sanders and D. Kelly, “Dealing with risk in scientific software development,” IEEE Software, vol. 25, no. 4, pp. 21–28, 2008.
[3] G. Miller, “A scientist’s nightmare: Software problem leads to five retractions,” Science, vol. 314, no. 5807, pp. 1856–1857, 2006.
[4] J. E. Hannay, C. MacLeod, J. Singer, H. P. Langtangen, D. Pfahl, and G. Wilson, “How do scientists develop and use scientific software?” in ICSE Workshop on Software Engineering for Computational Science and Engineering, ser. SECSE. IEEE, 2009, pp. 1–8.
[5] L. Nguyen-Hoan, S. Flint, and R. Sankaranarayana, “A survey of scientific software development,” in International Symposium on Empirical Software Engineering and Measurement. ACM, 2010, pp. 12:1–12:10.
[6] D. Heaton and J. C. Carver, “Claims about the use of software engineering practices in science: A systematic literature review,” Information & Software Technology, vol. 67, pp. 207–219, 2015.
[7] T. Storer, “Bridging the chasm: A survey of software engineering practice in scientific programming,” ACM Comput. Surv., vol. 50, no. 4, pp. 47:1–47:32, 2017.
[8] A. N. Johanson and W. Hasselbring, “Software engineering for computational science: Past, present, future,” Computing in Science and Engineering, vol. 20, no. 2, pp. 90–109, 2018.
