Facilitating Rapid Prototyping in the OODIDA Data Analytics Platform via Active-Code Replacement

Gregor Ulm; Simon Smith; Adrian Nilsson; Emil Gustavsson; Mats Jirstrand

arXiv:1903.09477·cs.DC·January 1, 2021


TL;DR

This paper introduces an active-code replacement feature in the OODIDA platform, enabling rapid, on-the-fly updates to user-defined analytics modules, significantly reducing deployment time and facilitating rapid prototyping for data analysts.

Contribution

It presents a user-friendly approach for active-code replacement in OODIDA, allowing quick updates without system restarts, suitable for non-expert users and enhancing rapid prototyping capabilities.

Findings

1. Active-code replacement executes in less than a second.

2. It enables iterative testing of machine learning algorithms.

3. The feature improves system flexibility and reduces downtime.



1 Fraunhofer-Chalmers Research Centre for Industrial Mathematics, Chalmers Science Park, 412 88 Gothenburg, Sweden

2 Fraunhofer Center for Machine Learning, Chalmers Science Park, 412 88 Gothenburg, Sweden

Email: {gregor.ulm, simon.smith, adrian.nilsson, emil.gustavsson, mats.jirstrand}@fcc.chalmers.se

http://www.fcc.chalmers.se/

Facilitating Rapid Prototyping in the Distributed Data Analytics Platform OODIDA via Active-Code Replacement

The final authenticated version is available online at https://doi.org/10.1016/j.array.2020.100043.

Gregor Ulm (1, 2), ORCID 0000-0001-7848-4883

Simon Smith (1, 2), ORCID 0000-0001-8525-2474

Adrian Nilsson (1, 2), ORCID 0000-0002-8927-845X

Emil Gustavsson (1, 2), ORCID 0000-0002-1290-9989

Mats Jirstrand (1, 2), ORCID 0000-0002-6612-8037

Abstract

OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributed real-time analytics, targeting fleets of reference vehicles in the automotive industry. Its users are data analysts. The bulk of the data analytics tasks are performed by clients (on-board), while a central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. As this is potentially disruptive, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system; they can even be replaced between iterations of an ongoing assignment. This feature is referred to as active-code replacement. It facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly. Various safeguards are in place to ensure that custom code does not have harmful consequences, for instance by limiting the allowed types for return values or prohibiting importing of certain modules of the Python standard library. Consistency of results is achieved by majority vote, which prevents tainted state. Our evaluation shows that active-code replacement can be done in less than a second in an idealized setting whereas a standard deployment takes many orders of magnitude more time. The main contribution of this paper is the description of a relatively straightforward approach to active-code replacement that is very user-friendly. It enables a data analyst to quickly execute custom code on the cloud server as well as on client devices. Sensible safeguards and design decisions ensure that this feature can be used by non-specialists who are not familiar with the implementation of OODIDA in general or this feature in particular. As a consequence of adding the active-code replacement feature, OODIDA is now very well-suited for rapid prototyping.

Keywords:

Distributed computing, Concurrent computing, Distributed Data Processing, Hot Swapping, Code Replacement, Erlang

1 Introduction

OODIDA [ulm2019oodida] is a modular system for concurrent distributed data analytics, with a particular focus on the automotive domain. It processes in-vehicle data at its source instead of transferring all data over the network and processing it on a central server. A data analyst interacting with this system uses a Python library that assists in creating and validating assignment specifications which consist of two parts: on-board tasks carried out by the on-board unit (OBU) in a reference vehicle, and an off-board task that is executed on a central cloud server. Several domain-specific algorithms and methods of descriptive statistics have been implemented in OODIDA. However, updating this system is time-consuming and disruptive as it necessitates terminating and redeploying software. Instead, we would like to perform an update without terminating ongoing tasks. We have therefore extended our system with the ability to execute custom code, without having to redeploy any part of the installation. This enables users to define and execute custom computations both on client devices and the server. This is an example of a dynamic code update. With this feature, users of our system are able to carry out their work, which largely consists of either tweaking existing methods for data analytics or developing new ones, with much faster turnaround times, allowing them to reap the benefits of rapid prototyping.

In this paper, we describe the active-code replacement feature of OODIDA. We start off with relevant background information in Sect. 2, which includes a brief overview of our system. In Sect. 3 we cover the implementation of this feature, showing how Erlang/OTP and Python interact. We elaborate on the reasoning behind our design considerations, including deliberate limitations, and show how it enables rapid prototyping. Afterwards, we show a quantitative as well as a qualitative evaluation of active-code reloading in Sect. LABEL:eval, before we continue with related work in Sect. LABEL:related, plans for future work in Sect. LABEL:future, and end with the conclusion in Sect. LABEL:conclusion.

A condensed version of this paper has been previously published [ulm2019]. That paper presents a quick summary of the active-code replacement feature of the OODIDA platform. In contrast, this paper provides both more depth, covering various implementation details and extensive technical background, as well as increased breadth by giving a more thorough description of relevant parts of our system.

2 Background

In this section, we describe the relevant background of active-code replacement in the OODIDA platform. We start with a brief overview of OODIDA (Sect. 2.1), including a description of assignment specifications and the user front-end application, before we show how our system can be extended with new computational methods (Sect. 2.2). This leads to the motivating use case that describes on-the-fly updating of the system without taking any part of it down (Sect. 2.3).

2.1 OODIDA Overview

This subsection contains a condensed description of OODIDA, which is comprehensively described elsewhere [ulm2019oodida]. After a brief overview and an example, we highlight some technical details as well as the status quo ante of our system for prototyping before the addition of the active-code reloading feature.

2.1.1 Basic idea.

OODIDA is a platform for distributed real-time data analytics in the automotive domain, targeting a fleet of reference vehicles. It connects $m$ analysts to $n$ vehicles. The architecture diagram is shown in Fig. 1. Analysts use OODIDA for data analytics tasks by creating assignments, which are translated into tasks for connected vehicles. These vehicles contain an on-board unit (OBU) that is fit for general-purpose computing. Yet, OBUs are used merely for data analytics. They do not interfere with controlling any part of the vehicle and instead only read CAN bus data. Data analysts use OODIDA for executing various statistical methods and machine learning algorithms. After updating OODIDA in-house, the system can be deployed remotely, which makes new features available to all analysts. A particular focus of this system is on large-scale concurrency: analysts can issue a multitude of tasks to different subsets of clients that are all carried out concurrently. The bottleneck is the available hardware in the vehicles, but experimental results show that we can easily carry out dozens of typical analytics tasks concurrently [ulm2019oodida].

The problem our system solves is that data generation of a connected vehicle outpaces increases in bandwidth. There is simply too much data to transfer to a central server for processing. Instead, with OODIDA, data is primarily processed on clients and in real time, which leads to cost savings as transmission and storage costs can be greatly reduced. Furthermore, data analysts can get insights a lot faster, which is valuable for business.

2.1.2 Technical details

OODIDA is a distributed system that runs on three kinds of hardware: data analysts use workstations, the server application runs on an internal private cloud, and client applications are executed on OBUs. Our system can accommodate multiple users, but in order to simplify the presentation, we mainly focus on a single-user instance. In Fig. 2, the context of our system is shown, indicating that a data analyst uses a front-end $f$. In turn, $f$ is connected to a user module $u$ that communicates with the central cloud application $b$ (bridge). The workstation of the data analyst executes both $f$ and $u$. Node $b$ communicates with client nodes $\boldsymbol{c}$ on OBUs. Each $c$ interacts with an external application $a$. Data analysts use a front-end application $f$ to generate assignment specifications, which are consumed by $u$ and forwarded to $b$. On $b$, assignments are divided into tasks and forwarded to the chosen subset of clients. On client devices, external applications $\boldsymbol{a}$ perform analytics tasks, the results of which are sent to $b$, where optional off-board tasks are performed. Assignments can be executed concurrently.

Building on this more general view, Figs. 2 and 3 present further details of the underlying message-passing infrastructure, which has been implemented in Erlang/OTP. We start with the user node $u$, which is identical to $u$ in Fig. 2. The user defines an assignment specification with the help of $f$, which forwards it to $u$. In turn, $u$ forwards it via the network to $b$. That node spawns a temporary assignment handler $b^{\prime}$, which divides the assignment into tasks and distributes them to client devices. Our illustration shows three client nodes $x$, $y$ and $z$. Both the client node and its task handler are executed on an OBU, just like the external application $a$ shown in Fig. 3. Each client spawns a temporary task handler per received task. For instance, client node $x$ spawns task handler $x^{\prime}$. Task handlers communicate the task specification to the external application $a$, which performs the requested computational work. Once the results are available, they are picked up by the task handler and forwarded to the originating assignment handler $b^{\prime}$. Afterwards, the task handler on the client terminates. Once $b^{\prime}$ has received the results from all involved task handlers, it performs optional off-board computations, sends the results to $b$, and terminates. Finally, $b$ sends the assignment results to the user process $u$, which communicates them to $f$. The order in which the various nodes are involved when processing an assignment, including external applications on the client and cloud, is shown in Fig. 4. The specified on-board and off-board computations can be carried out in an arbitrary programming language as our system uses a language-independent JSON interface. However, we focus on a simplified version of OODIDA that only uses Python applications to execute both on-board and off-board tasks.
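The path from an assignment to per-client tasks and back can be illustrated with a short Python sketch. It only mirrors the data flow described above; the actual infrastructure is implemented in Erlang/OTP with message passing between processes. The function names, the dictionary keys, and the keywords mean and max are assumptions made for this illustration.

```python
from statistics import mean


def split_assignment(assignment, client_ids):
    """Role of the assignment handler b': create one task per selected client."""
    return [
        {"client": cid, "onboard": assignment["onboard"]}
        for cid in client_ids
    ]


def run_task(task, data):
    """Role of the external application a: perform the on-board computation."""
    if task["onboard"] == "mean":
        return mean(data)
    raise ValueError(f"unknown on-board keyword: {task['onboard']}")


def run_assignment(assignment, client_data):
    """Collect one result per client, then apply the optional off-board step."""
    tasks = split_assignment(assignment, sorted(client_data))
    results = [run_task(t, client_data[t["client"]]) for t in tasks]
    if assignment.get("offboard") == "max":
        return max(results)
    return results


# Three clients x, y and z with locally available data (made-up values).
fleet = {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "z": [7.0, 8.0, 9.0]}
per_client = run_assignment({"onboard": "mean"}, fleet)
overall = run_assignment({"onboard": "mean", "offboard": "max"}, fleet)
```

In this sketch the clients are processed sequentially, whereas the system described here handles tasks concurrently.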

2.1.3 Example Assignment.

Assignments consist of an on-board task, performed by each client that is contained in the selected subset of clients, and an off-board task, performed on the central cloud server. An example of an assignment specification is provided in Listing LABEL:assignment, which shows a relatively basic assignment and its definition as an object in Python. The provided example shows an instance of anomaly detection. The entire fleet of vehicles is monitored, with the goal of detecting whenever a vehicle exceeds a speed threshold value of 100. In order to do so, the user specifies the keyword collect, which collects values from the provided list of signals at a certain frequency for a total of $n$ times. In the given example, each client collects 36,000 samples at a frequency of 10 Hz. This means that each vehicle is monitored for a total of 60 minutes. In general, the off-board part of an assignment is relatively inexpensive. While it is possible to perform arbitrary computations on the server as well, it most commonly collects results from clients and forwards them to the user the assignment originated from. This happens in our example as well. In addition, the keyword iterations with the value 10 indicates that the cloud server will issue the on-board assignment sequentially ten times. Thus, the anomaly detection task will run for ten consecutive hours, with a summary report being sent to the user after each hour.

Lastly, both the on-board and off-board objects are combined into a Spec object for the assignment specification. The user is expected to name such an assignment as well as select a subset of clients, which can be done as a random selection of $c$ clients, a numerical selection based on client IDs, a selection based on the vehicle model, or, as in our case, as an assignment that is sent out to all clients. In practice, the keyword all is only relevant for some assignments because not all signals are available in all vehicles.
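To make the numbers in this example concrete, the following sketch expresses a comparable specification as plain Python dictionaries and derives the monitoring time from it. Only the keywords collect, iterations and all are taken from the text; the remaining key names, the signal name and the helper function are hypothetical and do not reflect the actual OODIDA library.

```python
# Hypothetical layout of an assignment specification as nested dictionaries.
onboard = {
    "method": "collect",      # keyword named in the text
    "signals": ["speed"],     # assumed signal name
    "samples": 36_000,        # n samples per iteration
    "frequency_hz": 10,       # sampling frequency
    "threshold": 100,         # speed threshold for the anomaly check
}
offboard = {"method": "collect", "iterations": 10}
spec = {
    "name": "speed_anomaly",  # assumed assignment name
    "clients": "all",         # keyword named in the text
    "onboard": onboard,
    "offboard": offboard,
}


def minutes_per_iteration(task):
    """Duration of one iteration: samples divided by frequency, in minutes."""
    return task["samples"] / task["frequency_hz"] / 60


per_iteration = minutes_per_iteration(spec["onboard"])
total_hours = per_iteration * spec["offboard"]["iterations"] / 60
```

With 36,000 samples at 10 Hz, one iteration lasts 3,600 seconds, and ten iterations add up to ten hours.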

2.1.4 Prototyping

In general, an assignment is a tuple of a chosen algorithm and its parameter values, a subset of signals, and the duration, which is determined by the number of samples and a frequency. This implies that the number of potential assignments is very large but has an upper bound as the number of combinations is finite. Given a reasonably large selection of algorithms to choose from, OODIDA is quite powerful. Yet, a data analyst using this system may want to also deploy novel algorithms. The previously mentioned Listing is a good example: a data analyst interacting with OODIDA may want to start an anomaly detection workflow by filtering out all vehicles that reach a speed of 100 km/h. Yet, this may not be fully sufficient to detect dangerous driving. Consequently, additional criteria may need to be met. For instance, one hypothesis could entail that dangerous driving means driving at a speed of at least 100 km/h for 25% of the time, but a competing hypothesis could be that driving is only dangerous if it is accompanied by sudden steering angle deviations, or by any steering angle deviation past some threshold. This hints at two needs: first, concurrently testing competing hypotheses and, second, executing computations that may not be definable with the standard methods that are available on our platform.

2.1.5 User Front-end Application

OODIDA provides a Python front-end application $f$ for the data analyst for easy creation and validation of assignment specifications. Assignment specifications, an example of which we just detailed, are ultimately turned into JSON objects. Because the manual creation of an assignment as a Python dictionary is error-prone, $f$ automatically verifies the correctness of the provided values. Checks include completeness and correctness of the provided dictionary keys, type-checking the corresponding dictionary values, and verifying that their range is valid. For instance, the value for the field frequency has to be a positive integer and cannot exceed a certain threshold. Separately verifying assignment specifications is necessitated by the rudimentary type system of Python. In programming languages with a more expressive type system, e.g. Hindley-Milner type inference or Martin-Löf dependent types, some of those checks could be performed by the compiler. If the validation is successful, the configuration dictionary is converted into a JSON object and sent to the user process $u$, which is also executed on the workstation of the data analyst.
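The three kinds of checks mentioned above (keys, types, ranges) could look roughly as follows. This is a minimal sketch under stated assumptions: the set of required keys, the upper bound MAX_FREQUENCY and the function name validate_task are hypothetical and only serve to illustrate the idea.

```python
import json

# Assumed schema: required keys and their expected Python types.
REQUIRED_KEYS = {"signals": list, "samples": int, "frequency": int}
MAX_FREQUENCY = 100  # hypothetical upper bound in Hz


def validate_task(task):
    """Return a list of problems; an empty list means the task is valid."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS.keys() - task.keys())]
    problems += [f"unknown key: {k}" for k in sorted(task.keys() - REQUIRED_KEYS.keys())]
    for key, expected in REQUIRED_KEYS.items():
        if key not in task:
            continue
        value = task[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{key}: expected {expected.__name__}")
    freq = task.get("frequency")
    if isinstance(freq, int) and not isinstance(freq, bool):
        if not 1 <= freq <= MAX_FREQUENCY:
            problems.append(f"frequency: must be between 1 and {MAX_FREQUENCY}")
    return problems


good = {"signals": ["speed"], "samples": 36_000, "frequency": 10}
bad = {"signals": "speed", "frequency": 0}

# Only a valid task is serialized to JSON for the user process.
payload = json.dumps(good) if not validate_task(good) else None
```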

2.2 Extending OODIDA

As OODIDA has been designed for rapid prototyping, there is a frequent need to extend it with new computational methods, both for on-board and off-board processing. In Fig. 4 a simplified representation of the workflow of OODIDA is given. In short, to extend the system, the worker nodes have to be updated. These are the applications, mostly implemented in Python, that perform on-board and off-board computations. They interact with an arbitrary number of assignment handlers (off-board) and task handlers (on-board). In order to update OODIDA with new computational methods, the system has to be modified. For the user, the only visible change is a new keyword and some associated parameters, if needed. Assuming that we update both the on-board and off-board application, the following steps are required:

• Update user front-end $f$ to recognize the new off-board and on-board keywords

• Add checks of necessary assignment parameter values to $f$

• Add new keyword and associated methods to cloud application worker

• Add new keyword and associated methods to client application workers

• Terminate all currently ongoing assignments

• Shut down OODIDA on the cloud and all clients

• Redeploy OODIDA

• Restart OODIDA

Unfortunately, this is a potentially disruptive procedure, not even taking into account potentially long-winded software development processes in large organizations. OODIDA has been designed with rapid prototyping in mind, but although it can be very quickly deployed and restarted, the original version cannot be extended while it is up and running. This was possible with ffl-erl [ulm2019b], a precursor that was fully implemented in Erlang, which allows so-called hot-code reloading. There are some workarounds to keep OODIDA up-to-date, for instance by automatically redeploying it once a day. As we are targeting a comparatively small fleet of reference vehicles, this is a manageable inconvenience. Yet, users interacting with our system would reap the benefits of a much faster turnaround time if they were able to add computational methods without restarting any of the nodes at all.

2.3 Motivating Use Cases

While the previous implementation of OODIDA works very well for issuing standard assignments, there are some limitations. The biggest one is that adding algorithms requires updating the worker node on clients or the cloud (cf. Fig. 4). This causes all currently ongoing tasks to be terminated and therefore disincentivizes experimentation. Some tasks may have a runtime of hours, after all. Furthermore, it may not be desirable to permanently add an experimental algorithm to the library on the client. This entails that experimental execution requires two updates: first to push the code update, and afterwards to restore the state before the update.

The first exemplary use case we consider consists of temporarily adding an algorithm to the external client application. If that algorithm proves to be useful, it can be added to all clients via an update of the client software. Otherwise, no particular steps have to be taken as the custom piece of code on the client can be easily deleted or replaced by new custom code. A second, and related use case, consists of temporarily adding different algorithms to non-overlapping subsets of client devices, e.g. running two variations of an algorithm, with the goal of evaluating them. Thus, actionable insights can be generated at a much faster pace than the procedure outlined in Sect. 2.2 would allow.
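The second use case relies on splitting the clients into non-overlapping groups. A minimal sketch of such a split is given here; the function name split_clients, the fixed seed and the even split into two halves are assumptions made for illustration only.

```python
import random


def split_clients(client_ids, seed=0):
    """Randomly partition client IDs into two non-overlapping groups."""
    shuffled = list(client_ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Ten hypothetical client IDs, split into group A and group B.
group_a, group_b = split_clients(range(1, 11))
```

Each group would then receive its own variant of the algorithm through a separate deployment, and the results of both groups can be compared afterwards.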

Lastly, there is the issue of extensibility. Python is a mainstream programming language with a rich ecosystem. There are very comprehensive external libraries available, which are useful for OODIDA, such as the machine learning libraries Keras [chollet2015keras] and scikit-learn [pedregosa2011scikit]. As those are vast projects, it is infeasible to create hooks for an entire library. Yet, it is occasionally useful to call a function of those libraries, in which case a data analyst can define a custom-code module that loads the external library and calls that function. Consequently, active-code replacement provides an easy way of quickly accessing the functionality of third-party libraries.

It is also important to keep in mind standard data analytics workflows: Very commonly, analysts write glue code that uses existing libraries. For instance, before running a method with user-selected parameters, data may need to be preprocessed (e.g. consider the sklearn.preprocessing package). This kind of task is commonly expressed in short scripts. With the feature described in this paper, it is possible to deploy such code to a client. It is very helpful to be able to execute custom preprocessing routines on client devices as this enables new use cases, based on the assumption that it is not feasible to collect data from a large number of client devices in real time, due to its volume. Another very important task for data analytics workflows is algorithmic exploration that goes beyond merely tuning parameters of algorithms. Instead, this may mean modifying the source code of an existing algorithm or executing algorithms that were written by the analysts themselves. In either case, the code that needs to be deployed tends to be relatively short. As these examples show, deploying and executing custom code, as opposed to fully redeploying the client software installation, leads to a much faster turnaround time. It also enables an entirely different way of working, as the ability to quickly deploy custom code heavily encourages experimentation.

3 Solution

In this section we describe our engineering solution to the problem of replacing active code in our system. We start with the assignment specification the data analyst produces (Sect. 3.1). Afterwards, we focus on the underlying mechanisms for getting a custom piece of code from the data analyst to the cloud as well as client devices (Sect. 3.2). This is followed by discussing implementation details that make it possible to keep devices running while replacing a piece of code (Sect. 3.3), followed by our approach to ensuring consistency of results, based on the fact that not all clients may be updated at the exact same time (Sect. 3.4). Then we highlight security considerations (Sect. 3.5). Afterwards we show how a complex use case can be implemented with active-code replacement (Sect. 3.6) and discuss deliberate limitations of our solution (Sect. LABEL:sol6).

3.1 Using Custom Code in an Assignment

In line with the guiding principle that OODIDA should make it as easy as possible for the data analyst to do their job, active-code replacement has been designed to minimize the need for interventions. The data analyst only has to carry out two steps. The first is providing a stand-alone Python module with the custom code. It could include imports, which the user has to ensure are available on the target OBUs. The only requirement for the structure of the custom code module is that it contains a function custom_code as an entry point, which takes exactly one argument. This is the function that is called on the cloud or client. Additional parameters have to be hard-coded. Before custom code can be called, it needs to be deployed. To do so, the user needs to specify the location of the code file on their machine and afterwards call the function deploy_code, which takes as arguments the target, onboard or offboard, the location of the file and optionally a specification of the intended clients, which is ultimately a list of client IDs. In Listing LABEL:lst:custom we call a helper function to retrieve the IDs of all vehicles of a particular type. It is possible to send different modules to non-overlapping subsets of clients via subsequent assignments.
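As an illustration of these requirements, a custom module could be as small as the following sketch. It encodes one of the hypotheses from Sect. 2.1.4, namely driving at a speed of at least 100 km/h for 25% of the time. Only the entry point name custom_code and its single argument are prescribed by the text; the constants and the choice of returning 1 or 0 are assumptions.

```python
# Parameters are hard-coded because the entry point takes exactly one argument.
SPEED_LIMIT = 100    # km/h
MIN_FRACTION = 0.25  # share of samples at or above the limit


def custom_code(values):
    """Return 1 if at least 25% of the speed samples reach the limit, else 0."""
    if not values:
        return 0
    fast = sum(1 for v in values if v >= SPEED_LIMIT)
    return int(fast / len(values) >= MIN_FRACTION)
```

Changing the hypothesis, for example to a different threshold, then only requires editing this file and deploying it again.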

The verification process of the user front-end application consists of two steps. First, the provided module has to be syntactically correct, which is checked by loading it in Python. The second check targets the prescribed function custom_code. That function is called with the expected input format, depending on whether it is called on the client or the cloud. We also verify that the returned values are of the expected type. If any of these assertions fail, the assignment is discarded. Otherwise, the custom Python module is sent to the cloud or to clients, depending on the provided instructions. Before it is sent, another assignment specification is produced that, in either case, contains the entries user_id and custom_code. The value of the latter is an encoding of the user-provided Python module. The value of the key mode is either deploy_offboard or deploy_onboard. Once custom code has been deployed, it can be referred to in assignments by setting the value of the keys onboard or offboard to custom.
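The verification and packaging steps described above might be sketched as follows. The keys user_id, custom_code and mode as well as the values deploy_onboard and deploy_offboard come from the text. Everything else is an assumption: the text does not name the encoding, so base64 is used here as a stand-in, and the helper names and the fixed test input are hypothetical.

```python
import base64
import inspect
import json
import numbers

EXAMPLE_SOURCE = "def custom_code(values):\n    return max(values)\n"


def load_custom_function(source):
    """Load the module source (raises SyntaxError if invalid) and return the entry point."""
    namespace = {}
    exec(compile(source, "<custom_code>", "exec"), namespace)
    func = namespace.get("custom_code")
    if not callable(func):
        raise ValueError("module must define a function custom_code")
    if len(inspect.signature(func).parameters) != 1:
        raise ValueError("custom_code must take exactly one argument")
    return func


def is_number(x):
    return isinstance(x, numbers.Real) and not isinstance(x, bool)


def has_valid_type(result):
    """Accept a single number or a list of numbers."""
    return is_number(result) or (
        isinstance(result, list) and all(is_number(x) for x in result)
    )


def build_deployment(source, user_id, target):
    """Verify the module, then wrap it into a JSON deployment specification."""
    func = load_custom_function(source)
    if not has_valid_type(func([1.0, 2.0, 3.0])):  # assumed sample input
        raise ValueError("custom_code returned a value of unexpected type")
    encoded = base64.b64encode(source.encode("utf-8")).decode("ascii")
    return json.dumps(
        {"user_id": user_id, "mode": f"deploy_{target}", "custom_code": encoded}
    )


message = build_deployment(EXAMPLE_SOURCE, user_id=7, target="onboard")
```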

3.2 Code Forwarding

Assuming the provided custom code for the client has passed the verification stage, it is turned into a JSON object and ingested by the user module for further processing. Within that JSON object, the user-defined code is represented as an encoded text string. The user module extracts all relevant values from the provided JSON object and forwards it to the cloud process $b$. In turn, $b$ spawns a new assignment handler $b^{\prime}$ for this particular assignment. The next step depends on whether custom code for the server or client devices has been provided.

The process of turning an assignment into tasks for client devices does not depend on the provided values and is thus unchanged from the description in the paper on OODIDA [ulm2019oodida] or the brief summary presented earlier in this paper. Node $b^{\prime}$ breaks the assignment specification down into tasks for all clients specified in the assignment. After this is done, task specifications are sent to the designated client processes. Each client process spawns a task handler for the current task. Its purpose is to monitor task completion, which relieves the edge process of that burden and enables it to process further task specifications concurrently. In our case, the task handler sends the task specification in JSON to an external Python application, which turns the given code into a file, thus recreating the Python module the data analyst initially provided. The name of the resulting file also contains the ID of the user who provided it. After the task handler is done, it notifies the assignment handler and terminates. Similarly, once the assignment handler has received responses from all task handlers, it sends a status message to the cloud node and terminates. The cloud node sends a status message to inform the user that the custom code has been successfully deployed. Deploying custom code to the cloud is similar, the main difference being that $b^{\prime}$ communicates with the external Python worker application running on the cloud.
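On the receiving side, the external Python application has to turn the encoded string back into a module file. The sketch below shows one way this could be done. The base64 encoding, the file naming scheme custom_<user_id>.py and the function name store_custom_module are assumptions; the text only states that the file name contains the user ID.

```python
import base64
import json
import tempfile
from pathlib import Path


def store_custom_module(task_json, directory):
    """Decode the custom code in a task specification and write it to a file."""
    task = json.loads(task_json)
    source = base64.b64decode(task["custom_code"]).decode("utf-8")
    # The user ID is part of the file name, so modules of different users
    # do not overwrite each other.
    path = Path(directory) / f"custom_{task['user_id']}.py"
    path.write_text(source, encoding="utf-8")
    return path


# A made-up task specification as it might arrive from a task handler.
original = "def custom_code(values):\n    return sum(values)\n"
task_json = json.dumps(
    {
        "user_id": 7,
        "mode": "deploy_onboard",
        "custom_code": base64.b64encode(original.encode("utf-8")).decode("ascii"),
    }
)
module_path = store_custom_module(task_json, tempfile.mkdtemp())
```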

3.3 Code Replacement

Computations are performed only after the specified amount of data has been gathered. This implies that a custom code module can be safely replaced as long as data collection is ongoing. The case where an update collides with a function call to custom code is discussed in Sect. 3.4. If a custom on-board or off-board computation is triggered by the keyword custom, Python loads the user-provided module using the function reload from the standard library. This happens in a separate process using the multiprocessing library. The motivation behind this choice is to enable concurrency in the client application as well as to avoid some technical issues with reloading in Python, which would retain definitions from a previously used custom module. Instead, our approach creates a blank slate for each reload.

The user-specified module is located at a predefined path, which is known to the reload function. Once loaded, the custom function is applied to the available data in the final aggregation step, which is performed once and at the end of a task or assignment. When an assignment using a module with custom code is active, the external applications reload the custom module with each iteration. This may be unexpected, but it leads to greater flexibility. Consider an assignment that runs for an indefinite number of iterations. As the external applications can process tasks concurrently, and code replacement is just another task, the data analyst can, for instance, react to intermediate results by deploying custom code, with modified algorithmic parameters, that is used in an ongoing assignment as soon as it becomes available. As custom code is tied to a unique user ID, there is furthermore no interference from custom code deployed by other users.

One theoretical issue with our approach is that modules may be reloaded repeatedly, which is inefficient. Yet, OODIDA was not designed with the idea of running arbitrary libraries on the client; instead, it has a strict focus on distributed data analytics. This entails that external libraries do not pose a problem as bread-and-butter libraries such as scikit-learn and Keras are loaded already when the client application is started. Thus, these modules are available to a custom module and should not be imported again by custom code; such imports are reported by the user-side validator. On top, the user does not have the ability to deploy additional libraries by themselves. Instead, they can only access the Python standard library, with some limitations, and a small set of third-party libraries. Consequently, code that is imported with each iteration tends to be small and does not depend on additional external libraries.
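The effect of loading the custom module from a blank slate in each iteration can be demonstrated with a small sketch. Note that it does not reproduce the mechanism described above: instead of reload and the multiprocessing library, it starts a fresh Python interpreter via subprocess, which is a swapped-in technique chosen here because it is easy to run as a stand-alone example. The runner script, the file name and the function run_custom are assumptions.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Script executed in the fresh interpreter: load the module from its path,
# apply custom_code to the data read from stdin, and print the result as JSON.
RUNNER = """
import importlib.util, json, sys
spec = importlib.util.spec_from_file_location("custom", sys.argv[1])
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(json.dumps(module.custom_code(json.loads(sys.stdin.read()))))
"""


def run_custom(module_path, data):
    """Run custom_code from module_path in a separate, freshly started process."""
    completed = subprocess.run(
        # -B prevents bytecode caching, so a replaced file is always recompiled.
        [sys.executable, "-B", "-c", RUNNER, str(module_path)],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


module_path = Path(tempfile.mkdtemp()) / "custom_7.py"

# Iteration 1 uses the first version of the module.
module_path.write_text("def custom_code(values):\n    return max(values)\n")
first = run_custom(module_path, [3, 1, 2])

# The module is replaced between iterations; iteration 2 picks up the new code.
module_path.write_text("def custom_code(values):\n    return min(values) * 10\n")
second = run_custom(module_path, [3, 1, 2])
```

Because every call starts from an empty interpreter state, no definitions of the first version survive into the second iteration.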

3.4 Ensuring Consistency

Inconsistent updates are a problem in practice, i.e. results sent from clients may have been produced with different custom code modules in the same iteration of an assignment. This happens if not all clients receive the updated custom code before the end of the current iteration. In a streaming context, where clients have the ability to peek into results to get intermediate updates, the same issue could emerge, namely that clients use different versions of custom code for their computations. To solve this problem, each provided module with custom code is tagged with its md5 hash signature. This signature is reported together with the results from the clients. The cloud only uses the results tagged with the signature that achieves a majority and discards all others. Consequently, results are never tainted by using different versions of custom code in the same iteration. The expectation is that any updated custom code would eventually, and quickly, reach a majority. An update may not succeed for various reasons. If it is because a client has become unavailable, then said client cannot send any results anyway. Should an update not succeed, then the client reports an error. In that case, the update has to be repeated.
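The majority rule can be expressed compactly in Python. The sketch below tags each result with the MD5 signature of the code that produced it and keeps only the results of the winning signature. The function names and the data layout are hypothetical. How a tie or a missing majority is handled is not specified in the text; discarding all results in that case is an assumption of this sketch.

```python
import hashlib
from collections import Counter


def signature(source):
    """MD5 hash of the custom code module, used as a version tag."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def majority_results(reports):
    """Keep only results whose signature is shared by more than half of the reports.

    reports is a list of (signature, result) pairs, one per client.
    """
    if not reports:
        return None, []
    counts = Counter(sig for sig, _ in reports)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(reports):
        return None, []  # assumption: without a strict majority nothing is used
    return winner, [result for sig, result in reports if sig == winner]


old = signature("def custom_code(values):\n    return max(values)\n")
new = signature("def custom_code(values):\n    return min(values)\n")

# Two clients already run the new module, one still reports with the old one.
winner, kept = majority_results([(new, 1.0), (old, 9.0), (new, 2.0)])
```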

It is possible that a new custom code version arrives at the same time the client application wants to load it. This is one example where the standard approach would be to roll back the update and instead use the previous custom code version. However, this scenario is less of a concern for us as we replace computational methods instead of system-level software. There is deliberately no mechanism for a rollback: the old version of the custom code was supposed to be replaced by new custom code, which implies that any results generated by the old code instead of the not-yet-available new one are no longer of interest to the user. The only exception where a rollback would be helpful is the pathological case where the old and new versions of the provided custom code are identical, and the update with the new custom code failed. Consequently, we deliberately let this one computation fail and have the client report an error. For the next iteration, the new version of the custom code can be expected to be available, which means that this issue resolves itself quickly in practice.

3.5 Security Measures

The design of our system addresses internal and external threats, whether accidental or deliberate. To this end, we limit the expressiveness of the code the user can deploy. In addition, OODIDA is designed to run on a corporate VPN.

First, in order to limit the damage the user can do, we enforce that the provided custom function takes a list of numerical values as input and returns either a list of numerical values or a single numerical value. This property is verified by the user front-end application before a piece of custom code is dispatched for further processing. There are corresponding assertions on the on-board and off-board nodes, as it is theoretically possible to manually specify custom code that sidesteps the checks in the front-end application. However, this would require knowledge of the implementation of those checks, some of which use randomly generated inputs and others static ones. The user would have to reliably predict the inputs of the test cases and cover them with branching logic that is triggered in those particular cases, but not otherwise. Guessing randomized inputs is arguably not feasible, so we consider it highly unlikely that an antagonistic developer would be able to work around the checks in the user front-end. However, even with perfect knowledge of the implementation and the seed used for generating random data, an antagonistic user would be thwarted. The reason is that we programmatically ensure proper behavior of the custom function that is executed on the client via a try-except construct for error handling. This step also includes a verification that the returned values are indeed as specified. This may seem like duplicate work, as the check is already performed in the front-end prior to deployment, but it addresses the case of an omniscient antagonistic user and thus closes the previously mentioned loophole.

A caveat is that this approach relies on the provided function terminating. Because of the halting problem, this cannot be decided in general: an antagonistic developer could write a function that never returns, thus wasting computational resources of the client. Some legitimate computations could take a significant amount of time, so it is not possible to distinguish between legitimate and antagonistic code on that metric alone. However, with a generous timeout, which is enabled via Python’s multiprocessing library, a custom function that exceeds a set threshold can easily be terminated, regardless of whether the code is antagonistic or not. Therefore, we tackle the problem of antagonistic input sufficiently well from a practical perspective. It also has to be kept in mind that our system supports commercial users, who can be assumed to be invested in the system being fully operational.
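The client-side safeguards described above (a try-except construct, a check of the returned values, and a timeout based on the multiprocessing library) could be combined roughly as in the following sketch. All names (is_valid_output, guarded_call, run_with_timeout) are hypothetical and the structure is an assumption rather than the OODIDA implementation. Note that, depending on the platform's process start method, the custom function may need to be a picklable, module-level function.

```python
import multiprocessing as mp
from numbers import Real


def is_valid_output(value):
    """Accept a single number or a list of numbers (booleans are rejected)."""
    def is_num(x):
        return isinstance(x, Real) and not isinstance(x, bool)
    if isinstance(value, list):
        return all(is_num(x) for x in value)
    return is_num(value)


def guarded_call(func, values):
    """Run func(values); return ("ok", result) or ("error", reason)."""
    try:
        result = func(values)
    except Exception as exc:  # custom code must never crash the client
        return ("error", type(exc).__name__)
    if not is_valid_output(result):
        return ("error", "invalid return type")
    return ("ok", result)


def _worker(func, values, conn):
    # Runs in the child process and sends the outcome back to the parent.
    conn.send(guarded_call(func, values))
    conn.close()


def run_with_timeout(func, values, timeout_s):
    """Run the guarded call in a separate process; terminate it on timeout."""
    receiver, sender = mp.Pipe(duplex=False)
    proc = mp.Process(target=_worker, args=(func, values, sender))
    proc.start()
    sender.close()  # the parent keeps only the receiving end
    try:
        if receiver.poll(timeout_s):
            outcome = receiver.recv()
        else:
            outcome = ("error", "timeout")
    except EOFError:  # the child exited without sending a result
        outcome = ("error", "no result")
    if proc.is_alive():
        proc.terminate()
    proc.join()
    return outcome
```

In a script, run_with_timeout should be called from within an `if __name__ == "__main__":` block so that child processes can be started safely on all platforms. A function that never returns then yields ("error", "timeout") once the threshold has passed, and the client remains responsive.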

Second, there is the issue of cloud security. As has been pointed out, commercial cloud solutions are vulnerable [shaikh2011security, islam2016classification] and therefore require adequate responses [sabahi2011cloud]. Among other things, there are vulnerabilities due to multitenancy, virtualization, and resource-sharing. As we were aware of those issues, we chose to sidestep them by executing OODIDA on a corporate VPN instead. This does not mean that network security is not an issue, but it avoids additional threats that are unique to cloud computing. On a related note, the primary motivation behind this choice was the need to protect our data. In that regard, not relying on a third-party commercial cloud computing provider seemed like an obvious decision.

3.6 Complex Use Cases

The description of active-code replacement so far indicates that the user can execute arbitrary code on the server and clients, as long as the correct inputs and outputs are consumed and produced. What may not be immediately obvious, however, is that we can now even create ad hoc implementations of the most complex OODIDA use cases, an example of which is federated learning [mcmahan2016communication]. A key aspect of federated learning, compared to many standard types of assignments on our system, is that the results of one iteration are used as the input of the next one. The original implementation is discussed in the paper on OODIDA [ulm2019oodida]. With federated learning, clients update machine learning models, which the server uses as inputs in order to create a new global model. This global model is the starting point for the next iteration of training on clients.
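The iteration pattern described here, in which the output of one round becomes the input of the next, can be illustrated with a deliberately simplified toy example. The sketch below is not the OODIDA implementation and not the weighted averaging scheme of [mcmahan2016communication]: the model is a plain list of numbers, the server uses an unweighted mean, and the names client_update, server_aggregate, and run_rounds as well as the learning rate are hypothetical.

```python
def client_update(global_model, local_data, lr=0.5):
    """Toy local training: move each weight toward the mean of the local data."""
    target = sum(local_data) / len(local_data)
    return [w - lr * (w - target) for w in global_model]


def server_aggregate(client_models):
    """Create a new global model as the element-wise mean of client models."""
    n = len(client_models)
    return [sum(ws) / n for ws in zip(*client_models)]


def run_rounds(initial_model, client_datasets, rounds):
    """Feed the global model of each round into the next round of training."""
    model = list(initial_model)
    for _ in range(rounds):
        updates = [client_update(model, data) for data in client_datasets]
        model = server_aggregate(updates)
    return model
```

Both functions respect the interface constraint on custom code in that they consume and produce lists of numerical values. In this toy setting, the global model converges toward the average of the clients' local means.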