Nonlinear System Identification: A User-Oriented Roadmap

Johan Schoukens; Lennart Ljung

arXiv:1902.00683·cs.SY·February 5, 2019

Nonlinear System Identification: A User-Oriented Roadmap

Johan Schoukens, Lennart Ljung

PDF

2 Repos

TL;DR

This paper provides an accessible overview of nonlinear system identification, guiding practitioners and researchers through fundamental concepts, differences from linear methods, and practical choices with examples and intuitive explanations.

Contribution

It offers a user-oriented roadmap that bridges linear and nonlinear system identification, emphasizing practical understanding and decision-making for diverse applications.

Findings

01

Guides practitioners in nonlinear system identification

02

Highlights differences between linear and nonlinear methods

03

Provides practical examples and intuitive explanations

Abstract

The goal of this article is twofold. Firstly, nonlinear system identification is introduced to a wide audience, guiding practicing engineers and newcomers in the field to a sound solution of their data driven modeling problems for nonlinear dynamic systems. In addition, the article also provides a broad perspective on the topic to researchers that are already familiar with the linear system identification theory, showing the similarities and differences between the linear and nonlinear problem. The reader will be referred to the existing literature for detailed mathematical explanations and formal proofs. Here the focus is on the basic philosophy, giving an intuitive understanding of the problems and the solutions, by making a guided tour along the wide range of user choices in nonlinear system identification. Guidelines will be given in addition to many examples, to reach that goal.

Tables2

Table 1. TABLE I : Evolution of the RMS error on the Tail validation data, the number of linear parameters n θ L subscript 𝑛 subscript 𝜃 𝐿 n_{\theta_{L}} , and the number of nonlinear parameters n θ N L subscript 𝑛 subscript 𝜃 𝑁 𝐿 n_{\theta_{NL}} as a function of the model complexity during the model reduction process.

	BLA	NLSS	Decoupled	Equal Branches	Single Branch
RMS error	12%	0.49%	0.40%	0.40%	0.40%
$n_{θ_{L}}$	5	5	5	5	5
$n_{θ_{N L}}$	0	30	12	6	3

Table 2. TABLE II : Systems/models with external or internal nonlinear dynamics

External nonlinear memory	Internal nonlinear memory
no nonlinear output feedback	nonlinear output feedback
NFIR	NARX
Volterra
Open loop block oriented	Closed loop block-oriented
NLSS lower triangular	NLSS full

Equations276

\overset{y}{^} (t ∣ θ)

\overset{y}{^} (t ∣ θ)

\hat{θ}_{N} = argmin_{θ} t = 1 \sum N ∥ y (t) - \overset{y}{^} (t ∣ θ) ∥^{2}

\hat{θ}_{N} = argmin_{θ} t = 1 \sum N ∥ y (t) - \overset{y}{^} (t ∣ θ) ∥^{2}

\overset{y}{^} (t) = a_{1} u (t - 1) + a_{2} u (t - 2),

\overset{y}{^} (t) = a_{1} u (t - 1) + a_{2} u (t - 2),

\overset{y}{^} (t) = h (u (t - 1), u (t - 2))

\overset{y}{^} (t) = h (u (t - 1), u (t - 2))

x (t + 1)

x (t + 1)

y (t)

y (t)

y (t)

= u (t)^{2} + (w (t)^{2} + 2 w (t) u (t)) + v (t) .

Z^{t} = u (s), y (s); s \leq t

Z^{t} = u (s), y (s); s \leq t

q (t) = F (p (t))

q (t) = F (p (t))

\overset{y}{^} (t ∣ θ, Z^{t -})

\overset{y}{^} (t ∣ θ, Z^{t -})

F (z (t), \overset{z}{˙} (t), u (t)) = 0

F (z (t), \overset{z}{˙} (t), u (t)) = 0

\overset{x}{˙} (t)

\overset{x}{˙} (t)

y (t)

y (t) = y (t ∣ u^{t}); u^{t} = {u (s), s < t}

y (t) = y (t ∣ u^{t}); u^{t} = {u (s), s < t}

x (t

x (t

w (t) is a sequence of independent

g_{w} (\cdot)

y (t)

v (t) is a sequence of independent

g_{v} (\cdot)

\overset{x}{˙} (t)

\overset{x}{˙} (t)

\overset{y}{^} (t ∣ θ)

x

x

\overset{x}{˙}

\overset{y}{^} (t ∣ θ, Z^{1 - 1}) = i = 1 \sum d w (p (t), p_{i}) \overset{y}{^}_{i} (t ∣ θ, Z^{t - 1})

\overset{y}{^} (t ∣ θ, Z^{1 - 1}) = i = 1 \sum d w (p (t), p_{i}) \overset{y}{^}_{i} (t ∣ θ, Z^{t - 1})

\overset{y}{^} (t ∣ θ, Z^{1 - 1}) =

\overset{y}{^} (t ∣ θ, Z^{1 - 1}) =

r = 1 \sum d w (p (t), p_{r}) [b_{1}^{(r)} u (t - 1) - a_{1}^{(r)} y (t - 1)]

x (t + 1)

x (t + 1)

y (t)

x (t + 1)

x (t + 1)

\overset{y}{^} (t ∣ θ)

x (t + 1)

x (t + 1)

\overset{y}{^} (t ∣ θ)

x_{j} (t) = f (x_{1, \dots, j - 1}, u (t), θ),

x_{j} (t) = f (x_{1, \dots, j - 1}, u (t), θ),

x (t)

x (t)

u (t - 1), \dots u (t - nb)]^{T}

\overset{y}{^} (t ∣ θ)

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Nonlinear System Identification

A User-Oriented Roadmap

Preprint submitted to IEEE Control Systems Magazine

Johan Schoukens1 and Lennart Ljung2

1 Dep. INDI, Faculty of Engineering, Vrije Universiteit Brussel, Belgium

1 Dep. of Electrical Engineering, Eindhoven University of Technology, The Netherlands

2 Reglerteknik, ISY, Linköpings Universitet, Sweden

Summary

The goal of this article is twofold. Firstly, nonlinear system identification is introduced to a wide audience, guiding practicing engineers and newcomers in the field to a sound solution of their data driven modeling problems for nonlinear dynamic systems. In addition, the article also provides a broad perspective on the topic to researchers that are already familiar with the linear system identification theory, showing the similarities and differences between the linear and nonlinear problem. The reader will be referred to the existing literature for detailed mathematical explanations and formal proofs. Here the focus is on the basic philosophy, giving an intuitive understanding of the problems and the solutions, by making a guided tour along the wide range of user choices in nonlinear system identification. Guidelines will be given in addition to many examples, to reach that goal.

Introduction

Nonlinear system identification is a very wide topic, every system that is not linear is nonlinear. That makes it impossible to give a full overview of all aspects of the field. For that reason, the selection of the topics and the organization of the discussion is strongly colored by the personal journey of the authors in this nonlinear universe.

Identification of linear dynamic systems started in the late 1950’s. Zadeh [1] put the need for a well developed system identification framework on the top of the agenda, followed by early overviews of the field [2]. Eventually a series of books established the field [3, 4, 5, 6, 7]. Linear system identification booked many successes, and data driven modeling became an enabling factor in modern design methods.

Nonlinear system identification comes into the picture where linear system identification [7, 6, 8] fails to address the users questions. The real world is nonlinear and time-varying, and in some applications these aspects can be no longer ignored (see Figure 1) so that linear models become imprecise, or do not reproduce essential aspects of the behavior of the system under test. This article is focused on nonlinear system identification. An overview of time-varying system identification is given in [9] and the references therein.

Nonlinear behaviour appears in many engineering problems. In mechanical engineering, nonlinear stiffness and damping, nonlinear interconnections, etc. are troubling ground vibration tests of airplanes and satellites, resulting amongst others in resonance frequencies and dampings that vary with the excitation level (see Figure 2). In telecommunication, power amplifiers are pushed in a nonlinear operation regime to improve the power efficiency. Distillation columns exhibit nonlinear dynamic behavior. Many biological systems (eye, ear, sense of touch) apply first a nonlinear compression, known as the Weber-Fechner law, in order to cover the very large dynamic range of the inputs. The human brain is governed by nonlinear relations between the neurons.

The need for nonlinear system identification goes far beyond the control application field. Nonlinear models are instrumental to get a basic understanding in very different problems like brain activity modeling, chemical reactions, … where researchers still struggle with the question: How does it work? In these applications, (high) accuracy is not always needed, qualitative results can be very helpful to isolate the dominant terms. So, structural model errors (that is, deficiencies in the chosen model structure) become more important than noise disturbances, and the system identification tools should be properly tuned to deal with dominating structural model errors. This shows that there are many different reasons to move from linear towards nonlinear models. It is very important to keep in mind the motivation of the researchers who developed nonlinear identification tools to relate the different methods. Without understanding the drive of the researcher/scientist it is often very difficult to understand and appreciate the choices they made.

The outline of the article is organized along a number of sections that cover the main flow of the nonlinear system identification process, supported by sidebars to provide additional information and background information on some topics. The article starts with a short discussion on the lead actors in (nonlinear) system identification. These are still the same as those identified in the early 1960’s by Zadeh [1]: the data, the model, and the matching criterion. These are discussed from a nonlinear identification perspective in the Section "Lead Actors in Nonlinear System Identification". Next, in the Section "Why is Nonlinear System Identification so Involved", it is clarified why the complexity of the modeling process grows very fast when moving from linear to nonlinear system identification. The goal of the nonlinear modeling process strongly affects the required user effort. This is discussed in the Section "Goal of the Nonlinear System Identification Process". Nonlinear system identification is much more involved than that linear identification is. For that reason it should be a well informed decision to move towards nonlinear methods. This is discussed in the Section "Linear or Nonlinear System Identification: A Users Decision". There are many more nonlinear model structures than there are for linear systems. Making a proper choice along this wide range of possibilities is one of the major difficulties for newcomers in the field. Guidance to make a proper choice is given from a systems behavior perspective (for example: hysteresis, chaos, fading memory, etc.), and from a users point of view (physical or black box model, model that is linear-in-the-parameters, etc.). This is discussed respectively in the Section "The Palette of Nonlinear Models" and the Sidebar “External or internal nonlinear dynamics”. Black box nonlinear models are more difficult to access and to understand than physical models (“Black Box Models Complexity: Keeping the Exploding Number of Parameters Under Control; Increased Structural Insight; Model Reduction”). Often it is hard to get intuitive insight in the modeled behavior because the number of terms in the model becomes very large. Eventually, some attention is paid to experiment design. A number of (extensive) sidebars provide more detailed background information and highlight some important aspects of the nonlinear identification process so that the main flow of the article remains very clear. For example, in the Sidebar "Retrieving Structural Information", methods are discussed to increase the insight and to reduce the number of terms by searching for hidden structural information in the raw nonlinear black box models. The Sidebar “Impact of Structural Model Errors” makes the reader familiar with the huge impact that structural model errors have on the best identification practices.

Many aspects are illustrated on examples, and supporting software is made publicly available. Guidelines for the user, to pinpoint the main lessons and conclusions are given at the end of most sections.

The Lead Actors in (Nonlinear) System Identification

Any identification procedure to build models using observed input-output data is characterized by three main components, the data, the set of candidate models, and the estimation method. To the lead actors should also be counted the process to gain confidence of the estimated model, the validation procedure. These four actors will be briefly presented in this section, while more comprehensive treatments will follow in forthcoming sections.

The Data

The input-output data that is used to select a model is the fundamental information source. To select the signals to be measured, to decide how the input should be configured, and to collect the data with appropriate sampling procedures will have a major impact on the quality of the resulting model. This is the task of Experiment Design. It is important to realize that no model can be a perfect description of the true system under investigation. Any model will be an approximation of the truth, and it will be affected by the aspects of the system that are excited during the experiment.

For that reason it is important to design the experiment to cover the intended use of the model. The power spectrum (e.g. white or colored noise) and the amplitude distribution (e.g. uniform, Gaussian, or binary) should be properly set. Of course the classical linear identification rules remain also valid (persistency of excitation, maximum Fisher information), but these should be balanced against the other requirements to identify a well behaving approximating model (see also "Impact of Structural Model Errors").

Experiment design is further detailed in the section with that title.

User guideline: Make sure that the experiment covers the domain of interest and brings out all essential system features of interest.

The Model Structure: the Set of Candidate Models

For linear identification the choice of model sets is quite easy to grasp: Settle for a state-space structure of certain dimension or transfer functions of certain orders. In contrast, the choice of a model set for nonlinear identification is a major problem and offers a very rich range of possibilities. It is driven by the users preferences and directed by the system behavior. In fact, aspects of model properties and considerations of model choices dominate this article. An overview of the users choices along possible and useful nonlinear model structures is given in the section “The Palette of Nonlinear Models” arranged by the amount of prior physical knowledge about the system that is incorporated in the model. In addition, the systems behavior imposes whether the nonlinearity should be captured in a dynamic closed loop or not. This may have essential impact on the behavioral patterns of the model, and is further discussed in the Sidebar “External of Internal Nonlinear Dynamics”.

A further important distinction is whether disturbance sources enter before the nonlinearity or not. In the former case, proper stochastic treatment of the model becomes more cumbersome and may require advanced tools. This is discussed in the Side Bar “Process Noise in Nonlinear System Identification.”

In any case, a model should be capable of producing a model output $\hat{y}(t)$ for the output at time $t$ based on previous input-output measurements. This could be computed as a formal prediction of the output, or it can be based on other considerations. The set of candidate models is typically parameterized by a parameter vector $\theta$ , and the notation

[TABLE]

will be used for the model output corresponding to the model parameter $\theta$ .

User guideline

The choice of the model structure is directed by behavior and structural aspects.

•

Behavior aspects are imposed by the system: Can the selected model reproduce the observed macroscopic behavior like shifting resonances, hysteresis, etc?

•

Structural aspects are a user choice that is set by the level of physical insight that the user desires to inject in the model, ranging from white box physical models to black box models.

•

Structural model errors in the model result if the chosen model structure is not rich enough to contain a true description of the system

The Estimation Method

With a given data set and model set, the identification task is to select that model ( $\theta$ ) that best describes the observed data. Most such estimation methods are based on a criterion of fit between the observed output $y(t)$ and the model output $\hat{y}(t|\theta)$ , which can conceptually be written as

[TABLE]

If indeed the model output is computed as a one-step ahead prediction based on the model set and data available at time $t-1$ , and the prediction error is Gaussian, then the model estimate $\hat{\theta}_{N}$ will be the Maximum Likelihood Estimate, MLE. But the conceptual method (2) can be interpreted also in more pragmatic terms without making a statistical motivation, see “Impact of Structural Model Errors”.

The cost function (2) can be extended with a regularization term to include prior knowledge, or to impose a desired behavior like smoothness or exponential decay to the solution (see also (S70) in “Black Box Models Complexity: Keeping the Exploding Number of Parameters under Control; Increased Structural Insight; Model Reduction”).

User guidelines: The criterion of fit can be based on a statistically grounded choice if the noise disturbances dominate over the structural model errors. If the latter dominate, a weighting function can be selected that reduces the impact of structural model errors in the domain of interest (for example using a user selected frequency weighting), at a cost of getting larger errors outside the domain of interest.

Model Validation

When a model has been estimated, the question to ask is “Does it solve our problem?” and/or is it in conflict with either the data or prior knowledge? This is the essential procedure of model validation, which is further discussed in the section with this title.

Often the decision is that the model is not “good enough”, so some choices have to be revised. Typically other model sets have to be tested or the conclusion might be that the data was not informative enough, so the experiment design must be reworked. This is the reason why identification often is seen as an iterative problem with an “identification loop”. See, e.g. Figure 17.1 in [6].

Model validation is further detailed in the section with that title.

User guidelines: Validating of a model is a rather subjective and pragmatic problem. Check on a rich validation data set that covers the intended use of the model if the estimated model meets the user expectations.

Why is Nonlinear System Identification so Involved?

Nonlinear system identification is experienced to be much more involved than linear identification. Three aspects contribute to this observation: i) Nonlinear models live on a complex manifold in a high dimensional space, while linear models live on a simple hyperplane that is much easier to characterize. ii) Structural model errors are often unavoidable in nonlinear system identification, and this affects the three major choices: the experiment design, the model selection, and the selection of the cost function. iii) Process noise entering before the nonlinearity requires new numerical tools to solve the optimization problem.

From Hyperplane to Manifold

A linear dynamic system is described by a linear relation between the lagged inputs and outputs, for example, for a simple two tabs FIR (finite impulse response) model,

[TABLE]

that is shown in Figure 3 (a). The output is confined to a hyperplane in the three-dimensional space. For a NFIR (nonlinear finite impulse response) model, the relation can become arbitrary complex,

[TABLE]

as shown in Figure 3 (b). The estimation task is to estimate this surface. It is clear that the linear hyperplane can be characterized with only a few points, while it is impossible to tell anything about the complex manifold outside the domain where the function is sampled.

This reveals immediately a number of issues in nonlinear system identification that are less pronounced or even not present at all in linear system identification. 1) Experiment design will be extremely important because it should be guaranteed that the full domain of interest is covered. Extrapolation of the model should be avoided at all cost, unless there is physical insight that provides a natural description of the manifold. 2) Finding the parameters that describe the manifold results often in a highly nonlinear optimization problem. Good initial values are needed to make sure that the global minimum is reached. In practice this is often impossible, and the user has to be satisfied with a good local minimum. 3) Because the manifold can be very complex, it is often not possible to propose a model structure that is flexible enough to reproduce it exactly. This leads to the presence of structural model errors. These affect the whole identification process.

Summary: Nonlinear systems are intrinsically more involved and complex than linear systems. This affects the experiment design and the model selection of the system identification process.

System identification in the presence of structural model errors

As explained in the previous section, it is very hard to avoid structural model errors in nonlinear system identification. In "Impact Of Structural Model Errors", a formal definition is given of structural and random model errors, followed by a discussion how to deal with structural model errors that dominate the noise disturbances. Special attention is paid to the user choice how to shape the structural model errors, and on the impact of structural model errors on the variance estimate of the model.

Summary: Be aware that there is a high risk for dominating structural model errors (over the noise disturbances) in nonlinear system identification which makes some classical choices and results of the (linear) identification theory invalid. Proper actions are needed to deal with this new situation.

Impact of process noise on the system identification problem

The output of a (nonlinear) system depends not only on the known or measured inputs, the system can also be affected by signals that are not known to the user. These unknown inputs $w(t)$ are called process noise. While the measurement noise $v(t)$ does not affect the evolution of the system, process noise does as is clearly seen in the state space representation of a nonlinear system:

[TABLE]

Process noise can have a structural impact on the behavior of a system because it affects also the system’s internal signals and not only the measurements.. To set the ideas, we can consider without loss of generality the simple static nonlinear system

[TABLE]

At the output $y_{p}(t)=w(t)^{2}+2w(t)u(t)$ now depends on the input and does not have typical noise properties. Its mean value is different from zero. Moreover, it is shown in (S59) that also the apparent gain of the system can change for odd nonlinearities.

This simple example illustrates that the presence of process noise increases the complexity of the system identification problem significantly. If the process noise enters before the nonlinearity, its effect on the output $y_{p}(t)$ is affected by the nonlinear operations so that it is no longer realistic to use the Gaussian framework to formulate the cost function. Unknown distributions that depend on the input and on the model parameters are faced. For that reason, the output error framework, that assumes that all noise enters at the output of the system has to be abandoned, and a generalized multivariate probabilistic framework is needed. The complexity of the methods to solve these problems goes far beyond the linear system identification methods and a completely new set of tools is needed. For that reason it is important to detect the presence of process noise that passes through the nonlinearity, and to select the proper tools when needed. This is discussed in detail in "Process Noise In Nonlinear System Identification" and “Identifying Nonlinear Dynamical Systems in the Presence of Process Noise”.

User guidelines: Check if process noise passes through the nonlinearity, and select the proper tools as needed.

Goal of the (Nonlinear) System Identification Process

The goal of the modeling effort strongly affects the complexity of the system identification process. Issues that need to be addressed are: 1) Simulation or prediction models, 2) Physical models or black box models, 3) Application driven models. All these aspects are briefly discussed below.

Models for simulation or models for control?

Prediction model: In layman’s terms, a prediction model estimates the output of the system one-step-ahead at time $t+1$ , using the measured input up to time $t+1$ , and the measured outputs up to time $t$ . Prediction models are central for modern control applications, where essentially the predicted output is controlled.

Simulation model: The alternative is that the measured outputs are not used at all, but the output is calculated from inputs only. This is called a simulation model, and it can be used to simulate the behavior of the system for new inputs. These models are useful to test what happens in new situations, to design systems and controllers, to mimic physical systems, etc.

It is much harder to get a good nonlinear simulation than prediction model. Simulation models can become unstable, and it is harder to get small structural model errors than it is for one-step-ahead prediction models. Even a very simple linear model can often provide a good one-step-ahead prediction if the sample frequency is high enough.

A detailed discussion is given in "Simulation Errors And Prediction Errors". Figure S8 shows results of linear and nonlinear simulation and prediction models for the forced Duffing oscillator.

User guideline: Do not spent effort to get a complex simulation model if a simple prediction model can do the job. However, keep in mind that a good prediction can fail completely to generate a reliable simulation.

Physical or black box models?

A physical model is built on a deep insight in the internal behavior of the system. Detailed physical descriptions are made at the level of the (microscopic) subsystems, and next they are linked together in a macroscopic model that is built on (thousands of) nonlinear (partial) differential equations. Alternatively, simple use of physical insight can be used to guide the model process. Eventually, in both cases, the model depends upon a number of parameters that can be obtained from dedicated measurements (for example, the friction coefficient of a wheel on the road). These models are highly preferred in the industry, but they can be very expensive to construct and difficult to use in real time control. For that reason, such models are often not affordable.

Black box models become very attractive when a physical model is too expensive to develop. Black box models describe the input-output behavior of a system, and are tuned directly from experimental data. Black box models are simple to use and can be applied in real time computations.

Of course there are many possibilities in between both extremes. These are discussed in "The Palette Of Nonlinear Models".

User guideline: Select the model level that best balances the need for physical insight - behavior insight and the expenses to build and use the model.

Models constrained for particular applications

The ideal model should cover all possible applications, providing good output simulations for all possible excitations. Of course this is an unattainable ideal, and a more restricted goal should be defined. This is done keeping the application in mind, the model should be able to cover those situations and signals that are important for the actual application, and not more than that [12]. If a system will be mainly driven by low frequent sine excitations, no effort should be spent to develop the model also for wide band random noise excitations. This can again significantly reduce the modeling effort.

User guideline: Select carefully the domain and application of interest and focus the modeling effort on it.

Linear or Nonlinear System Identification: a Users’ Decision

As discussed before, nonlinear system identification is much more involved than the identification of a linear system. The experiment design is more tedious, the model selection is much more involved, and the parameter estimation is more difficult. For that reason, moving from the well established linear identification tools towards the more advanced nonlinear identification methods is an important decision that significantly affects the cost of the identification effort (time, money, experimental resources) that should be well informed. Is a nonlinear model needed to reach the required model quality? Is the quality of the data good enough to improve the results of a linear identification approach? How much can be gained if a linear model is replaced by a nonlinear one? Often, additional information is needed to address these questions.

If it is possible to apply periodic excitation signals, a full nonparametric analysis can be made that requires no user interaction, while the experimental cost with respect to a linear study remains almost the same (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”). On the basis of the results, the user can detect the presence of nonlinearities, quantify their level, and find out if it are even or odd nonlinearities. With this information the user can make a well informed decision on what approach to use, and how much can be gained by switching from linear to nonlinear modeling.

Detection, separation, and characterization of the nonlinear distortions and the disturbing noise

In this article, only a basic introduction to nonlinear distortion analysis is given. A detailed theoretical analysis is given in [8], and illustrations on practical examples (fighter jet, diesel engine [13], industrial robot [14]) are discussed in [15]. In this article, the ideas are illustrated on the forced Duffing oscillator that is discussed in full detail in “Extensive Case Study: the Forced Duffing Oscilator”. A detailed introduction to the nonlinear distortions analysis is given in “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”.

Nonlinear distortion analysis: The basic idea is very simple and starts from a periodic input signal with period $T=1/f_{0}$ . Only a well-selected set of odd frequencies (odd means that $f$ is an odd multiple of $f_{0}$ ) is excited, all the other frequencies have zero amplitude (see Figure S3). This excitation signal is applied to the nonlinear system under test. Effects from even nonlinearities (simplest even nonlinearity is $y=u^{2}$ ) show up at the even frequencies, while odd nonlinearities (like $y=u^{3}$ ) are present only at the odd frequencies, see [15] and references therein. At the odd frequencies that are not excited at the input, the odd nonlinear distortions become visible at the output because the linear part of the model does not contribute to the output at these frequencies. By using a different color for each of these contributions, it becomes easy to recognize these in an amplitude spectrum plot of the output signal. This is illustrated in Figure 4. The forced Duffing oscillator (see Figure S6) is excited at different excitation levels, and the output is plotted for the excited frequencies, the even and odd nonlinearities, and the disturbing noise level. For small excitation levels, the nonlinear distortions are at the 10% level (-20 dB below the output), while for the high excitation levels the nonlinear distortions dominate the output.

Linear or nonlinear model? On the basis of this information the user can make a well informed decision. For example, a linear model can be used if it is known that only small excitations will be applied, and if 10% errors can be tolerated. However for the large excitation levels, this is no longer an option because the nonlinearities are too large. In that case a nonlinear model is needed.

Noise floor: In Figure 4, it is seen that he nonlinear distortions are more than 40 dB or a factor 100 above the noise level (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”), also called the noise floor of the measurements [16]. The noise floor is the maximum power level over a given frequency band in the frequency domain of components that are not due to the applied signal, harmonics, or spurious signals (this are signals that are out of the control of the user, for example a disturbance picked up from the mains). The noise floor is estimated by analyzing the variations over the periodic repetitions of the output. This shows that the quality of the data is very high. The distance between the nonlinear distortion levels and the noise floor is a measure for the potential improvement that can be obtained by using a nonlinear model. In this case a gain of a factor 100 is possible for the high excitation levels if a good nonlinear model can be obtained.

User guideline: Make a nonparametric distortion analysis whenever it is possible to apply periodic excitations. Use this information to decide if a linear or nonlinear approach is needed, and to check how much can be gained by turning towards nonlinear system identification.

Linear modeling in the presence of nonlinear distortions

Deciding to go on with the linear identification approach implies also the presence of structural model errors in the results. This may be an acceptable solution, but the user should fully understand the impact of the structural model errors on the validity of the results. This is discussed in "Impact of Structural Model Errors". The major conclusions are that i) the experiment should be tuned to the application (same class of excitation signals), and ii) no reliable theoretical uncertainty bounds can be provided. These should be obtained from repeated experiments with different excitation signals generated from the relevant class of input signals [17].

The problem of linear modeling in the presence of nonlinear distortions is studied and discussed in "Linear Models of Nonlinear Systems". The classical linear identification methods will lead to consistent estimates of the best linear approximation.

For random excitations, the nonlinear effects at the output will look very similar to noise, and it is very difficult for an unexperienced user to recognize their presence. Their power spectrum can be estimated using a parametric noise model that is simultaneously identified with the plant model (for example using a Box-Jenkins model, see e.g. [6]). The noise model can be used in a control design to make the disturbance analysis, but uncertainty bounds calculated from it are no longer valid because the errors are not independent of the input. Again there is no theoretical framework available today to solve that problem. The variance of the estimates should be obtained by repeating the experiment multiple times with a random varying excitation as is explained in "Impact of Structural Model Errors".

Summary: Linear models can be very useful, even in the presence of (strong) nonlinearities, because it is much easier to deal with it. Make sure to understand the impact of the nonlinear distortions on the best linear approximation.

Nonlinear system identification

A nonlinear system identification approach is justified if the preprocessing step indicates too high nonlinear distortion levels that are well above the noise floor of the data. It is this problem that will be further addressed in this article. What identification methods to use? How to select a model class? How to design the experiments? This are questions that will be addressed in this article. Again the user will have many options, starting from simple nonlinear models that are good enough to solve the problem, to complex models that include also the fine details that are deeply hidden in the data.

User guideline: Select nonlinear system identification only if there is enough evidence that linear models will not solve the problem.

The Palette of Nonlinear Models

The Multitude of Nonlinear Models

A major challenge in dealing with nonlinear system identification is that very many nonlinear model structures have been suggested. A user easily can be quite confused getting familiar with the different choices that are available and how to choose a structure that suits his or her particular situation. A clear perspective on this wide variety of choices is needed to make a well informed choice. Nonlinear model structures can be ranked along two axis that are directed respectively by the user’s preference and by the systems behavior.

•

Users preference: A first classification of the models is made in terms of how much prior knowledge about the system is used, [18]. Based on the familiar concept of black-box models for general flexible structures with no physical insights the user can then select from a whole palette of models of different shades of gray to delineate many approaches to common nonlinear models.

•

System behavior: An alternative to use varying degrees of structural physical insights, is to use behavioral aspects of the system: Does it, for example, show behaviors like chaos, shifting resonance frequencies and varying damping, or hysteresis, as discussed by Pearson,[19]. Then, the main selection is to include the nonlinearity in a feedback loop or not. Observe that this selection is not a free user choice, it is imposed by the system behavior. A detailed ranking along this line is discussed in “Static Nonlinearities” and "External Or Internal Nonlinear Dynamics".

•

Remark: There are many other dimensions in which the different approaches can be classified like “universality”, “computational effort”, or “suitability” to deal with unstable systems. These aspects are not further elaborated in this article.

The user should combine both aspects in the final selection of the model.

General Structure of Nonlinear Models

In general the measured system input and output at time $t$ will be denoted by $u(t)$ and $y(t)$ . All measured data up to time $t$ will be denoted by

[TABLE]

The models can be expressed in discrete time or continuous time. Most real life systems evolve in continuous time, but often discrete time models are preferred to simplify the numerical simulations. Linear systems can be perfectly represented by discrete time models for zero-order-hold (ZOH) excitations [6, 7, 4], but this approach cannot be generalized to nonlinear systems because the ZOH nature of the signals is lost inside the nonlinear system. Alternatively, the discrete time approximation can be done within the bandlimited setup [20] (see” Approximating a Continuous Time NLSS with a Discrete Time NLSS Model”). The discretization error can be made arbitrarily small by increasing the sampling frequency and/or the model complexity as shown in "Approximating a Continuous Time NLSS with a Discrete Time NLSS Model". Both type of models will be considered in this section. In the discrete time case, the time variable will simply be enumerated by time instances $t=1,2,3,\ldots$ (“constant sampling interval of one time unit”).

The general structure of all the nonlinear models can be put in the form

[TABLE]

with $p,q$ vectors that are built on the signals turning around in the model, like inputs, outputs, states, internal variables. The function $F$ is a static nonlinearity. The properties and parameterization of static nonlinearities are discussed in more detail in “External or Internal Nonlinear Dynamics”, which also illustrates how it can be used in various nonlinear dynamic models.

The bottom line is that a model is an expression that allows the computation of the next output $y(t)$ based on previous observations

[TABLE]

This model output will depend on a parameter vector $\theta$ that is used to parameterize the model class. The notation $Z^{t-}$ denotes that $y(t)$ is excluded. In discrete time it means $Z^{t-1}$ . Remark: if a direct term is needed in the model, the regressors $Z^{t-1}$ can be extended with $u(t)$ .

Following the discussion in “Simulation Errors and Prediction Errors”, the model output $\hat{y}(t|\theta)$ will be a simulation output if it only depends on past inputs, and the corresponding model is a simulation or output error model. If the model output also depends on past outputs, it is a predicted output and the corresponding model a prediction model. This term does not necessarily imply that it is based on correct probabilistic treatment of the stochastic signals involved in the model.

In the list of models that follows, it will be indicated how they comply with the general structures (8) and (9).

Summary The user selected model set should reflect both the observed behavior aspects (for example, the presence of chaos or shifting resonances requires a closed loop around the nonlinearity) which leads to a selection imposed by the system, and the available physical insight, leading from snow-white to pit-black models, which is a user choice.

Snow-white Models

In the end, a model of a dynamical system is a collection of mathematical expressions that relate signals and variables that characterise the system behaviour. Important players are

•

$y(t)$ the output signal(s) of the system. These are the signals that are primary interest to be modeled.

•

$u(t)$ : the input signal(s) to the system. These are measurable signals that affect the outputs. They may or may not be manipulated by user.

•

$w(t)$ disturbance signals. These are unmeasurable signals that affect the outputs. They are typically described by random processes.

•

$x(t)$ auxiliary signals that are used in the model description

So, a model is a collection of equations involving $y,u,w$ and $x$ . To be a useful model it must to be possible to infer something about $y$ from the other measured variables.

Deterministic Models

If no disturbances $w$ are present, Then it should be possible to compute $y(t)$ from previous values of $u(s);s\leq t$ . That means that the model is deterministic or a simulation model or output-error model.

Stochastic Models

If there are stochastic disturbances $w$ present, the outputs also become stochastic variables. Then a specific value to $y(t)$ cannot be assigned based on the other variables, but a stochastic characterization of it must be used instead, like its probability density, or mean value $\hat{y}(t)$ . Since $w$ are not observed, their values up to time $t$ have to be inferred from the values of the observations $y(s)$ and $u(s)$ , $s<t$ The model values of $y(t)$ will then be conditioned on these variables and the (conditional) mean $\hat{y}(t)$ will really be a prediction based on past values.

To deal with these computability questions, it is necessary to be more specific about the structure of the collections of equations involving the model variables.

Deterministic DAE Models

Introduce for simpler notation, the vector $z(t)=[y(t),x(t)]^{T}$ for the outputs, and auxiliary variables. Assume that there are no disturbances. When writing down the equations that correspond to the physical knowledge of the system, typically differential equations are used, in addition to algebraic relations among the variables. That means that the equations can be written as

[TABLE]

where $\dot{z}$ is the time derivative of $z$ (It is sufficient to consider first order derivatives, since higher order ones can be rewritten with the aid of extra $x$ -variables. Also if $\dot{u}$ were to appear in (10), $u$ can be included in $z$ and the basic form can still be applied.). This a Differential Algebraic Equation, DAE model. There is a clear conceptual relation between (10) and the general structure (8).

It is a normal case that (10) can be solved for $z$ for given $u$ , and most software packages, e.g. [21] for modeling systems contain DAE solvers. There is an extensive literature, e.g. [22], [23] that addresses this problem of solvability of (10).

Deterministic Statespace Models

If $\dot{z}$ can be solved explicitly from (10), it follows that

[TABLE]

which is a standard, nonlinear statespace description for the relationship from $u$ to $y$ . If an initial state $x(0)$ is given, this equation has as unique solution $y$ under general and mild conditions. $y(t)$ will be a function of past $u(s)$ ,

[TABLE]

Stochastic Statespace Models

With disturbances present it is customary (and necessary) to represent these as white noise $\nu$ filtered through certain filters and assume that

[TABLE]

To avoid intricate issues with continuous time white noises only the discrete time case is considered here.

With the state-space forms (12), (13) the links with the general predictor model structure (9) become clear.

The problem to find the conditional distribution or conditional mean $\hat{y}(t|t-1)$ given past input- output signals is the well known (nonlinear) prediction or filtering problem. In case $f$ and $h$ are linear and $\nu$ and $e$ are Gaussian variables, this is solved by the familiar Kalman Filter, [24]. In the general case, the filtering problem has no closed form solution, but there is recent progress for numerical solution in terms of particle filters, e.g. [25], [26],

A simplistic (but mostly erroneous) way to deal with the nonlinear filtering problem is to assume that all disturbances can be collected as white noise added to the output. Then the model will correspond to an output error model like (11) (see also “Process Noise in Nonlinear System Identification”).

Summary: The work to construct a snow-white model, sometimes called First Principles modeling or Mechanistic modeling is typically time-consuming and laborious, and does as such not involve any system identification. It is supported by software like the object oriented languages Modelica, [27] or Simscape, [28].

Off-white Models

In the snow-white modeling work, it typically happens that the model includes one or several physical constants whose numerical values are not known. If the values cannot be established by separate measurements, they have to be included as a parameter $\theta$ in the model. In the deterministic state space case the model then takes the form

[TABLE]

where $\hat{y}(t|\theta)$ signifies that it is the output corresponding to the specific parameter value $\theta$ . In the stochastic case, the symbol will mean the predicted output time $t$ based on the model with parameter $\theta$ . Since its value is unknown, the model is no longer snow-white but has become an off-white model. Such models are also known as grey-box models, e.g. [29], but as follows from the sequel, there are several shades of grey.

To build an off-white model is the same work as to build a snow-white model, so a considerable amount of work might be required. It can also be remarked that it may not be possible to retrieve all the physical parameters from an identification experiment.

The off-white model (14) is a clear cut case of the general predictor structure (9). It also complies with (8) by taking $F$ to be solver of the underlying nonlinear differential equation.

An application of off-white model identification to a cascaded tanks system is given in (29) in Section “Examples”.

Summary: Off-white model identification requires substantial modeling work. It is important to realize that any deficiencies in the physical model may cause corruption in the physical parameter estimates.

Smoke-grey models: Semi-physical Modeling

By semi-physical modeling is meant doing physical modeling with a more leisurely attitude to the physics.

It could be a matter of using qualitative reasoning rather than formal equations. Take for example a voltage controlled DC motor, with input applied voltage $u$ and output motor shaft angle $y$ , a known load disturbance torque $L$ also acts on the shaft. Well known physical laws tell us that the applied voltage to the rotor circuit is split between the internal resistance in the rotor winding and the back emf resulting from the motion of the winding in the magnetic field. The latter is proportional to $\omega$ , the rotational speed. The torque $T$ from the magnetic field is proportional to current in the winding and the resulting torque on the shaft is $T$ - $L$ minus frictional torque, which typically proportional to $\omega$ . The resulting torque will by Newton’s law of motion be proportional to the rotor acceleration $\dot{\omega}$ . All this means that the voltage $u$ will be a linear combination of $\omega,\dot{\omega}$ and $L$ . Since $\omega$ is the derivative of $y$ it follws that

[TABLE]

This is of course the same model that would have been obtained with careful “white” modeling with the difference that $\alpha,\beta$ would have been expressed in physical constants of the motor, like internal resistance, friction coefficient, rotor moment of inertia, and magnetic field characteristics. That is not really essential, since the only physical constants that can be retrieved from input-output data are the combined expressions $\alpha,\beta$ .

Another useful form of semi-physical modeling is finding nonlinear transformations of the measured data, so that the transformed data stand a better chance to describe the system in a linear relationship. To give a trivial example, consider a process where water is heated by an immersion heater. The input is the voltage applied to the heater, and the output is the temperature of the water. Any attempt to build a linear model from voltage to temperature will fail. A moment’s reflection tells us that it is the power of the heater that is the driving stimulus for the temperature: thus let the squared voltage be the input to a linear model generating water temperature at the output. Despite the trivial nature of this example, it is good to keep it as a template for data preprocessing. Many identification attempts have failed due to lack of adequate semi-physical modeling. See, e.g., [6], Examples 5.1 and pages 533 - 536 for more examples of this kind.

Also nonlinear re-calibration of time scale can be counted to the family of non-linear transformations of measured data. Several systems have a natural time-maker. It could be a rotational system where the angle of the rotation is a natural time unit or it could be a flow system where the accumulated amount of transported substance gives a natural time flow (see the example “Buffer Flow System”).

The perhaps most common and important application of semi-physical modeling is to concatenate known submodels A simple example is the DC case above, which is a concatenation of a DC motor model from $u$ to $\omega$ and an integrator. But having libraries of basic element models from which more complex models are built up using simple physics and logics is now perhaps the most common way for modeling in many application areas. Modeling languages like Modelica are based on this principle, and extensive Modelica libraries exist for most applications.

User guideline: It is always important to think over the physics of the system to be identified, even if a complete off-white model is not constructed. Such semi-physical modeling can give insights into important non-linear transformations than can be essential components in the model.

Steel-grey Models: Linearization Based Models

Linear Models of Nonlinear Systems in Identification

It is well known how a nonlinear model like (14) can be linearized around a stationary point $x^{*}$ (see “Linear Models of Nonlinear Systems”). In fact, using linearized models may be the most common way to deal with nonlinear systems in practice. Identifying a linearized model is normally not done by going through the linearizing differentiations (since the nonlinear models is typically unknown). Instead one simply postulates a linear model structure and applies normal linear system identification. If the excitation keeps the system in a close vicinity of the stationary point $x^{*}$ , the identified model will be close to the “stationary point-linearization”. In general, the identified model will be a “stochastic linearization (BLA)” reflecting the input and output signal spectra of the identification data. In any case, the usefulness of the model may be limited since it only describes the system in the vicinity of $x^{*}$ .

Local Linear Models

An obvious remedy to the local nature of the linearized model is to work with several linearized models and connect them in some way to cover the global behavior of the system. The many ways to do this have lead to an extensive literature on local linear models [30] (several other terms used as well).

The basic idea is to divide the “state space” into Regions in within each of which a linear model is used to describe the system. To fix ideas, around a common special case, use a collection of measurable “regime points” $P=\{p_{i},1=1,\ldots,d\}$ which define centers of the regions. Each point is like a stationary point $x^{*}$ of a statespace model (but needs not be formally defined like that). Each of these points is associated with a linear model, generically characterized by its output prediction $\hat{y}_{i}(t|\theta,Z^{t-1})$ . The type of linear model could be arbitrary, and some examples are given below. When the system is at the regime point $p_{i}$ there is a clear linear model for predicting the output, At other points $p(t)$ the prediction can be interpolated from the values in $P$ by

[TABLE]

This is the archetype of a local linear model. This is explicitly of the form (9). In terms of (8) the nonlinearity hides in the weights $w$ and the shifts between linear models that they represent.

For a concrete case the following items need to be specified

What are the measurable regime points? 2. 2.

How is the collection $P$ determined? 3. 3.

What type of linear models $y_{i}(t|\theta,Z,^{t-1})$ are used? 4. 4.

How to choose the interpolation rule $w(p,p_{i})$ ?

Some comments:

1: Regime points

They are often naturally defined by the application, and may typically be part of the state vector. They correspond to operation conditions that are known to give well defined behavior. Typical cases could be fluid levels in flow systems applications and speed and altitude in flight applications. The regime points can also be determined from data, like a tree-based construction in LOLIMOT, [30].

3: Linear Models

. There is a wide range of linear model types available. The simplest one is ARX-models, that predict the output as a linear combination of past inputs and outputs:e.g. $\hat{y}(t|\theta)=-a_{1}y(t-1)+b_{1}u(t-1)$ in a first order case. Associating the $r^{th}$ regime point model with parameters with superscript $(r)$ the complete model (16) will be

[TABLE]

which is a linear regression with parameters $\theta=[a_{1}^{(r)},b_{1}^{(r)},r=1,\ldots,d]$ . There is a vast literature on multiple and local linear models, like [30], [31], [32]. The latter article uses fuzzy sets for the model interpolations.

LPV Models

A related concept is that of Linear Paramater Varying, LPV, models. In state space form they can be described as

[TABLE]

(correspondingly in continuous time). Disturbances can also be added. $p(t)$ is a measured regime variable. Formally (18) is not a non-linear system, but rather a linear, time-varying system. But if $p(t)$ in some way depends on the state $x$ , this can be a handy way of dealing with a nonlinear system. There is an extensive literature on dealing with and identifying LPV models, e.g. [33], [34], [35]. If $p(t)$ only assumes a finite number of different values, (18) is really a collection of local linear models, and several identification schemes can be based on the ideas in the previous subsection, and/or on handling time-varying linear systems.

An important difficulty for LPV systems is keeping track of the state space basis when $p(t)$ is changing.

Summary: Using linearlzation is a standard tool to handle nonlinear physical systems. Many possibilities exist to glue together linearized pieces into a good nonlinear model.

Slate-grey Models: Block-Oriented Models

A common and useful family of models is obtained by concatenation blocks of two types:

•

Linear Dynamic Models: $y=G(s)u$

•

Static Nonlinearities: $y(t)=f(z(t))$

Many such combinations have direct physical interpretations, like (See Figure 5)

•

The Wiener model: A linear model followed by a static nonlinearity $y=f(G(s)u)$ , describes linear plant with a nonlinear output sensor.

•

The Hammenstein model. A static nonlinearity followed by a linear system, $y=G(s)f(u)$ depicts a linear model controlled via nonlinear, say saturating, actuator.

So, such block-oriented models have a flavour of the smoke-gray, semiphysical models, constructed by simple engineering insights of qualitative nature.

But beyond that, block-oriented model possess interesting approximation properties. It is for example known, [36], that the parallel Wiener model in Figure 5 with sufficiently many linear branches, can arbitrarily well approximate fading memory systems $u\to y$ . The convergence rate can be improved by switching to parallel Wiener-Hammerstein models [37]. So, block oriented models can be used for many nonlinear systems in appropriate configurations without any physical interpretation and achieve good modeling results. This is much in the spirit of black-box models (next subsection). That motivates giving block-oriented models a darker shade of gray than “smoke-gray”. See Figure 21, for an application that is inspired by such simple reasoning but also benefitting from the extra flexibility by the input nonlinearity.

The block-oriented models clearly fit the general structure (8) with linearly manipulated model signals in combination with static nonlinearities. To find the predictor (9), it is necessary to follow the signal flow through the linear and nonlinear parts. If there is process noise a correct calculation of the predictor may require the use of advanced statistical techniques, like particle filters, [38](see also “Identifying Nonlinear Dynamical Systems in the Presence of Process Noise”.

The estimation of a block-oriented model basically follows the general minimization of fit between model (predicted) output and measured output. But before that can be done, considerable work may have to be done to come up with the structure and initial parameter values. The concept of the best linear approximation (see “Linear Models of Nonlinear Systems”) is useful in this context. A comprehensive account of estimation techniques for block-oriented model is given in the survey[39, 40].

The Hammerstein, Wiener, Hammerstein-Wiener and Wiener-Hammerstein, including the parallel structures, are all examples of nonlinear systems with external nonlinear dynamics (see “External or Internal Nonlinear Dynamics”): the nonlinear blocks are not captured in a dynamic feedback loop. The Wiener-Hammerstein feedback and the LFIR feedback systems are both nonlinear systems with internal nonlinear dynamics because the nonlinear block N is part of a dynamic closed loop. Dealing with multiple branches [41] and feedback structures [42] turns out to be very challenging.

Summary: Block-oriented models form a powerful and intuitive tool to handle nonlinear systems. It is often useful to try a simple Wiener or Hammerstein model to see if nonlinear model components show significant improvements over linear models.

Black models: universal approximators

So far, the starting point has been some kind of physical or behavioral aspect of the system when constructing the model. But in the end, the model – the Predictor – is a mapping from past input-output data to the space where the output lives. Lacking insights into the system the focus could be on building general, flexible such mappings that are universal and effective approximators of any reasonable predictor function. That is the idea behind Black (box) Models.

A General Structure of the Mapping

A general way to generate very flexible mappings from $Z^{t-1}$ to $\hat{y}$ is to construct a state $x$ from past input output data, and let the predictor be a general function of this state:

[TABLE]

where $f$ and/or $h$ are flexible static nonlinear functions of their arguments $x(t),u(t),y(t)$ . The possibilities of parameterizations to reach flexibility are discussed in general terms in “Static Nonlinearities”.

The model (19) is directly of the general predictor form (9) and links to the structure (8) through the nonlinear maps $f$ and $h$ .

Working with both $f$ and $h$ may be too general, and two cases will be further discussed.

Unknown State Transition Function: nonlinear state space models (NLSS)

Perhaps the most natural general black box approach is to postulate a general statespace model like (14) in discrete time (see also “Approximating a Continuous Time NLSS with a Discrete Time NLSS Model”):

[TABLE]

If state transition function $f$ is not known, it could be parameterized with $\theta$ e.g. as a flexible basis function expansion (S3).

Similar expansions can be applied to $h$ . In some cases, the state $x$ is measurable and the $h$ would be known. In such cases, [43] has applied a Gaussian Process model to $f$ which corresponds to a basis expansion in terms of eigenfunctions associated with the kernel (the covariance function for the Gaussian Process).

Nonlinear state space models with internal and external dynamics In “External or Internal Nonlinear Dynamics” the character of the nonlinear dynamics is discussed. This reflects whether the signals are fed around a nonlinearity or not. For a NLSS that property depends on the structure of $f$ :

If $f$ is lower triangular,

[TABLE]

the states can be solved for explicitly from $u$ and $\theta$ . Then it is possible to write (20) in the form $\hat{y}(t|\theta)=h(u(t),\ldots,u(t-n_{b}))$ which is a system with external nonlinear dynamics (actually an NFIR model). Observe that in (21), the state $x_{1}(t)$ does not depend upon other states, it only depends on $u(t)$ .

If the nonlinear state-space cannot be written under the lower triangular form (21), it is in general not possible to solve the equations explicitly as a function of $u(t)$ , and the function (20) becomes a system with internal nonlinear dynamics. For these systems, there will be at least one state $x_{i}(t)$ that is a nonlinear function of the past values of some of the other states. This can be interpreted as the presence of a feedback of an unobserved output of the system.

NARX Models

A special case of (19) is the NARX model, where the state is constructed as a finite number of past inputs and outputs

[TABLE]

If $h$ is a linear function, this predictor is the familiar simple ARX structure for a linear model. But as indicated, a general nonlinear static function $h$ that can be expressed e.g. in basis function expansion (S3). This structure is therefore known as NARX (nonlinear ARX). If $na=0$ it is known as an NFIR (nonlinear FIR)

NARX models are a very common class of nonlinear models and can describe a large class of nonlinear systems [44, 45]. However, they are not as general as the nonlinear state-space models discussed before. For example the nonlinear system $y(t)=(G(q)u(t))^{2}$ , cannot be represented in an input-output presentation (since the even nonlinearity $x^{2}$ cannot be inverted).

NARX models come in many different shapes, depending on how $h$ is parameterized. See “Sidebar: Static Nonlinearities”. They include Volterra systems, (S13), Neural Networks (S9), Gaussian Processes [46, 47], as well as custom made, semiphysical models, (S4).

The case where the nonlinear function $h$ is written as a linear combination of known basis functions, (S3), [48, 49], simplifies the identification problem to a linear regression. Then no iterative optimization procedure is needed [6, 8]. This is one reason why NARX models are very popular and have been successfully applied to many industrial problems.

The number of terms $M$ of a NARX model with basis expansion (S3) may grow very fast with the memory length. Special model pruning methods have been developed to keep only the most dominant terms in the model, e.g. [44].

User guideline

Lacking physical insights it may be necessary to use black-box model structures. Many flexible and useful such structures exist. But keep in mind that they all have a strong curve-fitting flavor and may not pick up any intrinsic system features. They basically reflect the properties of the estimation data which must be chosen with great care in these cases.

Pit-black models: nonparametric smoothing

So far models have been described essentially in parametric and analytic terms. But there is also another possibility: A pit-black model takes a “geometrical” view on the observed data set and the model construction. This approach is outside the scope of the current survey, but a few basic facts can be provided to make the model discussion more complete:

The model is a relation between the predictor and the output, i.e. between all past observations and the observed next output. Denote by $\varphi(t)$ an $n$ -dimensional vector of relevant representation of past observations (it is like the the state in (19)) Then the model is a relation

[TABLE]

Here $g$ will act like the predictor in the previous sections, so it corresponds to the general model structure (9). But instead of focusing on estimating $g$ by some parameterization, the whole data record can be viewed “geometrically”. Assume that $y(t)$ is scalar. Then each value pair $y(t),\varphi(t)$ is a point in a $n+1$ -dimensional space $Z$ . So the data set is a “point-cloud” in this space. In that perspective the modeling task is to find a surface $g$ in $Z$ that as well as possible describes this cloud. See Figure 6.

This can be accomplished by various ways to “smooth” the raw data cloud. The basic assumption is that the model surface is “smooth”, i.e. that $g(\varphi_{1})$ and $g(\varphi_{2})$ are “close” if $\varphi_{1}$ and $\varphi_{2}$ are. This should mean that if the observed points $[y_{i},\varphi_{i}],i=1,\ldots,N$ are available an estimate $\hat{g}(\varphi_{*})$ could be constructed for any point $\varphi_{*}$ from observed $y_{i}$ at neighbouring points:

[TABLE]

where the kernel $w$ weights the value of the observation $y$ by its distance to the sought regressor point $\varphi_{*})$ . Such smoothing kernels and nonparametric estimates $\hat{g}(\varphi_{*})$ are discussed extensively in the statistical literature, e.g. [50],[51]. In the control literature, they have been discussed under the name of “just-in-time-models”, since the estimate at point $\varphi_{*}$ is constructed from raw measured data only when it is requested.

A very simple, and common approach is to let $w$ be zero except for the $\varphi(t)$ that is closest to $\varphi_{*}$ in the data base. That makes the estimate $\hat{g}$ equal to its nearest neighbour. Many other kernels have been suggested and studied and a commonly used one is a parabola bottom, turned upside down, the Epanechikov kernel, [52].

[TABLE]

where $(\cdot))_{+}=\max[\cdot,0]$ and $C$ is a normalization constant. Another approach to selecting the weights $w$ in (24) is Direct Weight Optimization, DWO, e.g. [53]. Here $w$ is selected so that an upper bound of the error in the estimate $\hat{g}(\varphi_{*})-g(\varphi_{*})$ is minimized using a quadratic programming technique.

Manifold Learning

Another approach to this “geometrical” construction of linear models is to find lower dimensional manifolds in the cloud where has a clear concentration of points.

A linear model version of this idea can be used for illustration: in the “data cloud” setup a model is a hyperplane in the data space. Then many techniques are well known to fit the hyperplane to data, including principal component analysis (PCA) for finding essential linear subspace descriptions. That means that the modeling can be concentrated to the selected subspace with lower dimensional models. The nonlinear counterpart is to find a lower dimensional manifold $\Upsilon=r(Z)$ and express a model in terms of the lower dimensional image of the variables $\varphi$ under this mapping. Finding such a manifold is a challenging problem that has been discussed in an extensive literature under the name of Manifold Learning, e.g. [54] (Isomap), and [55] (Local Linear Embedding).

Summary: Pit-black models constructing model predictors directly from data, not employing explicitly a parameterized model is an interesting option for nonlinear identification that has not been used that much in the control community.

Experiment Design

The experimental data is the fundamental information source for the data driven modeling process. Practical (easy access to a nonparametric noise and nonlinear distortion analysis to guide the model selection process) and theoretical concerns (maximize the information in the data with respect to the selected model structure) should be addressed during the design of the experiment. This leads directly to the following guidelines:

•

Practical concerns: Use periodic excitations whenever it is possible, because these give a direct access to a nonparametric distortion analysis, without any user interaction (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”).

•

Theoretical concerns: Design the (amplitude) distribution and power spectrum of the excitation to maximize the information with respect to the parameters that need to be estimated in the selected model structure. Keep in mind that this is still no guarantee that the full domain of interest is covered (see for example Figure 7).

•

Warning: Because structural model errors often dominate the noise induced errors, it is necessary to select excitations signals that reflect the later use of the model in order to keep the structural model errors ‘small’ in the domain of interest (see Section "System identification in the presence of structural model errors" and “Impact of Structural Model Errors”).

Design of periodic excitation signals

The most simple periodic excitation is

[TABLE]

The period of this signal is $T=1/f_{0}$ . In most experiments, $u(t)$ is generated and processed in discrete time $t=lT_{s}$ , with $l=1,\ldots,N$ and $T_{s}=1/f_{s}$ the inverse sample frequency. The sample frequency and period length are ‘matched’ to each other by chosing $T=NT_{s}$ , so that $N$ samples fit exactly in one period of the signal. This relates also the frequency $f_{0}$ to the sample frequency $f_{s}$ by $f_{0}=f_{s}/N$ .

More general periodic signals are represented by their Fourier series as the sum of harmonicaly related sines and cosines with frequencies at integer multiples of $f_{0}$ :

[TABLE]

Such a signal is called a multisine [56], and has a period $T=1/f_{0}$ and frequency resolution $f_{0}=f_{s}/N=1/T$ . The amplitude spectrum $U_{k}$ , the phases $\varphi_{k}$ , the number of excited frequencies $F$ , and the frequency resolution $f_{0}$ are user choices that define the periodic signal. These can be set by the following guidelines (see also “Nonparametric Noise and Distortion Analysis Using Periodic Excitions”):

User guidelines to design a multisine [57, 8, 58]:

•

Spectral resolution $f_{0}=f_{s}/N$ : should be chosen high enough so that no sharp resonances are missed [59].

•

Period length $N$ : is set by $T=1/f_{0}=N/f_{s}$ . A higher frequency resolution requires a longer measurement time.

•

Amplitude spectrum $U_{k},k=1,\ldots,F$ : should be chosen such that the frequency band of interest is covered.

•

Phases: Use random phases, mutually independent for $k\neq l$ , and $E\{e^{j\varphi_{k}}\}=0$ . For example, select $\varphi_{k}$ from a uniform distribution on the interval $[0,2\pi)$ . For this choice, it follows from the central limit theorem, that $u(t)$ is asymptotically Gaussian distributed for $F\rightarrow\infty$ .

•

Signal amplitude: should be scaled such that it also covers the input amplitude range of interest.

•

Length of the experiment: At least one, and preferably a few, for example 3, periods should be measured.

•

If possible, repeat the experiment with a new realization of the random phase multisine and average the results over the multiple realizations to get improved estimates of the nonlinear distortion levels. The additional data sets are also very valuable for model validation on different but very similar excitations, and the generation of more reliable uncertainty bounds in the presence of structural model errors (see also “Impact of Structural Model Errors”).

Remark: By optimizing the phases, it is possible to create randomized signals with a user controlled amplitude distribution and power spectrum [60]. For example, signals with a uniform amplitude distribution that excite a specified frequency band can be generated.

Experiment design: most informative experiment

The goal of the experiment design can be formalized as a procedure to obtain the minimum required information needed to reach the modeling goals at the lowest experimental cost (time, power consumption, disturbance of the process, etc.).

Originally, optimal experiment design was completely focused on maximizing the information content of the experiment [61], quantified by the ’Fisher Information Matrix’ $M$ [6, 7, 8] that is directly linked to the smallest possible covariance matrix of the parameter estimates $P_{\theta}=M^{-1}$ . A scalar measure, for example, the determinant $det(M)$ or the trace $tr(M)$ , is used to quantify the information in the experiment.

Although an optimal experiment design is not always done for each experiment, it is very useful to make the exercise on a number of typical problems to be studied, because it provides a lot of intuitive insight in what turns an experiment into a good one. On the basis of that experience the quality of the experiments can be significantly boosted, even without designing explicitly an optimal input.

For models that are linear-in-the-parameters, the information matrix $M$ does not depend on the actual parameter variables. This does not hold true if the output of the model is a nonlinear function of the parameters, in that case $M(\theta_{0})$ depends explicitly on the true but unknown parameters $\theta_{0}$ . Often, the true parameters $\theta_{0}$ are replaced by an estimate $\hat{\theta}$ obtained from an initial experiment.

The more structured a model is, the higher the gain that can be obtained by a specific optimally designed experiment. The optimal experiment for a first order linear system, described by its nonparametric impulse response representation $g(k),k=1,\ldots,n$ is a white noise excitation. If the same system is represented by its transfer function model $G(s)=\frac{1}{1+\tau s}$ , the optimal experiment is a single sine excitation at a frequency $f=1/\tau$ [62].

Optimal input design for linear systems

Optimal input design for linear systems is fully understood today. In the basic problem, an excitation is designed that maximizes the determinant of the information matrix $det(M)$ . The optimal design minimizes the normalized variance of the estimated transfer function and is retrieved by solving a convex optimization problem. Since the problem depends only on the second order properties of the input signal, its solution is given by an optimal power spectrum of the excitation signal [61, 62, 63]. The actual shape of the signal (amplitude distribution) does not affect the information, but practical constraints can have a strong impact on it, leading for example to signals ranging from filtered white noise (having a Gaussian distribution), to binary excitations with a user imposed power spectrum.

Soon it became clear that this simple problem statement does not meet the full user needs. The optimal experiment design should be plant friendly (not all excitations are acceptable for the operators) [64]. Moreover, the uncertainty on the estimated model should be tuned to result in a (control) design that meets the global goals of the project. These initiated a search of application oriented input design [65, 66, 67, 12, 68], resulting in a design that pushes the uncertainty in a direction where it does not hurt the quality of the application: the uncertainty ellipsoid of the model is matched to the contour plots of the application cost.

Optimal input design for nonlinear systems

While the optimal input design for linear systems is well understood, it is much harder to provide general guidance of the optimal input design for nonlinear systems. Although it is possible to retrieve (numerically) the Fisher information matrix for nonlinear systems, it is very difficult to interpret these equations and to translate them into an optimal input [69]. This is due to the dependence of the Fisher information matrix $M$ on the higher order moments of the input for nonlinear systems. This leads to the design of the multivariate probability distribution of the input (for all moments and all lags) [70, 69] which is a highly non-convex problem.

In the case of a static nonlinear system that is linear-in-the-parameters, the problem becomes convex again. The solution depends only on the amplitude distribution of the excitation and results in a signal that is concentrated around a discrete set of excitation levels. The order of the samples does not influence the solution so that the power spectrum is completely free in this case.

The optimal input design for nonparametric impulse response estimation (resulting in a white noise excitation) can be generalized to nonlinear systems using the nonparametric Volterra representation. It leads to a design that combines the properties of the linear impulse response design (white noise excitation) with that of optimal input design for static nonlinear systems (discrete set of amplitude levels) [71]. This idea was also the starting point for a numerical design, leading to numerical procedures that make a brute force search for discrete level signals [72, 73].

Solving the full optimal experiment design problem for nonlinear systems is tackled today using brute force numerical optimization methods. The solutions can be generalized and simplified using a proper normalization of the problem. The choice of the signal constraint (power constraint at the input or output, amplitude constraint at the input, output, or at an intermediate signal) turns out to be most important. A natural choice seems to restrict the amplitude range at the input of the nonlinear sub-system [74].

In some problems, the interest is in the identification of a single parameter in a nonlinear model. Its variance can be reduced using a well designed feedback law that can be applied in real time [75].

Summary

•

Optimal input design that is solely based on the Fisher information matrix $M$ and its related variance expressions should only be applied if there are no dominant structural model errors present.

•

For linear systems, optimal input design is a well developed field and the nature of the solutions is fully understood, even if numerical procedures are needed to calculate the optimal input.

•

Optimal input design is closely linked to the intended application.

•

For nonlinear systems, numerical procedures are available (for some nonlinear model structures). A full understanding of the important aspects of a good solution is still lacking.

Model Validation

Model validation addresses the question “Does the model solve our problem?”, and/or is it in conflict with either the data or prior knowledge? A number of linear and nonlinear validation tools are discussed below.

Cross Validation

One of the most common and pragmatic tools for model validation is Cross Validation, that is to check how well the model is able to reproduce the behaviour of new data sets —Validation Data — that were not used to estimate the model. One way is to use the input of the validation data to simulate the model, to produce a simulated model output $\hat{y}_{S}(t)$ and compare how well this model output reproduces the output $y(t)$ of the validation data. The comparison could simply be a subjective, ocular inspection of the plots, to see if essential aspects of the system (for the intended application) are adequatly reproduced.

The comparison can also be done by computing numerical measures of the fit between the two signals. These are naturall based on the distance between $y(t)$ and $\hat{y}_{s}(t)$ . A common numerical measure is the fit, used in the system identification toolbox, [76].

[TABLE]

So the fit tells, in percent, how much of the variation of the output is correctly reproduced by the model.

For models that contain integration or are used for control design it may be more revealing to evaluate the model’s prediction capability. Then the $k$ -step ahead predicted output for validation data $\hat{y}_{p}(t|t-k)$ is computed using the model ( $\hat{y}_{p}(t|t-k)$ that depends on all relevant past input, and the output up to time $t-k$ ). The prediction can then be compared with the measured validation output by inspecting the plots or by the fit criterion (28).

Apart from these simple simulation and prediction applications, cross validation can be used in several sophisticated ways, discussed in the statistics literature, see e.g. [77].

Nonparametric validation for periodic excitations

Using periodic excitations gives access to a nonparametric noise and distortion analysis (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”). Adding the residues $y(t)-\hat{y}(t)$ to this plot shows how well the model captured the nonlinear contributions (how far are the validation errors below the nonlinear distortion levels), and if the errors drop to the noise floor.

Actions

•

The error level is above the noise floor: Structural model errors are detected. The user should decide if these errors are acceptable or not.

•

Nonparametric analysis of the errors: If the special periodic excitations of “Nonparametric Noise and Distortion Analysis Using Periodic Excitations” are used in the validation test, it is possible to find out the nature of the dominant errors (even or odd nonlinearities are missed in the model), giving indications how the model can be improved.

Linear validation tools

Linear validation tools check the whiteness of the residuals (auto-correlation test) and verify if no linear relations between the input and the residuals are left (cross-correlation). Although this are both second order moment tests that reveal only a subset of the possible problems (the higher order moments are not tested), it still provides valuable information.

Actions

•

Cross-correlation detected: A more flexible linear part of the model can reduce the linear dependency at a low cost.

•

Auto-correlation detected: The residuals are still colored. Using a linear noise model it is possible to reduce the prediction errors. This can improve the efficiency of the estimation procedure.

Nonlinear validation

Higher order moment tests

A full validation of a nonlinear model requires also the higher order moments to be “white” (strictly spoken this term applies only to the second order moment), and no higher order cross-correlations should exist. In practice these tests are often not made for the following reasons: i) Moments of order $n$ are $n-1$ dimensional objects. Visual evaluating these at all lags becomes very cumbersome and time consuming; ii) The required experiment length to estimate the higher order moments with a given precision grows very fast with the order $n$ , making such a test often unfeasible.

For some dedicated problems, higher order tests are proposed to detect the presence of nonlinearities [78].

Change the nature of the input

The behavior of a nonlinear system strongly depends on the nature of the excitation signal, even if the maximum input amplitude and power spectrum remain the same. For that reason, it is a very strong requirement for a model to cover a very wide range of input signals which can be tested during the validation. In Figure S9 it is shown on the Duffing oscillator that the nonlinear NARX model that was tuned for the tail data failed to give a good simulation on the sweep data.

Action: Verify the quality of the model on all relevant classes of excitation signals.

Check of the domain

Changing the nature of the excitation changes also the domain on which the internal nonlinear function (8) that is present in each nonlinear model is evaluated. The best guarantee to get a valid model is given by making sure that the complete domain of interest is covered during the estimation and validation. Although this is not always possible, it might be helpful to check the covered domain by plotting the phase plane trajectory for the estimated model for different excitations. This is illustrated on the Duffing oscillator example in Figure 7 and Figure S10.

Check the uncertainty bounds

As explained in the section “Generation of uncertainty bounds” in “Impact of Structural Model Errors”, and illustrated in Figure S22, it is very hard to provide reliable uncertainty bounds in the presence of structural model errors that dominate the noise disturbances. The actual observed variability of the model is larger than the theoretically expected one. For that reason it is indispensable to verify the validity of the theoretically calculated uncertainty bounds.

Action: Estimate the selected model structure with fixed complexity for different realizations of the excitation and verify if the actual observed standard deviation agrees with the theoretical one.

User guidelines:

•

Validating of a model is a rather subjective and pragmatic problem. Check on a rich validation data set that covers the intended use of the model if the estimated model meets the user expectations.

•

The final modeling goal should be kept in mind during the model validation: in many problems, structural model errors that are below a user defined level can be tolerated, even if these errors are clearly detected in the model validation step.

•

Check also the theoretically obtained uncertainty bounds. In the presence of structural model errors these are under estimating the actual variability.

•

Make sure that the validation tests cover the full domain of interest.

Examples of Nonlinear System Identification: From White to Black Box Models

In the next series of examples, the use of different levels of physical insight in the nonlinear system identification process are illustrated.

Examples I: Off-White Model

Tank System

The system

The cascaded tanks system is a benchmark system that was setup at the University of Uppsala [79]. It is a fluid level control system consisting of two tanks with free outlets fed by a pump that is described in detail in Figure 8.

A physical model

When no overflow occurs, a model can be constructed based on Bernoulli’s principle and conservation of mass:

[TABLE]

where $u(t)$ is the input signal, $x_{1}(t)$ and $x_{2}(t)$ are the states of the system, $w_{1}(t),w_{2}(t)$ and $e(t)$ are additive noise sources, and $k_{1},\ldots,k_{4}$ are constants depending on the system properties.

The relation between the water flowing from the upper tank to the lower tank and the water flowing from the lower tank into the reservoir are weakly nonlinear functions if there is no overflow (29), while in the presence of overflow, hard nonlinearities need to be identified.

To model the overflow, $x_{1}(t)$ and $x_{2}(t)$ are constrained to their maximum value, and an additional term $w_{3}(t)$ is added to the second equation in (29) for $x_{1}(t)>x_{1max}$ [80].

In [80] it is also proposed to add adiitonal terms $k_{5}x_{1}(t)$ to $\dot{x}_{1}(t)$ and $k_{6}x_{2}(t)$ to $\dot{x}_{2}(t)$ , to include also the losses in the fluid flow. The losses are proportional to the verlocity of the fluid squared and, therefore, proportional to the height of the fluid in each tank. It turned out [80] that this additional flexibility in the model is also used to accomodate other imperfections of the model, leading to an improvement that goes far beyond the expected impact of the loss terms in this system.

The data

The input signals are ZOH-multisine signals which are 1024 points long, and excite the frequency range from 0 to 0.0144 Hz, both for the estimation and test case (see Figure 9). The lowest frequencies have a higher amplitude then the higher frequencies. The sample period $T_{s}$ is equal to 4 s, the period length is 4096 s. Two similar data sets were collected, one for estimation and one for test (validation). The water level is measured using capacitive water level sensors, the measured output signals have a signal-to-noise ratio that is close to 40 dB. The water level sensors are considered to be part of the system, they are not calibrated and can introduce an extra source of nonlinear behavior.

Note that the system was not in steady state during the measurements. The system states have an unknown initial value at the start of the measurements. This unknown state is the same for both the estimation and the test data record and needs also to be estimated.

Cost function

The cost function used to match the model to the data is

[TABLE]

with $y_{mod}(t,\theta)$ the model output. Observe that either the simulation or one-step ahead prediction error can be minimized.

Results

In [80], the parameters of the simple and extended physical model are directly estimated using well-selected numerical optimization procedures. The fit error (28) of the simulated output equals $fit=4.23\%$ on the estimation data, and $fit=5.93\%$ on the test data. For the extended model that includes also the ’loss’ terms, this error drops to $fit=1.02\%$ on the estimation data, and $fit=1.78\%$ on the test data.

As a reference, also a black box Gaussian Process NARX model (see also Section “The Palette of Nonlinear Models: Black Models: NARX”) with 15 lags for the input and output was estimated, resulting in $fit=4.62\%$ for the simulation error, and $fit=0.057\%$ for the one-step-ahead prediction error. This illustrates once more that it is much easier to get a small prediction error than it is to get a small simulation error.

Examples II: Smoke-Grey Model (Semiphysical Model): An Industrial Buffer Flow System

The system

This is an example of the usefulness of recalibration of the time scale, [81] as explained in Section “Smoke-grey Models: Semi-physical modeling”. The process is a buffer vessel in a pulp factory, in Skutskär in Sweden. The pulp spends some 48 hours in the different stages of the process, cooking, washing, bleaching, etc. It passes through several buffers to allow for a smooth continuous treatment. It is important to know the residence time in the buffers for proper bookkeeping. The so called $\kappa$ number is a property of the pulp, measuring its lignin contents. The buffer vessel is schematically depicted in Figure 11.

So the problem is to find a model for the dynamics of the buffer vessel that allows evaluating the residence time in the vessel and to “time mark” the pulp as it passes through the several vessels.

The data

In a particular buffer, the $\kappa$ -number of the outflow, the output $y$ , was meausured along with $u$ , the $\kappa$ -number of the inflow. See Figure 12. The vessel level and flow were also measured, Figure 13.

First Model attempt: Linear Model Based on Raw Data

First estimate a linear process model using the input-output data $u,y$ using the first half of the data. That gave the model $G(s)=\frac{0.818}{1+676s}e^{-480s}$ . This model was simulated with the output and the model output is compared with the measured output in Figure 14.

This linear model is quite bad. The simulated output differs quite substantially from the measured output

Apply ”semiphysical modeling”

The physics behind the flow system needs to be taken into account. How does the flow and buffer level affect the buffer dynamics?

It there is no mixing in the vessel: (“Plug flow”), the vessel is just a pure time delay for the pulp flow: Delay time: Vessel Volume/Pulp Flow (dimension time.)

If there is perfect mixing in tank, the system is a text-book first order system with gain=1 and time constant = Volume/Flow

So if Volume and Flow are changing, the system is non-linear.

The natural time variable is really Volume/Flow, (which has been measured). The observed data can be resampled according to this natural time variable.

Recalibrate the time scale

Apply a nonlinear transformation to the raw data by re-sampling it to the natural time variable: In Matlab this becomes

z = [y,u]; pf = flow./level;

t = 1:length(z)

newt = interp1([cumsum(pf),t],[pf(1):sum(pf)]’);

newz = interp1([t,z], newt);

The resampled inputs and outputs are shown in Figure 15.

Second model attempt: Linear model based on resampled data

Building a linear process model from the first half of the resampled inputs and outputs gives the model $G(s)=\frac{0.8116}{1+110.28s}e^{-369.58s}$

Simulating that model and comparing with the measured output (for resampled data) gives a much better fit as shown in Figure 16, analoguous to Figure 14.

The semi-physical model is linear in the resampled data, but nonlinear in the original raw data in Figures 12-13. It gives a sufficiently good description of the buffer, to allow proper time-marking of the pulp before and after.

Examples III: Steel-Grey Model (Linearization-Based Model): The High Pressure Fuel Supply System

The team of Oliver Nelles (University of Siegen, Germany) developed a Local Linear Modeling based approach and used it to identify a high pressure fuel supply system (HPFS) that is used in a common rail direct fuel injection for diesel engines. The reader is referred to [82, 83, 30] for a detailed description of the project and the general methodology. This section is fully based on these references.

The system

The main components of a HPFS system are the high pressure rail, the high pressure fuel pump, and the ECU (Engine Control Unit; see Figure 17). The pump is actuated by the crankshaft of the engine. A demand control valve in the pump allows to control the delivered volume per stroke. A pressure-relief valve is also included in the pump, but should never open, if possible. Hence, we want to limit the maximum pressure during the whole measurement process. The pump transports the fuel to the rail, which contains the pressure sensor. From there, it is injected into the combustion chambers. The system has three inputs and one output. The engine speed $n_{mot}$ affects the number of strokes per minute of the pump and the engine’s fuel consumption. The fuel pump actuation $M_{SV}$ gives the fuel volume which is transported with every stroke of the pump. It is applied by opening and closing the demand control valve accordingly during one stroke of the pump. The injection time $t_{inj}$ is a variable calculated by the ECU, which sums up the opening times of the single injectors and is, thus, related to the discharge of fuel from the rail.

During the measurement procedure, the injection time is not varied manually but set by the ECU. The permissible times depend on many factors and a wrong choice could extinguish the combustion or even damage components. The engine load would have a major influence on the HPFS system via the injection time, but is omitted to prevent the necessity of a vehicle test bench.

The model

A NARX model (22) (see also Section “The Palette of Nonlinear Models: Black Models: NARX”) with external nonlinear dynamics (see “External or Internal Nonlinear Dynamics”)

[TABLE]

will be used to model the pressure $y(t)$ of the rail as a function of three inputs $u=[u_{1},u_{2},u_{3}]$ , with $u_{1}(t)=n_{mot}$ , $u_{2}(t)=M_{SV}$ , and $u_{3}(t)=t_{inj}$ . The local linear modeling method (16) is selected to represent and identify the nonlinear function $h$ [30] (see also the section on Steel-grey models in The Palette of Nonlinear Models). These are specified by the choice of i) the regime points, ii) the validity functions, and iii) the local linear models.

The regime points $p(k)$ , also called $z$ -variables in [82, 83], that are the entries of the validity functions $w(p(t),p_{i})$ (weighting functions) in (16) are reduced to one time delayed process inputs and output in this application: $p(k)=[u_{1}(k-1),u_{2}(k-1),u_{3}(k-1),y(k-1)]$ . This choice takes into account, that the actual operating point is defined by the level of the actual process inputs and output. The first (and higher) derivatives of the model inputs and output are assumed to be insignificant to describe the operating point.

The validity functions $w(p(t),p_{i})$ in this contribution are constructed using the hierarchical local model tree (HILOMOT) algorithm [85]. This incremental growing tree construction algortihm divides the input space with axes-oblique splits. The validity functions are generated by sigmoid splitting functions that are linked in a hierarchical, multiplicative way, see [85] for more details.

The procedure of the HILOMOT algorithm can be explained with the help of Figure 18. Starting with a global affine model, in each iteration an additional local affine model is generated. The local model with the worst local error measure (gray areas in Figure 18) is split into two submodels, such that the spatial resolution is adjusted in an adaptive way.

The local linear models $\hat{y}_{i}(t|\theta,Z^{t-1})$ are here ARX models of order 3, $Z^{t-1}=[u_{1}(t-1),u_{1}(t-2),u_{1}(t-3),\ldots,u_{3}(t-1),u_{3}(t-2),u_{3}(t-3),y(t-1),y(t-2),y(t-3)]$ .

The data

Experiment design: Two main aspects are driving the experiment design. The experiments should be rich enough to get a good estimate for the local linear models, but even more important, the experiments should be designed such that the regime points cover the full space of interest [82, 83, 30]. The excitation signal should also comply with the constraints imposed by the process, like restrictions on the amplitude level and signal gradients. In [83] a procedure is described that searches for randomized step like sequences that meet all these constraints. The step time is set by the dominant time constant of the system, while the step sequences are designed to assure that the regime points are well distributed over the full operational space of the system. Detailed information, including the design of excitations for multiple input systems is given in [83].

The data: The optimized experiment is called OptiMized Nonlinear InPUt Signal (OMNIPUS), and it will be compared to a second experiment where the excitation is a combination of a ramp and a chirp sequences proposed in [86]. The measurement time for each signal is limited to 10 minutes. The sampling frequency $f_{s}=100Hz$ , resulting in a signal length of $N=60000$ samples.

Cost function

The models are estimated by minimizing the squared errors (2) between the measured $y(t)$ and modeled output $\hat{y}(t|\theta)$ (31). No weighting is applied.

Results

Two local linear model networks (LMN) are identified, the first being estimated on the OMNIPUS data, the second on the ramp-chirp data. Both models are used to simulate the measured plant output on a test data set that consists of a new realization of a ramp-chirp and OMNIPUS excitation. The simulated output of both models on the test data are shown and discussed in Figure 19. The error for both local linear models is in most cases small (e.g. Figure 19 (a) and (b)). But there are also some significant mismatches for example in Figure 19 (c). The nonlinear behavior of the process is not well identified by the ramp-chirp LMN. This major mismatch between process and model indicates a poor modeling most likely because informative data in this area of operation are missing. Figure 19 (b) shows a minor mismatch between process and the OMNIPUS LMN.

As a reference, the same data were also processed using Gaussian process models (GPM) for the $h$ function in (31) [82], and shown in Figure 20. The GPMs do not show dramatical mismatches between process and the identified models. In the last half of the signal the ramp-chirp model seems to be slightly worse because of some slightly discrepancies. But overall, the GPMs seem to be valid in the range of operation. A quantitative comparison shows that the GPM and LMN models behave quite similar, but that the design of the experiment has a large impact on the generalization of the models (how do the models behave outside the training domain). In this example, the GPM models seem to be less sensitive to this problem. Overall, these observations emphasizes again the importance of a good design of the experiment.

Examples IV: Slate-Grey Model (Block Oriented Model): Hydraulic Crane

The Process

When handling tree-logs in forest harvesting, huge hydraulic cranes do all the lifing and moving. The cranes can be thought of as industrial robots that are contolled by hydraulic pressure and the conrolled output is the position of the gripper at the top of the crane. See Figure 21.

The crane shows oscillitory behaviour at the gripper and in order to design a good regulator, a model has to be developed

The Data

Some collected data from a particular crane are shown in Figure 22.

Clearly the dynamics is quite resonant.

A Linear Model

To find a model, first build a linear model of the crane using the data. The first half of the data sequence in Figure 22 was used to estimate the model and the second half was used to evaluate the fit between the model’s simulated output and the measured output. After some experimentation the best model was obtained as a 5th order linear state space model. The comparison between model and measured outputs is shown in Figure 23

This best fit for a linear model (42 %) is not very impressive, and not good enough for control design.

Nonlinear models: A hammerstein model

The lack of succss with linear models may indicate that there are nonlinear effects in the system. A simple test to see if nonlinearities can improve the fit is to try a Hammestein model, cf. Figure 5. A Hammerstein model with a 5th order linear system, preceeded by a non-linear (piece-wise linear) static nonlinearity estimated from the first half of the data record gave a model fit depicted in Figure 24. The estimated nonlineariry at the input is shown in Figure 25

The impovement from 42 to 72% fit with the input nonlinearity is quite impressive! In retrospect a “semiphysical” explanation can be given: Most of the resonance dynamics is due to the mechanical contruction of the crane. The transformation from the hydraulic input pressure to actual forces on the mechanical parts of the crane is more complicated, and in that way a model with an unknown nonlinear static transformation of the input acting on a linear system becomes physically feasible. Note that the estimated input nonlinearity essentially is a saturation.

Examples V(a): Black Box Volterra Model of the Brain

Volterra model of the brain

In collaboration with the Department of Biomechanical Engineering of the Delft University of Technology, the Netherlands, a regularized Volterra model has been identified for a part of the human sensorimotor system, as explained in Figure 26. Detailed information on this experiment is reported in [87, 88].

The system

In the experiment, the relation between the wrist joint motion and the electroencephalography (EEG) signals is modeled. The experiment setup used to obtain the EEG data evoked as a reaction to the wrist perturbation is depicted in Figure 26.

Volterra models

Regularized Volterra models are a nonparametric representation of the system, using multidimensional impulse responses (S13), called Volterra kernels. Because the number of parameters grows very fast with the degree of the kernel, an additional constraint will be imposed on it to express that the kernel should be smooth in some directions, and decays exponentially to zero along these directions. This is illustrated in Figure 27 for a second degree kernel. By using constrained models, it is still possible to identify the kernels from relatively short data sets. This turned nonparametric Volterra modeling into a handy tool [89].

The choice of the maximum degree in (S13) is critical. Choosing it too high results in a very fast growing number of parameters to be estimated, choosing it too low creates large structural model errors. A prior nonparametric analysis [15] showed that a linear model can capture only 10% (in terms of Variance Accounted For, VAF) of the characteristics of the wrist joint - brain system while more than 70% is attributed to even nonlinear behavior [88]. On the basis of these results, a second degree Volterra model will be estimated,

[TABLE]

including a DC-offset ( $\alpha=0$ ), a linear kernel ( $\alpha=1$ ), and a quadratic kernel ( $\alpha=2$ ). The output $\hat{y}(t)$ is calculated by direct evaluation of (S13)

Memory length: The average memory length of the selected models was 33 samples corresponding to approximately 130 ms at a sampling rate of 256 Hz. This choice resulted in the smallest structural model errors.

The data

Experiment design: The perturbation signals used were random phase multisines with a period of 1 s, resulting in a fundamental frequency $f_{0}=1$ Hz. Only odd harmonics of the fundamental frequency were excited, namely 1, 3, 5, 7, 9, 11, 13, 15, 19, and 23 Hz (odd random phase multisine). Exciting the nonlinear system using different phase realizations of a multisine signal (i.e. same amplitude per frequency, yet new random phases) allows for using different data for estimation and validation when modeling. Seven different multisine realizations were generated After transient removal, 210 periods are available for each of the seven realizations. The aforementioned choices for the duration of the excitation resulted in a total duration of clear perturbation equal to 24.5 minutes per subject. Including the extra time intervals corresponding to the preparation of the equipment, preparation of the subject as well as the pauses necessary for the safety and convenience of the subject, the total experiment time results to more than 2 hours!

Preprocessing of the data: The EEG measured signals were high-pass filtered with a cut-off frequency of 1 Hz, and also the 50 Hz disturbance from the mains was removed. In a second step, the data were averaged over the periods, resulting in an SNR of about 20 dB. For all participants the recorded output signals were shifted in time to impose a time delay of 20ms.

Cost function

The model parameters are estimated by minimizing a regularized least squares cost function [89], using a Baysian perspective [90] consisting of the sum of the squared differences between the averaged measured output and the modeled output $\hat{y}(t|\theta)$ (32), plus a regularization term (see also (S70) in “Black Box Models Complexity: Keeping the Exploding Number of Parameters under Control; Increased Structural Insight; Model Reduction”).

[TABLE]

The regularization matrix is a block diagonal matrix, where each block $P_{i}$ accounts for the regularization of the Volterra kernel of degree $i$ . Due to this choice, and the selection of an odd excitation (only odd frequency components are present in the multisine), the identification of the linear part is completely decoupled of the even nonlinear part (the zero and second degree kernels). The hyperparameters in the regularization matrices $P_{i}$ are tuned using a marginalized maximum likelihood estimator [90] which leads to a nonlinear numerical optimization in the hyperparameters (the model parameters $[\theta_{0}\theta_{1}^{T}\theta_{2}^{T}]]$ are eliminated in the marginalizing step). Once the hyperparameters are fixed, the remaining problem is linear in the model parameters $[\theta_{0}\theta_{1}^{T}\theta_{2}^{T}]]$ and are directly obtained by solving the linear least squares problem (33).

Results

The results are shown in Figure 28. Only the second degree kernel is shown. The linear part of the model captured about 10% of the output (in terms of Variance Accounted For, VAF) while more than 70% was attributed to even nonlinear behavior in the nonparametric analysis [88]. The Volterra series, combined with the regularization technique described above was used to model the nonlinear system behavior.

The obtained models were able to achieve on average 46% of the VAF on validation datasets across different participants. Keeping in mind that 10% of the output is due to noise, it indicates that about 44% of the output variance remains unmodeled. In this case, a fourth order kernel would be needed to improve the results further on, but even with regularization this remains still an unattractive problem. Richer and longer experiments would be needed, and that is not feasible. Although the second degree Volterra model still suffers from large structural model errors, it provides already very useful insights in the wrist-brain system.

In [71] a new experiment was made using a richer excitation so that the risk for overfitting is further reduced. Instead of exciting only the odd frequencies, also the even frequencies were excited. At that moment, the estimation of the linear part can no longer be decoupled from the nonlinear part of the model. With this rich data set, the VAF for increases to almost 60%, but the qualitative conclusions remained the same.

Examples V(b): Nonlinear State Space Black Box Model

a Li-Ion Battery

Lithium-ion batteries are attracting significant and growing interest, because their high energy and high power density render them an excellent option for energy storage, particularly in hybrid and electric vehicles. In this section, a nonlinear state-space model is proposed for the operating points at the cusp of linear and nonlinear regimes of the battery’s short-term electrical operation. This point is selected on the basis of a nonparametric distortion analysis as a function of state-of-charge (SOC) and temperature. More detailed information is available in [91] for a fixed SOC (10%) and temperature model (25°C). In [92], a nonlinear state space model is developed that covers a varying temperature (from 5°C to 40°C) and SOC (2% to 10%). For higher SOC values, it follows from the nonparametric distortion analysis that a linear model can be used.

The system and the experimental setup

A high energy density Li-ion polymer battery [EIG-ePLB-C020, Li(NiCoMn)] with the following electrical characteristics: the nominal voltage of 3.65 V, the nominal capacity of 20 Ah, and ac impedance (1 kHz) $<3$ m $\Omega$ along with the PEC battery tester SBT0550 with 24 channels is used for the data acquisition. The tests are performed on a preconditioned battery inside a temperature controlled chamber at 25°C.

Nonlinear state space model

A discrete time nonlinear state space model (S29) is selected to approximate the continuous time system (see "Approximating A Continuous Time NLSS With A Discrete Time NLSS Model"). The NLSS in (19) is rewritten as

[TABLE]

Although the split between the linear part and the nonlinear terms $F(x(k),u(k))$ , and $G(x(k),u(t))$ in (34) is not unique, it is convenient to write the equations like that because the initialization procedure that will be presented below starts from the best linear approximation of the nonlinear system. It results in initial estimates for $A,B,C,D$ .

The nonlinear terms $F(x(k),u(k))$ , and $G(x(k),u(t))$ are multivariate nonlinear functions. These will be written as a linear combination of nonlinear basis functions. The whole palette of possibilities can be used here, ranging from polynomials, hinge functions, to Gaussian processes. In this example, a polynomial representation will be used.

The data

An odd-random phase multisine signal is used as an input excitation signal. The band of excitation is kept between 1 and 5 Hz, because the dynamic range of interest of the battery for hybrid and electric vehicles applications is covered well within this band of excitation. It also takes into consideration the limitations of the battery tester in terms of the sampling frequency. The excitation signal has a period of 5000 samples, and the sample frequency $f_{s}$ is set to 50 Hz resulting in a frequency resolution of $f_{0}=50/5000=0.01$ Hz. The input is zero mean with an rms value of 20A. A detailed description of the whole measurement procedure (charging, discharging, $\ldots$ ) is given in [91].

Cost function

The cost function is formulated in the frequency domain on the difference between $Y(k)$ and $Y_{mod}(k,\theta)$ that are the discrete Fourier coefficients of the measured and modeled output.

[TABLE]

where $B$ is the set of frequencies of interest, and $W(k)$ a user selected weighting. Because structural model errors dominate, the weighting $W(k)$ is put equal to 1 (see "Impact Of Structural Model Errors").

Results

Nonparametric distortion analysis

At the start of the identification process, a nonlinear distortion analysis (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”) provides a lot of insight about the quality of the data, and the possible gain that can be made with a nonlinear model. In this case, it is seen in Figure 29 that the SNR of the data is very high (above 50 dB). The nonlinear distortions are very low for a high SOC, while they increase above 10% for low SOC (10%). More results for other temperatures, SOC, and amplitude ranges are available in [92]. On the basis of these results, it was decided to focus the nonlinear modeling effort on the low SOC range.

Parametric nonlinear state space model

The nonlinear state space model is identified in two steps. In the initialization step, the best nonlinear approximation is identified (using one of the classical linear identification methods). The noise weighting $W(k)$ in (35) during this step is set by the nonlinear distortions. These results are used to find initial values for the $A,B,C,D$ matrices. In the second step the nonlinear terms are added to the model. These are linear-in-the-parameters and initialized at zero. Eventually a nonlinear numerical optimization is used to minimize the cost function with respect to all the parameters (see [91] for more details). In this and the next steps, $W(k)=1$ because model erros dominate.

In this case, a polynomial nonlinear state space model is used with three internal states, and multivariate polynomials up to degree 3.

A major problem that often shows up in many applications is the appearance of instabilities. Even if the identified model remained stable on the test data, it happened frequently that the model still becomes unstable on the validation data. This is mainly due to the polynomial basis functions that were used in this example. Outside the state space domain covered by the test data, the polynomials have the tendency to grow very fast, which results in instabilities of the model.

The results are discussed in Figure 30. The nonlinear model outperforms the linear model with a factor 10 on the validation data, but the errors are still far above the noise floor.

Remark: Nonlinear state space models that include output feedback: It is also possible to replace the states $x(k)$ by a well selected set of outputs $y(k)$ as the input to the multivariate nonlinear functions, leading back to the full expression (19). Although such a representation can always be reduced to that in (34), it can be much more compact. A typical application are vibrating mechanical systems where the nonlinearities depend often on the displacement of the structure close to the nonlinearities which can be selected to be the states of the nonlinear equations [93]. This choice also simplifies the initialization of the identification problem, because in these applications the states are directly measurable outputs.

Conclusions

In this article, a full overview of the nonlinear system identification process is given. Nonlinear SI is a very rich topic with many different aspects. The selection of the topics, and the organization of the discussion is strongly influenced by the authors’ experiences. The discussion is aligned along the following topics: 1) Is a nonlinear modeling approach needed, or is it still possible to address the users questions using a linear design; 2) The lead actors of the SI process: experiment design, model structure selection, choice of the cost function, validation; 3) Illustration of the wide variety of methods on a series of experiments. The sidebars provide more detailed technical background information on some issues.

The main conclusions are formulated as a set of guidelines and summaries. More detailed guidelines are provided throughout the article.

Guidelines - Summary

•

Linear or nonlinear SI? The need for nonlinear modeling starts where linear modeling fails to solve the problems. Nonlinear SI is more involved than linear SI because models require higher flexibility, and the presence of structural model errors is difficult to avoid. Tools are available to check the level of nonlinearity in a nonparametric preprocessing step, allowing the reader to make a well informed decision.

•

The main actors of the nonlinear SI process

–

Experiment design

Use periodic excitations whenever it is possible.

*

Cover the domain of interest to keep structural model errors under control. Covering the amplitude range and the frequency band of interest are necessary but not sufficient conditions to guarantee that no internal extrapolation will show up in later use of the model.

*

The experiment should be informative to keep the noise induced uncertainty low.

–

Selection of a model structure: The choice of the model class is driven by i) the user preference (white box - black box models), ii) the system behavior (open loop or closed loop NL system), iii) Models for simulation or prediction.

User choice: Decide how much physical insight will be injected. The Palette provides an overview of models ranging from white to black box modeling

*

System behavior: Fading memory (nonlinear open loop models: the nonlinearity is not captured in a dynamic loop) are much easier to identify than nonlinear closed loop models where the nonlinearity is part of a dynamic closed loop. However dealing with complex behaviors like shifting resonances, changing damping, or even chaotic behavior requires nonlinear closed loop model structures.

*

The model complexity and effort is strongly affected by the later use of the model: prediction or simulation. The development of a good prediction model is less demanding than obtaining a good simulation model.

–

Choice of the cost function

Do structural model errors dominate?

·

Yes: the choice of the cost function is set by user criteria to shape the model errors.

·

No: the choice of the cost function is set by the noise properties.

*

Regularization is a very powerful tool that can help to keep the model complexity under control. It can be used in black box models to create either sparse models, or to impose smooth solutions.

–

Validation: does the model meet the user needs? Are the errors small enough in the domain of interest? Is there information left in the data?

Test the model on new data and check the internal domain

*

Check the linear validation criteria: Whiteness test of the output residuals, and cross-correlation test between the input and output residuals?

*

Nonlinear validation criteria: for example higher order moment tests.

•

Uncertainty bounds: If structural model errors are present (the model does not pass the validation test, but it is good enough for the intended application), no reliable theoretical uncertainty bounds are available. The variability of the model should be obtained from repeated experiments with varying excitations.

Sidebar

Static Nonlinearities

A static nonlinearity is the basic building block in nonlinear models. Consider the mapping

[TABLE]

If $z$ and $x$ are time varying signals in the model, the mapping is applied for each $t$ : $x(t)=H(x(t))$ , so the mapping is static. Such a mapping can occur in many contexts in a nonlinear model, and is indeed what constitutes the nonlinear behavior. See also (8). In a state-space model, $H$ can be mapping from the state at time $t$ , to the state at the next time instant, (20). In a regression model like (22), $H$ can be the mapping from the regressors to the model output. In block oriented models, Figure 5, the static nonlinearity block is a fundamental component. In this sidebar, the parameterization of static nonlinearities, $H(x,\theta)$ , will be discussed.

SISO nonlinearities

Consider the simple SISO case ( $p,m=1$ ) which brings out all the essential ideas. This is the case of describing a one-dimensional curve.

Breakpoint-based: Piecewise Constant and Piecewise

Linear

A very simple idea to parameterize a curve is to define its values at a number of breakpoints, $\{x_{1},\ldots,M\}$ , so

[TABLE]

would be the parameterization of the curve. An interpolation rule must then be applied to define $H(x)$ at the intermediate points:

•

Piecewise constant : the value $H(x)$ equals $H(x_{k})$ where $x_{k}$ is the breakpoint immediately to the left of $x$ .

•

Piecewise linear: The value $H(x)$ is linearly interpolated from its two neighbouring breakpoints.

Clearly, more sophisticated interpolation rules could also be used.

Basis function Expansions

A common way to parameterize functions is to choose a system of basis functions, $\{\rho_{k}(x),k=1,\ldots,M\}$ and parameterize the curve as

[TABLE]

Where the basis functions $\rho$ may or may not depend on $\theta$ . Very many basis function expansions are possible, and a few will be reviewed here:

Custom Regressors

With some physical insights, – or “semiphysical modelling” –, the user can come up with specially chosen basis functions $\rho_{k}^{c}(x)$ that reflect typical nonlinearities for the application in question, e.g. using $\rho(x)=\sqrt{x}$ if $z=H(x)$ describes the output in a free level flow system on $x$ is the level in the flw system. This gives the expansion

[TABLE]

Polynomial Expansion

The most common “black box” expansion is the polynomial expansion ( $\rho_{k}(x)=x^{k}$ ), the Taylor series

[TABLE]

Such polynomial expansions are in modelling contexts where $x$ consists delayed inputs, also known as a Volterra model, see (S13). (See also “External or internal nonlinear dynamics: Volterra Models”).

Linear Regressions

Note that if the basis functions $\rho_{k}$ are known (do not depend on the parameter $\theta$ , as in the previous two subsection) then (S3) is a linear regression, so it is easy to estimate $\theta$ from measurement of $z$ and $x$

Local Basis Functions

It is also possible to construct local basis functions, ”pulses” that are nonzero only over certain intervals:

[TABLE]

Then, if $\gamma_{k+1}=\gamma_{k}+1/\beta_{k}$ the expansion

[TABLE]

will be a piecewise constant function, like (S2) with breakpoints $\{\gamma_{k}\}$ . It can approximate any reasonable function arbitrarily well with sufficiently large $M$ . Note that

[TABLE]

where $\kappa(\cdot)$ is the unit pulse indicator, 1 for $0\leq x<1$ and zero elsewhere.

Neural Networks

Inspired by the approximation capability of (S7,S8) a “mother function” $\kappa(x)$ (also known as activation function) can be selected, translated it by $\gamma_{k}$ and dilated it by $\beta_{k}$ and the expansion

[TABLE]

can be be defined. Common activation functions are the Gaussian Bell (a “soft pulse”)

[TABLE]

and the sigmoid function (a “soft step”)

[TABLE]

Another common choice of the activation function is $\kappa(x)$ = ReLU(x) (Rectified Linear Unit)

[TABLE]

this makes $H$ piecewise linear and continuous in $x$ .

The expansion (S9) represents the simplest neural networks.

Trees

A useful mapping $z=H(x)$ can be defined via (binary) trees. Such trees are popular mappings and used in e.g. decision trees and classification trees. Here a regression tree will be defined. Loosely speaking, the value $x$ will be subjected to a number of binary questions at different nodes. All the binary answers will be collected and based on those a value $z=H(x)$ will be assigned.

More specifically, at the root node, for a scalar $x$ the basic question is $x\geq c_{0}?$ for some number $c_{0}$ . Depending on the answer a new question will be asked at the next level $x\geq c_{1}^{r}?,r=1,2$ . And so it continues, so at level $k$ there are $2^{k}$ nodes with similar questions $x\geq c_{k}^{r}?,r=1,...,2^{k}$ . For a tree with depth $n$ the final level is $k=n$ with $2^{n}$ nodes, called leaves. They correspond to a partition of the $x$ -axis into $2^{n}$ parts. Each leaf has a value $d_{k}$ for $H(x)$ at the $x$ -value that leads to that point. [Some branches may be cut before level $n$ , then the leaves are defined at a lower level.] So the tree defines a function $H(x)$ which is piecewise constant - all values of $x$ that belong to the same partition give the same value $d_{k}$ . So a tree is an alternative to piecewise constant functions (S2) where the breakpoints are defined in a somewhat contrived way be the tree partitioning. For a regression tree it is customary to add an interpolation step to compute the function value:

[TABLE]

where $d_{k}$ is the value produced by the leaf of the tree and $L_{k}$ is a scalar also provided by the leaf in question. In that way the tree $H(x)$ becomes a piecewise linear function of $x$ .

Gaussian Processes

Gaussian processes (GPs) offer a nonparametric approach to model $z=H(x)$ in (S1) that can be put in a Bayesian framework [46]. This static model can then be used to form dynamic models as in $f(x(t),u(t),\theta)$ and $h(x(t),\theta)$ in (20) or NARX models and other examples in this sidebar.

The idea is that function to be estimated is embedded in a stochastic framework so that $H(x)$ is seen as a realization of a (Gaussian) stochastic process.

So there is a prior (before any measurements) distribution with a (zero) mean, (large) variance and covariance function that describes the smoothness of $H$ .

When $z_{k}=H(x_{k}),\;k=1,\ldots N$ have been measured, the posterior distribution $H^{p}(x|z)$ can be formed for any $x$ . The mean of that function will be the estimate of the function $H$ and is formed by interpolation and extrapolation $[z_{k},x_{k}]$ using the probabilistic relationships. The reliability of the estimate can also be assessed from the posterior variance. [But is of course a reflection of the prior assigned distributions.]

If all variables are jointly Gaussian all this can be computed by simple and efficient linear algebraic expressions.

The idea is depicted in figure S1. The framework can be seen as a generalization of the regularization approach [90], see the example (33) and the discussion in “Black Box Models Complexity”. The reader is referred to [47] for a full introduction to the topic. A recent tutorial to apply GPs to modeling of dynamic systems is given in [46].

MISO nonlinearities

When the nonlinearity is multi input ( $m>1$ ) $x$ will be an $m$ -vector. Most of the ideas for SISO models carry over to the MISO case. Some specific aspects will be noted here.

Polynomial Expansion

In the polynomial expansion (S5) the term $\theta_{k}x^{k}$ should be interpreted as the sum of all those terms that can be created from exponentials of the components $x_{i}^{r_{i}}$ whose exponents sum up to $k$ . So each term $x^{k}$ expands into several new terms. Each of these terms require its own parameter, so the number of parameters in (S5) rapidly increases with $m$ and $M$ . A common special case of polynomial expansion is the Volterra model.

Volterra models

A Volterra system [94] is a NFIR system with a multivariate polynomial nonlinearity. Let the regressors be $\varphi(t)=[u(t),u(t-1),u(t-m+1)]$ . Then the model is

[TABLE]

The kernel $g_{\alpha}(\tau_{1},\ldots,\tau_{\alpha})$ is the multidimensional impulse response of degree $\alpha$ and it correspond to the model parameters $\theta_{k}$ .

The number of terms in the Volterra models is $O(m^{M})$ , and it becomes very large when the memory length grows. For that reason, Volterra models were only applied to problems with short memory length $m$ and moderate polynomial orders $M$ . Only recently, more complex problems could be handled [89, 88] by using the regularization framework [90] to reduce the impact of the exploding number of parameters (see “Black box models complexity: Keeping the exploding number of parameters under control; Increased structural insight; Model reduction”).

Neural Networks

For the neural network model (S9) it is customary to stick to the activation function $\kappa(x)$ with a scalar argument and rather reinterpret the argument with vector $x$ in either of two ways

•

Ridge construction Let $\beta_{k}$ be an $m$ -vector and interpret $\kappa(\beta_{k}(x-\gamma_{k}))$ as $\kappa(\beta_{k}^{T}x-\gamma_{k})$ . That means that the basis function assumes the same value for all $x$ in the hyperplane $\beta_{k}^{T}x=constant$ , thus creating a ridge structure for the function values. With a sigmoid activation function this leads to the celebrated one hidden layer feedforward sigmoid neural net.

•

Radial construction. Let $\gamma_{k}$ be an $m$ -vector of translation and let $\beta_{k}$ by a positive definite matrix of dilation coefficients (or a scaled version of the identity matrix) and interpret $\kappa(\beta_{k}(x-\gamma_{k}))$ as $\kappa(\|x-\gamma_{k}\|_{\beta_{k}})$ , where $\|x\|^{2}_{\beta_{k}}=x^{T}\beta_{k}x$ is a quadratic norm defined by $\beta_{k}$ . This means that the contribution of the basis function only depends on the distance between $x$ and a given centerpoint, thus giving a certain radial symmetry of the function. With the Gaussian activation function, this gives the well known radial basis neural network.

Trees

With an $m$ -vector $x$ , the questions asked at the nodes will be $[1x^{T}]^{T}C_{k}^{r}>0?$ for $m+1$ vectors $C_{k}^{r}$ . The linear interpolation term $L_{k}$ in (S12) will also be a $m$ -vector provided by the leaf in question.

MIMO nonlinearities

Turning to the Multioutput case, $p>1$ there is a common and simple solution: Treat it as $p$ independent MISO cases! If models are built from data, the actual models for the different outputs may come up in different shapes - and different model orders. Grey box thinking and/or experimental evidence may suggest that the same parameters may be used in different output channels, so as to get a more efficient representation. Some thinking of this kind is treated in the Sidebar “Decoupling of Multivariate Polynomials.” But this does not change the picture that the essential features of MIMO representation of static nonlinearities are captured when MISO models are constructed.

Sidebar

Nonparametric Noise and Distortion Analysis Using Periodic Excitations

In this section, tools will be presented that allow the user to detect and analyze the presence of nonlinear distortions during the initial tests. Using a well designed periodic excitation (27), the frequency response function of the BLA (see “Linear Models of Nonlinear Systems”), the power spectrum of the disturbing noise, and the level of the nonlinear distortions will be obtained from a nonparametric analysis without any user interaction. The user can set the desired frequency resolution and the desired power spectrum of the excitation signal. The phase will be chosen randomly on $[0,2\pi)$ . In this article, only a brief introduction is given, the reader is referred to [8, 57] for a more extensive discussion, and to [95] for measurements under nonlinear closed loop conditions.

The response of a nonlinear system to a periodic excitation

A linear time-invariant system cannot transfer power from one frequency to another while a nonlinear system does. Consider a nonlinear system $y=u^{\alpha}$ , excited at the frequencies $\pm\omega_{k},k=1,\ldots,F.$ The frequencies at the output of such a system are given by making all possible combinations of $\alpha$ frequencies, including repeated frequencies, selected from the set of $2F$ excited frequencies (see also Figure S2 for an illustration on a sine excitation with frequencies {-1,1}) [8]:

[TABLE]

Using Volterra series models (S13) [96], this result can be generalized to dynamic fading-memory systems that include also discontinuous nonlinear systems [36, 8, 58]. Chaotic systems are excluded from this study, because these have no periodic output for a periodic input.

In the next section, the nonlinear distortions will be detected using a multisine excitation where some amplitudes $U_{k}$ in (27) are put equal to zero for a well-selected set of frequencies.

Detection and characterization of the nonlinear distortions

Sine test: The most simple nonlinearity test is a sine test. As shown in Figure S2, nonlinear operations, like $x^{2}$ or $x^{3}$ create harmonic components (S14) that can be detected at the output of the system and reveal the nonlinear behavior of the system. This result can be generalized to dynamic nonlinear systems. The sine test method is very popular, for example, in mechanical engineering. To speedup the measurement, the sine is replaced by a sweeping sine [97, 15]. A sine test is not very robust because the higher harmonics that indicate the presence of nonlinear behavior can be amplified or attenuated by the linear dynamics of the system. In highly resonating or bandpass systems, the presence of nonlinearities can be strongly underestimated. For that reason, more robust tests were developed using multisine excitations (27).

Multisine test: Using well designed multisines, the nonlinear sine test is robustified (more reliable estimate of the nonlinear level) and speeded up [15]. The basic idea, illustrated in Figure S3, is very simple and starts from a multisine (27) that excites a well-selected set of odd frequencies (odd frequencies correspond to odd values of $k$ in (27)). This excitation signal is applied to the nonlinear system under test. Even nonlinearities show up at the even frequencies because an even number of odd frequencies is added together. Odd nonlinearities are present only at the odd frequencies because an odd number of odd frequencies is added together. At the odd frequencies that are not excited at the input, the odd nonlinear distortions become visible at the output because the linear part of the model does not contribute to the output at these frequencies (for example, frequencies 5 and 9 in Figure S3). By using a different color for each of these contributions, it becomes easy to recognize these in an amplitude spectrum plot of the output signal.

Disturbing noise characterization

In the next step, the disturbing noise analysis is made. By analyzing the variations of the periodic input and output signals over the measurements of the repeated periods, the sample mean and the sample (co-)variance of the input and the output disturbing noise is calculated, as a function of the frequency. Although the disturbing noise varies from one period to the other, the nonlinear distortions do not, so they remain exactly the same. This results eventually in the following simple procedure: consider the periodic signal $u(t)$ in Figure S4. The periodic signal is measured over $P$ periods. For each subrecord, corresponding to a period, the discrete Fourier transform is calculated using the fast Fourier transform (FFT) algorithm, resulting in the FFT spectra of each period $U^{[l]}(k),Y^{[l]}(k),$ for $l=1,\ldots,P$ . The sample means $\hat{U}(k),\hat{Y}(k)$ and noise (co)variances $\hat{\sigma}_{U}^{2}(k),\hat{\sigma}_{Y}^{2}(k),\hat{\sigma}_{YU}^{2}(k)$ at frequency $k$ are then given by

[TABLE]

and

[TABLE]

In (S15), $(.)^{H}$ denotes the complex conjugate. The variance of the estimated mean values $\hat{U}(k)$ and $\hat{Y}(k)$ is $\hat{\sigma}_{U}^{2}(k)/P$ and $\hat{\sigma}_{Y}^{2}(k)/P,$ respectively. Adding together all this information in one figure results in a full nonparametric analysis of the system with information about the system (the FRF), the even and odd nonlinear distortions, and the power spectrum of the disturbing noise, as shown in Figure 4 for the forced Duffing oscillator.

Sidebar

Simulation Errors and Prediction Errors

There are two basic uses of a model: Simulation and Prediction.

Simulation means that the model is subject to an input sequence $U^{t}=[u(1),u(2),\ldots,u(t)]$ and its response to that input is computed. Such applications are important to evaluate the system’s behaviour under new situtations without having to do actual experiments.

Prediction means that an observed input-ouput sequence $Z^{t}=[u(1),y(1),u(2),y(2),\ldots,u(t),y(t)]$ is given and a prediction of the next output $\hat{y}(t+1|t)$ is sought. (Or outputs $k$ steps ahead $\hat{y}(t+k|t)$ , in which case also tentative future inputs $u(t+1),\ldots,u(t+k)$ should be supplied.) Such applications are important for system prediction, but primarily for control design - control of a system subject to disturbances can be seen as control of the predicted output.

In (9) it was stated that a model is essentially a predictor for the output. The model is a simulation model if the prediction only depends on past inputs, so it will be focused on the simulation task. It is a prediction model if the prediction also depends on past outputs.

Note that a prediction model $h(Z^{t})$ (omit the parameter vector $\theta$ for simplicity) can also be used as a simuation model (simulating future outputs without access to past outputs), simply y replacing, recursively, the outputs in the pedictor by predicted outputs: Let

[TABLE]

This is depicted in Figure S5.

What can be said about the error created by simulation and prediction models, respectively? Let $h_{0}^{p}$ be the true prediction description of a system (based on the correct statisical properties. Let $h_{0}^{s}(U^{t})$ be the simulation model created from this using (S16). Then $\hat{y}_{0}(t)=h_{0}^{s}(U^{t})$ will be the noise-free output response to the input $U^{t}$ , and

[TABLE]

will be the true output error disturbance.

If $v$ happens to be white noise, the true model structure is a simulation model and $\hat{y}_{0}(t)$ are indeed the optimal predictions.

Otherwise, the disturbance $v(t)$ has a smooth or correlated behavior so that its future values can be partly predicted from its past. This property is used to improve the quality of the prediction. The past values of $v(t)$ can be estimated as the difference between the past model and measured output values $\hat{v}(t)=y(t)-\hat{y}_{0}(t)$ . This is the intrinsic idea that is used in the development of optimal prediction models.

Whether it is worth while to further develop an improved predictor or not depends on the nature of the disturbance $v(t)$ .

•

$v(t)$ is dominated by measurement or sensor noise: sensor or measurement noise are not related at all with the process of interest. In that case the main goal is the elimination of this disturbance so that $y_{0}(t)$ is the signal of interest, and hence a simulation model is the natural choice.

•

$v(t)$ is dominated by process noise: process noise is an intrinsic part of the system output. It models that part of the system output that is due to inputs that are not known to the user. In control applications good predictions are important so that the noise disturbance needs also to be included in the model. Moreover, the past outputs are available to decide on the next control action which turns the prediction model into a natural tool for these applications.

•

$v(t)$ is dominated by structural model errors: If the model set is not rich enough to capture the true system, structural model errors appear (See “Impact of Structural Model Errors” which can also be represented as a ’disturbance’.

A predictor can use past outputs to be informed about structural errors and find a better predictor than a simulation model. To take a trivial example: Consider a simple process $y(t)=y(t-1)+u(t)$ which ignores a structural error in the true system: $y(t)=y(t-1)+u(t)+C$ ( $C$ ignored in the model). The predictor $y(t)=y(t-1)+u(t)$ will have an error $C$ , while the simulated output cannot avoid an error $Ct$ .

This is one of the reasons why prediction is an ‘easier’ task than simulation, and why it is more demanding to get a small simulation error than a small prediction error.

Sidebar

Extensive Case Study: The Forced Duffing Oscilator

Throughout this article many results are illustrated on the forced Duffin oscillator, sometimes called the Silverbox in nonlinear benchmark studies. This section describes the setup and the experiments that are used throughout this article. A detailed description of the experiments is given in [15].

System, experimental setup, experiments

The system is an electronic circuit that mimics a nonlinear mechanical system with a cubic hardening spring as shown in Figure S6. This class of nonlinear systems has a very rich behavior, including regular and chaotic motions, and the generation of sub-harmonics [98, 99].

The experimental setup consists of a generator and two data acquisition cards that are synchronized to avoid leakage errors in the spectral analysis. The generator starts from the ZOH (zero-order-hold) reconstruction [20] $u_{\text{ZOH}}$ of a discrete time sequence $u_{d}(k)$ that is passed through a lowpass generator filter $G_{\text{gen}}$ to eliminate the higher harmonics of the ZOH reconstruction, $u(t)=G_{\text{gen}}u_{\text{ZOH}}$ . The sampling frequency is $f_{s}=10\text{MHz}/2^{14}\approx 610\text{Hz}$ . The data acquisition cards are alias protected (the signals are passed through a lowpass filter before sampling) and sample the input and output at a rate $f_{s}$ . High impedant buffers are used to eliminate the interaction between the plant and the measurement setup.

The experiments are shown in Figure S7 and consist of three parts using a ’tail’ (a), ’arrow’ (b), and ’sweeping sine’ (c) input $u(t)$ . The output (d),(e),(f) of the circuit corresponds to the displacement $y(t)$ . The following observations can be made

•

The tail (a) and swept sine excitation (c) have about the same RMS value. The arrow signal (b), that has a maximum amplitude that is twice that of the other signals, will be used to verify the extrapolation capabilities of the models that are identified on the other excitations. The swept sine excitation is a Schroeder multisine [8] that has a dominant odd behavior because the amplitude of the excited odd frequencies is about 30 dB dB above the level of the excited even frequencies (see (f),(l)).

•

All input signals have the same bandwidth (see Figure S7 (g),(h),(i)).

•

Observe that the output (f) for the sweeping sine (c) becomes very large around the resonance frequency, even if the input level remains constant. This results in an internal extrapolation for models that are identified on the tail (a).

•

In the input spectrum (g), it can be seen that there are spurious components at the odd multiples of the mains frequency (50 Hz). The signal is picked up by the circuit and acts as process noise.

•

The nonparametric distortion analysis in Figure 4 shows that he SNR of these measurements is 40 to 60 dB (measurement and process noise at the 1% to 0.1% level).

Linear simulation and prediction of the forced Duffing oscillator

The behavior of simulation and prediction errors (see “Simulation Errors and Prediction Errors”) is illustrated on the forced Duffing oscillator. Because the SNR of these measurements is very high (noise disturbances well below 1%), the simulation and prediction errors in the study below are completely dominated by structural model errors. The simulation and prediction errors will be shown for linear Box-Jenkins and ARX models [6, 7], and a nonlinear NARX model.

Linear simulation and prediction of the Duffing oscillator

In a first step a linear model

[TABLE]

with

[TABLE]

is estimated on the ’tail’ experiment using the prediction error framework [6, 7]. The order of the BJ-model is $n_{a}=n_{b}=2$ and $n_{c}=n_{d}=6$ , the plant and noise model $G,H$ are independently parametrized. For the ARX-model the order of the plant model is also $n_{a}=n_{b}=2$ , while in this case the noise model is $C=1,D(q,\theta)=A(q,\theta)$ . The latter fits with the assumption that disturbances origin mainly at the input of the system so that it shares its dominant dynamics with the input signal. The BJ and ARX simulation errors are shown in Figure S8 and compared to the 95% amplitude levels calculated from the estimated noise models. The BJ-noise model describes the disturbances very well, while the ARX model under estimates the errors around the resonance frequency.

These results are used to simulate $\hat{y}_{0}(t)=G(q,\theta)u(t)$ (see Figure S9 (g),(h),(i)), and predict $\hat{y}(t)=H^{-1}(q,\theta)G(q,\theta)u(t)+(1-H^{-1}(q,\theta))y(t)$ (see Figure S9 (j),(k),(l)) the output for the three experiments [6, 7]. The simulation errors of both the BJ and ARX model are on top of each other in these figures. However, the better BJ noise model results in significantly smaller prediction error for the BJ models compared to those of the ARX model. By increasing the model orders $n_{a},n_{b}$ for the ARX model, these results could be improved. However, that would increase the number of terms in the NARX model. Therefore, and also for didactic reasons, the discussion is continued with the too simple model. In practice, the user has to balance the model complexity against the required model quality.

Summary

•

Linear prediction can provide good results in the presence of nonlinear distortions. This is much harder for linear simulation.

•

The quality of the ‘noise’ model has a strong impact on the quality of the predictions.

•

A changing nature of the excitatition signal results in a changing behavior of the residuals (due to structural model errors). This can turn a good prediction model in a poor one.

Nonlinear simulation and prediction of the forced Duffing oscillator using a NARX and a Nonlinear State Space Model

A polynomial NARX (22) and a polynomial nonlinear state space model (20) is estimated on one of the realizations of the random phase multisines in the tail.

The NARX and PNLSS model

The NARX model is of order $n_{a}=2,n_{b}=2,$ and polynomial expansion of degree 3 with arguments $R=[u(t),u(t-1),u(t-2),y(t-1),y(t-2))]^{T}$ . The selection of the nonlinear degree and arguments was done following the results in [100] that were obtained on a trial and error basis.

[TABLE]

It might also be possible to get better results by replacing the polynomials by other basis functions. Most important is to observe that his model is linear in the parameters $c_{p,m-p}$ , which will reduce the identification to a problem that is linear-in-the-parameters for a quadratic cost function.

The polynomial NLSS is of order 2 (two states) and polynomial degree 3 with arguments $R=[x_{1}(t)x_{2}(t)u(t)]^{T}$ :

[TABLE]

Observe that only the state transition equation has a nonlinear term, it turned out that adding a nonlinear term to the output equation did not improve the results.

Cost function

NARX models with an expansion in known basis functions are linear-in-the-parameters, and hence the cost function, given by the squared equation errors, is minimized by solving a linear set of equations, so that no initialization problem needs to be solved. This is the major advantage of this class of NARX models, compared to many of the other models presented in this article. Because the noise enters nonlinearly into the model, a bias will appear, but as long a the SNR is decent (for example better than 20 dB), this will not be the major issue. For that reason, NARX models became one of the most popular methods, and they are very widely applied in many different fields.

For the NLSS, the cost function is nonlinear in the parameters (non convex), and a numerical method is used to minimize the costfunction.

[TABLE]

where $u(t),y(t)$ are the measured values.

Results

The results for the NARX and NLSS models are shown and discussed in Figure S10. For smaller output levels, the two models are quite comparable, but for larger outputs, the errors of the NARX model are two to three times larger than those of the NLSS model.

Summary

•

NARX models with an expansion in known basis functions are very attractive because these are very simple to use: no initialization, and no problems with local minima show up. The PNLSS results in (slightly) better results because it was better tuned.

•

The nonlinear prediction models do significantly better than the linear ones, even if they are not optimally tuned.

•

The polynomial regression functions can be replaced with more convenient alternatives from machine learning (Gaussian bells, hinge functions, etc.) as discussed also in “Static Nonlinearities”.

•

The NARX method can be combined with pruning and structure revealing methods to simplify the models.

Pruning, data driven structure retrieval, and model reduction

NARX model pruning

The major drawback of NARX models is the combinatorial grow of the number of parameters with the number of regressors (in this example 5) and the degree (in this example 3). No further pruning of the model was made in the results presented here. It is well known pruning can improve the model quality significantly, especially if the number of regressors grows [44].

In [44], different strategies are presented to include the dominating terms gradually, for a growing complexity of the NARX model.

Recently, a top down approach is proposed using the decoupling strategy presented in "Decoupling Of Multivariate Polynomials" [100]. First a full model (including all regressor combinations) is identified, resulting in a single output multivariate polynomial which is next decoupled as a sum of univariate polynomials $P_{k}$ that act on linear combinations of the regressors. In [100], the NARX Silverbox model is decoupled with 4 polynomials of degree 3: $y(t)=\sum_{k=1}^{4}P_{k}(v_{k}^{T}R)$ . Especially for a large number of regressors, this method offers a systematic approach to model pruning.

NLSS model reduction and model data driven structure retrieval

Decoupling: Similar to the NARX model, it is also possible to decouple the multivariate nonlinear vector function $[f_{1},f_{2}]^{t}$ in (S21) using the decoupling strategy presented in "Decoupling Of Multivariate Polynomials". It turned out that 4 internal SISO branches were needed in the decoupled representation (see Figure S27) that are shown in Figure S11. In a second step the degree of the polynomials was increased from 3 to 5, and the cost function was minimized for this decoupled model. This reduced the RMS error of the decoupled model to 0.40%, while the original full $3^{th}$ degree model had an RMS error of 0.49% (see Table I).

Model reduction: In the following step, the number of branches of the decoupled model was reduced. In each reduction step, a new cost function minimization was done. The RMS errors of the reduced models on the Tail and Arrow validation data is shown in Figure S12. Although it could be expected that the error would start to grow when the model is simplified, it turns out that the errors remain constant on the Tail data and even drops significantly on the Arrow data. This indicates that the true system can be described with only one nonlinear branch in the decoupled model (see Figure S27). The reduced error on the Tail data indicates a better generalization (extrapolation) capability of the reduced model. The models are estimated on the Tail data that cover a smaller domain than the Arrow and Sweep data as is shown in Figure 7.

The evolution of the RMS error on the Tail validation data, the number of linear parameters $n_{\theta_{L}}$ , and the number of nonlinear parameters $n_{\theta_{NL}}$ as a function of the model complexity during the model reduction process is given in Table I.

Data driven structure retrieval: The final NLSS model is

[TABLE]

In these equations, $g(p)$ is a SISO polynomial of $p$ that is a linear combination of the states $x_{1},x_{2}$ and the input $u$ (with coefficients $v_{1},v_{2},v_{3}$ ). The coefficients $w_{1},w_{2}$ scale the polynomial output $g(p)$ .

The validation results on the Arrow data are given in Figure S13. Observe that the full and the final reduced model have almost the same quality.

Because the forced Duffing oscillator is a lab setup, this model can be compared to the physical model. It turns out that in this case the data driven retrieved model structure and the physical model structure are identical. This is a very motivating result for the black box identification framework. However, this conclusion cannot be generalized because it is not obvious that the most simple model coincides with the physical model. Moreover, the structure of many (non)linear systems is unidentifiable from input output data due to indistinguishability problems: the same data can be represented by multiple models that can not be distinguished from each other without having additional information (for example the measurement of an internal signal). In the case of the forced Duffing oscillator, it is also possible to represent the same input-output behavior by a system with a nonlinearity in the forwards path instead of the feedback path [42].

Remark

An alternative approach to obtain a highly structured model for the forced Duffing oscillator in a black box modeling framework is presented in [101]. It is based on a Data-Based Mechanistical Modeling approach as explained in [102], where the objective is to obtain a model that can be interpreted in the mechanistic terms that are most appropriate to the nature of the dynamic system. The nonlinear nature of the system is captured by using state dependent parameters. The basic idea is to start with a linear model and check for data dependent variations of (some of) the parameters in the model. For the Duffing oscillator, this resulted in a transfer function model that is very similar to the final model (S23).

Summary

•

The decoupling method offers a data driven systematic approach to model pruning.

•

Tuning the number of branches offers a data driven model reduction approach.

•

The simplified models can significatnly improve the intuitive insight in the nonlinear model behavior.

•

The underlying physical model structure can sometimes be retrieved, although no guarantee can be given that a simple model coincides with the physical model.

Conclusions

In this case study structural model errors dominate the noise disturbances. The following general observations can be made:

•

The prediction errors for a given plant model are smaller than the simulation errors. This was expected because prediction methods use explicitly the output measurements to reduce the simulation error.

•

The prediction method decreases the impact of the structural model errors compared to the simulation method. This holds true as well for the linear simulation/prediction models as for the nonlinear models.

•

The quality of the prediction depends strongly on the quality of the noise model. The better BJ noise model results also in better predictions than those of the NARX model.

•

A nonlinear model captures better the system behavior than a linear model does. This results in smaller simulation and prediction errors.

•

The decoupling methods are a powerful tool to pruning, data driven structure retrieval, and model reduction

Sidebar

Approximating a Continuous Time NLSS with a Discrete Time NLSS Model

Although real-world systems evolve in continuous time, discrete time models are preferred in many (control) applications because most simulations and many controllers are implemented digitally [103]. However, without special care, the discrete time approximation can have a completely different behavior than the original continuous time system [104, 105]. To provide a better understanding of these problems, the discretization of linear systems is first considered, and next the discrete time approximation of continuous time nonlinear systems is discussed.

Discretization of linear systems

ZOH-discretization: In linear system identification, it is well known that a continuous time system that is excited by a zero-order-hold (ZOH) excitation (also called piecewise constant excitation [6]) can be replaced by a discrete time model that gives an exact description of the discrete time input-output relations [6, 7, 20]. Generalizing this result to nonlinear systems is not always possible because the ZOH nature can be lost inside the system. For example, the output of the feedback in a nonlinear closed loop systems will be no longer ZOH. For that reason, solutions that do not rely explicitly on the ZOH-nature are needed.

Alternative discretizations: Replacing the continuous time differentiation by a finite difference is intuitively very appealing:

[TABLE]

or in the Laplace- and Z-domain

[TABLE]

In the frequency domain, $s=j2\pi f$ and $z^{-1}=e^{-j2\pi f/f_{s}}\approx 1-j2\pi f/f_{s}+O((f/f_{s})^{2}$ , and (S25) becomes

[TABLE]

This simple solution works only well if the sample frequency $f_{s}$ is much larger than the frequency band of interest. In Figure S14 (a) it can be seen that the relative error grows to 100% for $f>0.3f_{s}$ . In Figure S14 (b) the transformation (S25) is applied to a continuous time first order system

[TABLE]

Observe that also here the error is very small around the origin, but grows very fast to 100% for $f>0.3f_{s}$ .

A very popular approach in the system identification community is to identify directly, in the frequency band of interest, a discrete model $G(z)$ for he continuous time system $G(s)$ . This is also illustrated in Figure S14 (b). A first order discrete time model $G_{2}(z)$ is obtained by a least squares fit in the frequency band $[00.25]f_{s}$ . Observe that the error of $G_{2}(z)$ is much smaller than that of $G_{1}(z)$ .

The reader is referred to the signal processing literature for more information on the classical solutions to this problem (impulse invariant transformation, bilinear transformation, …) [106].

Discretization of nonlinear systems

Finding good discrete time approximations for continuous time nonlinear systems has been studied for a long time [107, 44, 108, 109]. As was done for linear systems, the discussion starts with dedicated methods for ZOH excitions, followed by an approach to deal with the more general class of lowpass excitations.

ZOH framework: An exhaustive overview of the discretization problem is made in [103], and references therein. The authors look for an approximate discrete time representation, operating in a ZOH-framework. Some of their major conclusions are:

•

The popular ’forward Euler method’ should be used with extreme care because it results in relative errors that grow fast with $f/f_{s}$ leading to the need for a very high oversampling. This confirms the observation made in Figure S14.

•

The dynamics of sampled data models can be different from those of the continuous time system, numerical ’sampling zeros’ are created.

•

A ’Truncated Taylor Series Approximate Model (TTS)’ is proposed to approximate a continuous time nonlinear state space equation by a discrete time equivalent. The global fixed-time truncation error (the error when integrating over a fixed time interval) drops to zero, proportional with the inverse of the sampling frequency $f_{s}$ as an $O(\Delta=1/f_{s})$

Lowpass framework: An alternative approach for the approximation of a continuous time nonlinear statespace models is presented in [110]. Consider

[TABLE]

that is excited with an input signal $u(t)$ that is a lowpass signal of degree $d_{u}$ (the square root of the power spectrum is an $O(1/f^{d_{u}})$ for $f>f_{c}$ ). Observe that a ZOH-excitation is a lowpass signal with $d_{u}=1$ . Similar to the identification approach for the linear first order example in the previous section, a discrete time nonlinear state space

[TABLE]

is identified directly to the data. The discrete time signals $x_{d}(k),y_{d}(k)$ equal the sampled continuous time signals $x(t),y(t)$ within an error:

[TABLE]

$d_{x}=\textrm{min}(d_{u},d_{F})+1$ , with $d_{F}$ a characteristic of the nonlinear system as explained in Figure S15 [110]. For a a static discontinuous nonlinear function, $d_{F}=1$ , and $d_{F}=n+1$ if the $n^{th}$ derivative exists on the domain of interest [111, 110]. The error $\varepsilon_{I}$ is a user choice that is set by the choice of the complexity of the discrete time nonlinear state space. The discretization error is dominated by the aliasing effect that is due to the sampling of the input, output, and internal signals. Observe that the error in the direct identification approach is smaller than that of the TTS ZOH-approach. This is again due to the fact that the discrete time model is tailored to the continuous time data in the least squares minimization step.

This result provides a strong theoretical foundation for the popular and successful practice to identify directly explicit discrete time models of continuous time nonlinear systems.

Illustration: discretization of the Duffing oscillator

The results of the previous section are illustrated on the Duffing oscillator setup (Figure S6) described in “ Experimental setup: The forced Duffin oscillator”. In this case, the ZOH-output $x_{\text{ZOH}}(t)$ of the generator filter is filtered by a $4^{th}$ -order lowpass filter with a cut-off frequency of 200 Hz. A wide band zero mean excitation with a bandwidth of $f_{s}/2$ is used as the excitation signal in the frequency band [0 - 39062 Hz]. With these settings, the signal $u_{c}(t)$ has a flat amplitude spectrum up to 200 Hz, to be compared to the bandwidth of the second order system that is below 100 Hz.

The continuous time input and output are sampled at a high sampling frequency $f_{s}=78125\text{ Hz}$ , and next the data are subsampled at different rates [110]. For each subsampled data set a discrete time nonlinear state space model is identified on the measurements using a DT nonlinear polynomial state space model (PNLSS), using the methods described in [112] and “Extensive case study: The forced Duffing oscilator”. The model has 2 states, and the degree of the internal multivariate polynomial is 3 (it depends on both the states and the input). Higher orders and degrees were tested, but this did not significantly improve the results. The results are shown in Figure S16. It shows that the errors drop as an $O(1/f_{s}^{4})$ till the noise floor or the structural model error floor is reached. The drop rate is proportional to the drop of the alias errors, and much faster than $1/f_{s}$ .

Summary

Different options are available to approximate a continuous time system/model by a discrete time model.

•

Data driven approach: A nonlinear discrete time model can be identified directly from the experimental data. The approximation error is dominated by the alias error that is $O(1/f_{s}^{d_{x}-1.5})$ . The model complexity of the discrete time model can be higher than that of the continuous time counterpart in order to keep the discretization error below a user defined error level.

•

Model based approach: If a continuous time model is available, it is possible to turn it into a discrete time model. Two options are open for the user:

–

Truncated Taylor Series Approximate Model: The continuous time equations are transformed into a set of discrete time equations using a Taylor series approximation. The major advantage is that there is a close connection with the original (physical) equations. The main drawbacks are i) the restriction to ZOH excitations, ii) the approximation error is $O(f_{s}^{-1})$ which drops slowly with the sample frequency.

–

Simulation-Identification approach: The conitnuous time model is used to create a rich data set that is used as the input for a direct fit of a discrete time model. The major advantages are that i) a compact model is obtained with a user controllablle balance quality/complexity. ii) By a proper design of the data set, the user can focus the model on the intended application. The major disadvantage is the loss of physical interpretablility of the model.

Sidebar

External or Internal Nonlinear Dynamics

Dual representation of linear systems

Infinite impulse response (IIR) models: A linear system can be modeled by the recurrent representation (S31)

[TABLE]

The system (S31) can have an infinite memory (its impulse response has an infinite length) and is called for that reason an infinite impulse response (IIR) model.

Finite impulse response (FIR) models: The equivalent impulse response representation of (S31) is $y(t)=g(t)*u(t)=\sum_{k=0}^{\infty}g(k)u(t-k).$ Truncating this inifinte long impulse response to a finite length $n$ leads to the finite impulse response (FIR) model

[TABLE]

Dual representation of linear systems: From the previous discussion, it follows that linear systems can be either represented by an IIR model $y(t)=h_{IIR}(u(t),\ldots,u(t-n_{b}),y(t-1),\ldots,y(t-n_{a}))$ or an FIR model $y(t)=h_{FIR}(u(t),\ldots,u(t-n))$ (where $n$ can grow to infinity).

From behaviour point of view there is no difference between both representations. However, a structural difference is how the memory (dynamics) is created in both representations. The FIR model $h_{FIR}$ makes no use of an ’internal’ memory, the dynamic behavior is obtained by using delayed inputs. For that reason it will be called a model with “external” memory. The IIR model $h_{IIR}$ includes also delayed outputs to create an ’internal’ memory.

NFIR and NIIR models

Internal or external dynamics: By choosing $h$ in (S32) and (S31) to be nonlinear, the linear FIR-IIR classification can be generalized towards nonlinear NIIR systems $y(t)=h_{NIIR}(u(t),\ldots,u(t-n_{b}),y(t-1),\ldots,y(t-n_{a}))$ having internal dynamics (S32) or NFIR systems $y(t)=h_{NFIR}(u(t),\ldots,u(t-n_{b}))$ having external dynamics (S31) [48, 30]. For notational simplicity, the $h_{NFIR}$ and $h_{NIIR}$ will be both written as $h$ , the difference between both models follows from the arguments of the function.

For linear systems, it is a free users choice to select the FIR or IIR representations, there is a full equivalence between both representation. This is no longer the case for nonlinear systems. Not all NIIR systems can be modeled with a NFIR model as explained below. For that reason, the choice between NFIR and NIIR systems/models affects the structure, the behavior, and the stability properties. Also the numerical methods how to deal with these systems are strongly dependent upon it.

Structural aspects: The NFIR/NIIR nature of a system is uniquely linked to its topology: for NFIR systems there can be no dynamic closed loop around the nonlinearity while for NIIR systems the nonlinearity is captured in a dynamic closed loop.

This property is used in the structural detection framework [113]. It starts from the best linear approximation (BLA) that is identified for a varying input offset or amplitude [114]. The poles of the BLA remain fixed for NFIR systems while they move for NIIR systems.

Remark: The class of systems that is included in the NIIR representation can be further generalized by including also the nonlinear state space models that have a nonlinear feedback over some of the states, such that internal dynamic nonlinear closed loops are created.

Behavior aspects: Some typical nonlinear behaviors like chaos, bifurcations, jumps, moving resonances, autonomous oscillations (see Figure S17) can only be generated by NIIR systems. NFIR models cover fading memory systems whose behavior is ’closer’ related to that of a linear system. Fading memory systems [36], forget the past inputs asymptotically over time. Loosely spoken, it can be stated that a nonlinear system that has no fading memory requires a NIIR model.

NFIR systems are a subset of fading memory systems. All stable NFIR systems have a fading memory, unless the nonlinearity would be discontinuous. The most simple example of such an exception is a static discontinuous system. An NIIR system can have a fading memory behavior on a restricted input domain (for example the Duffing oscillator in "Linear simulation and prediction of the forced Duffing oscillator" in “Extensive case study: the forced Duffing oscilator”), and on that input domain NFIR models can be used to approximate the NIIR system. This is illustrated on the Duffing oscillator in [115].

However, the output of nonlinear NIIR systems can also show bifurcations and jump phenomena (small variations at the input can result in an a sudden ’qualitative’ or topological changes in the output), even for systems with smooth nonlinearities [116, 117, 98, 99]. The output can become even chaotic, or autonomous oscillations can appear. An NFIR model cannot model these phenomena, illustrated in Figure S17.

Stability: While the stability of NFIR systems can be guaranteed under very general conditions, it is much harder to analyse the stability of NIIR systems. Although general theories exist to analyse and impose stability on NIIR systems [118], often the user is left with extensive simulations in order to check stability [30].

Numerical aspects: The numerical optimization aspects of nonlinear systems/models are strongly affected by moving from NFIR to NIIR. Calculating the derivative of the output of a NIIR system with respect to the model parameters boils down to computing the output of another nonlinear model [119].

Table II gives an overview of NFIR and NIIR systems that are considered in this article. The reader is referred to [19] for a more extensive table of different models and behaviors.

Summary - User guidelines

•

Internal or External dynamics models: The choice between an external dynamics NFIR model (no nonlinear closed loop) or an internal dynamics NIIR model (nonlinear closed loop present) strongly influences the complexity of the identification problem.

•

Initial tests or insight: Initial tests or insight can help to choose between the two model classes. This avoids that a lot of time and effort is wasted by trying to fit the wrong model structure to the data.

•

External dynamics model class: Only systems with a fading memory behavior can be modeled, with a small error, with an external dynamics model. Typical models are: NFIR, Volterra, open loop block-oriented models (e.g. Wiener, Hammerstein, Wiener-Hammerstein and Hammerstein-Wiener). These models can only be used if the poles of the BLA do not move for varying experimental conditions.

•

Internal dynamics model class: Strong nonlinear behaviors like moving resonances, jump phenomena (bifurcations), hysteresis, autonomous oscillations can only be modeled with internal dynamics (NIIR) models. External dynamics (NFIR) models are not able to represent such phenomena. Typical models are: NARX, block-oriented models with feedback, nonlinear state space models. Whenever the poles of the BLA move for varying experimental conditions, this class of models is to be preferred.

Sidebar

Impact of Structural Model Errors

Introduction

Any estimated model has some error. It is important to distinguish between two error sources

•

Structural model errors: These are errors that come from deficiencies in the model structure. The model is simply not capable of producing correct model outputs. Even with an infinite amount of perfect estimation data, the model output will have errors.

•

Random model errors: The disturbances present in the estimation data will affect the model, so even when there are no structural model errors, the model output will have errors.

Here these concepts will be defined more precisely. For simplicity only the case of output error models (or simulation models), will be treated, see “Simulation Errors and Prediction Errors” (S17): $y(t)=y_{0}(t)+v(t)$ . The term $v(t)$ models the disturbances which are assumed, for simplicity in this section, to be additive and independent of the input.

For a simulation model, the predictions of $y(t)$ will only depend on past inputs:

[TABLE]

Clearly, the output $y(t)$ contains an innovation $\nu(t)=y(t)-\hat{y}(t|\mathcal{S})$ which cannot be predicted by any model, even not with the true system. “Model errrors” $y_{\varepsilon}$ will always be in addition to $\nu$ . Consider two cases

•

The true system is in the model class, $\mathcal{S}\in\mathcal{M}$

There exists a $\theta_{0}$ such that $h_{0}(u^{t-1})=h(u^{t-1},\theta_{0})$ . Then normally the estimate $\hat{\theta}_{N}\to\theta_{0}$ as $N\to\infty$ [6], and the model parameter error $\theta_{0}-\hat{\theta}_{N}$ will be a random error caused by data disturbances $v(t)$ . So the model error $h(u^{t-1},\theta_{0})-h(u^{t-1},\hat{\theta}_{N})$ will be defined by the random error in $\theta_{0}-\hat{\theta}_{N}$ , and

[TABLE]

•

The true system is not in the model class, $\mathcal{S}\notin\mathcal{M}$ .

Define

[TABLE]

The subscipt $u$ is to emphasize that $\theta^{*}_{u}$ will depend on the (statistical properties of the) input $u$ . The output model error $y(t)-\hat{y}(t|\mathcal{M}(\hat{\theta}_{N}))$ can now be decomposed as

[TABLE]

Denote the structural model error in (S37) as $y_{\varepsilon}(t)=h_{\varepsilon}(u^{t-1})$ . Observe that this error depends upon the input.

Remarks

Under mild ergodicity properties of estimation data disturbances the parameter $\theta^{*}_{u}$ in (S36) obeys

[TABLE]

which stresses that $\theta^{*}_{u}$ is the “best model available in the set $\mathcal{M}$ (for the chosen input)”. 2. 2.

For some applications it may be of interest to consider a family of different input properties and configurations, $u\in\mathcal{X}$ . If these configurations are equipped with a probability measure, the “best model for the family $\mathcal{X}$ ”, $\theta_{BAM}$ can be defined as

[TABLE]

Experiment design

A system describes that part of reality that is of importance for the user. The main idea of system theory is to make this description independent of the actual inputs that are applied, creating a clear split between the system characteristics and the signals on which the system acts. For the system identification theory, it implies that the excitation signal does not affect the system, and hence the model should not depend upon it. In the presence of structural model errors and approximate modeling, this paradigm does no longer hold. The approximate model is only valid around a given working point in a restricted input domain where the structural model error $h_{\varepsilon}(u(t))$ is acceptably small. The approximate model depends on the working point and on the class of inputs, so that a major advantage of the system theoretic framework is lost. Nevertheless, this is the best that can be done if it turns out that a complete model class that includes the true system would be too complex.

Illustration on a static nonlinearity $y_{s}(t)=au(t)$

The dependency of the model on the excitation class is illustrated in Figure S18 where the true system is given by $h_{0}(u(t))=\arctan(u(t))$ , and the simplified model is $y_{s}(t)=au(t)$ . The disturbing noise $v(t)$ is put to zero to focus completely on the impact of the excitation class on the approximate model. A clear dependency of $\theta^{*}_{u}$ on the input distribution can be observed.

The results of this simple illustration are generally valid. Whenever a complex system is approximated with a simplified model, the results will depend on the actual applied inputs. For that reason the experiment should be designed such that it covers the input domain of interest. It should be tuned to the problem to be solved so that the structural model errors $y_{\varepsilon}(t)$ remain acceptable small for the user. This will set the actual subdomain that needs to be covered. This is illustrated on the Duffing oscillator example in Figure 7 and Figure S10. Within this restircted domain, it still remains important to select signals that are sufficiently rich to collect as much information as possible within the tolerable cost of the experiment.

Example: Design of an excitation for the Wet-Clutch setup

A simulation model of the Wet-Clutch setup in Figure S19 was used during an iterative control design (ILC) for the system. The goal was to obtain a fast but smooth engagement of the clutch. To reach that goal, spiky signals with a large amplitude are used to get a short filling phase, followed by a small excitation to get a smooth engagement. The design of rich excitation signals that mimic this behavior is discussed in Figure S20.

Choice of the cost function

No structural model errors present

In linear system identification, the classical choice for the cost function is [6, 7]

[TABLE]

with $G$ the plant model, and $H$ the noise model. This can also be written in the frequency domain[6] (neglecting the begin and end effects that create leakage errors [121]) as

[TABLE]

$U(k),Y(k)$ are the DFT of $u,y$ evaluated at the frequency $f_{k}=\frac{kf_{s}}{N}$ , and $G(k,\theta),H(k,\theta)$ are the plant and noise transfer function evaluated at $f_{k}$ . In the prediction error framework, the parametric plant and noise model are simultaneously estimated by minimizing the cost fucntion $V_{N}(\theta)$ with respect to $\theta$ . It is shown [6] that this is the maximum likelihood formulation of the identification problem if the disturbing noise $v$ is Gaussian distributed.

An alternative is to replace the parametric noise model by a nonparametric measurement of the noise variance $\sigma_{v}^{2}(k)=|H(k,\theta)|^{2}$ . The variance $\sigma_{v}^{2}(k)$ is directly estimated from the data in a nonparametric preprocessing step, using periodic excitations [8, 121] (see “Nonparametric Noise and Distortion Analysis Using Periodic Excitations”), or starting from arbitrary excitations using the recently developed nonparametric estimation methods [122]. Observe that in this approach, the estimation of the noise model does not depend upon the plant model, so that a too simple plant model will not affect the noise model. The cost function becomes

[TABLE]

In the absence of structural model errors there is a full equivalence between both approaches [123, 8], the differences are mostly on the implementation side [124]. This picture changes completely if structural model errors are present.

The discussion in the section can be directly generalized to nonlinear systems by replacing the linear model $G(q,\theta)u(t)$ by the nonlinear model $h(u,\theta)$ . A further generalization would be to include also a nonlinear noise model to deal for example with process noise (see "Process Noise in Nonlinear System Identification").

Structural Model errors present

In the presence of structural model errors, the parametric noise model $|H(k,\theta)|^{2}$ will account for the power in both the disturbing noise $v(t)$ and the structural model errors $y_{\varepsilon}(t)$ . Moreover, the maximum likelihood motivation is no longer valid because the structural model errors are not independent of the input $u$ .

The nonparametric noise analysis based on periodic excitations will still estimate $\sigma_{v}^{2}(k)=|H(k,\theta)|^{2}$ , while for the advanced nonparametric methods [125, 126, 127] a combination of the disturbing noise variance and the mean squared structural model errors will be retrieved.

This raises the question what is the best choice for the weighting function in (S40), (S41), (S42) in the presence of structural model errors. It makes no sense to weight the structural model errors with the noise variance if the structural model errors dominate the noise. The weighting function should rather reflect in what frequency band larger structural model errors can be tolerated and where small structural model errors are needed. This is done by replacing the noise variance based weighting $H^{-1}(q,\theta)$ in (S40) or $1/\sigma_{v}^{2}(k)$ in (S41) by a predefined frequency weighting $L^{-1}(q)$ or $1/w^{2}(k)$ , chosen by the user, that reflects the most acceptable behavior of the structural model errors for the application in mind:

[TABLE]

Observe that, in contrast to (S40), the weighting filter $L^{-1}(q)$ does no longer depend on $\theta$ . In the frequency domain, the cost function (S43) is

[TABLE]

By taking the sum only over the frequencies of interest $f\in F$ , the fit is focused on the frequency band of interest.

Covariance matrix expressions

The covariance matrix of the parameter estimates $C_{\theta}$ for the linear-least-squares problem $y=K(u)\theta+w$ , where $y,w\in\mathbb{R}^{N\times 1}$ , $\theta\in\mathbb{R}^{n_{\theta}\times 1}$ , and $K(u)\in\mathbb{R}^{N\times n_{\theta}}$ is given by [6, 7, 8]

[TABLE]

No structural model errors present: $w(t)=v(t)$ and $y_{\varepsilon}(t)=0$

In this case, $w(t)$ is independent of $u$ , and (S45) converges for $N\rightarrow\infty$ to

[TABLE]

Making use of the independency of $w$ and $u$ , it follows that $w$ is independent of $K(u)$ , so that

[TABLE]

and the covariance matrix becomes

[TABLE]

For $w$ being white noise, $C_{w}=\sigma^{2}_{w}I$ , and

[TABLE]

Structural model errors present: $w(t)=v(t)+y_{\varepsilon}(t)$

In the structural model error case, $K(u)$ and $w$ depend both upon the input $u$ , so that the independency is lost, and $E_{u,w}[K(u)^{T}w^{T}wK(u)]\neq E_{u}[K(u)^{T}E_{w}[w^{T}w]K(u)]$ . In that case, higher order moments of $u$ show up in the calculation of $E_{u,w}[K(u)^{T}w^{T}wK(u)]$ , and the general expression (S46) should be used. Since the dependency of $w$ on $u$ is not known (the structural model error is unknown), it becomes in general impossible to get closed form expressions for the covariance matrix $C_{\theta}$ . This creates a huge problem because the classical but simplified expression (S49) underestimates the variability, providing the user with a far too optimistic estimate of the uncertainty on the estimates [17]. Estimating the variability directly from a set of repeated experiments with varying inputs is a pragmatic solution to this problem, but provides of course no closed form expressions [17].

Example 1: linear approximation of $y=u^{n}$ It is shown in [128] that the underestimation of the variance is maximal for static nonlinear systems. These can be approximated arbitrary well in mean square sense using a polynomial representation. For that reason, the study of static nonlinear systems $y_{0}(t)=u(t)^{n}$ is very informative. For such a system, excited with Gaussian noise, the best linear approximation $G_{BLA}$ is constant (see "Linear Models of Nonlinear Systems: More on the best linear approximation $G_{BLA}$ ") [129, 130, 131, 132, 8], and hence, $G_{BLA}(q,\theta)=a_{BLA},$ which is only different from zero for $n$ odd. It is also possible to explicitly calculate the ratio between the full nonlinear-induced variance (structural model errors present, input dependent residuals) and the classical variance (no structural model errors, input independendent residuals) of $\hat{a}_{BLA}$ [128]

[TABLE]

This shows that the underestimation of the variance by (S49) grows with the nonlinear degree $n$ .

Example 2: the forced Duffing oscillator A linear approximating model is estimated to the silverbox from the experimental data (tail part) discussed in "Simulation errors and prediction errors. The tail is split in 10 sub-records with a length of 8692 points, and each of these is used to identify a second-order discrete-time plant model and a sixth-order noise model using the Box-Jenkins model structure of the prediction-error method [6, 7]. The estimated second-order plant transfer function, is shown in Figure S21. The estimation procedure resulted in the plant and noise model. From this information it is possible in the structural model error free system identification approach to obtain also an estimate of the variance on the results. In Figure S21, the estimated standard deviation of the transfer function is compared with the sample standard deviation that is calculated from the repeated estimates on the 10 subrecords. Both curves look very similar, but the model-based estimated value (green) underestimates the actual observed standard deviation (red) by 50% or more. This is due to the fact that the system identification framework fails to estimate precisely the uncertainty in the presence of nonlinear structural model errors. The user should keep in mind that, whenever structural model errors are present, the confidence bounds are wrong.

Example 3: Wind tunnel experiment In Figure S22, the under estimation of the variability in the presence of nonlinear model errors is illustrated on a wind tunnel experiment. In this experiment [133], the transfer function of the best linear approximation (see “Linear Models of Nonlinear Systems”) is measured from the forced displacement (a random phase multisine) [8] at the root of a wing mounted in a wind tunnel (a,b), to the acceleration of the tip of the wing (b). The simplified variance analysis (neglecting the dependency of $w$ on the input) and the actual observed variance obtained from different realizations of the experiment are shown. The simplified analysis under estimates the actual observed standard deviation with 11 dB (about a factor 3).

Optimal strategy to generate simplified models

A key issue in system identification is how to cope with high system complexity. Sometimes structural model errors are unavoidable because too complex models are needed to include the system in the model class. In other situations structural model errors are deliberately created because simple models are needed in the next phase of the (control) design process. In that case the experiment can be designed such that some complex system behavior is concealed and the simple model still performs well on the domain of interest [12].

Identifying simplified models leads to structural model errors and brings all the questions discussed before into the picture. This raises also the question for the best strategy: 1) Identify first the most complex model that is affordable (reduce the structural model errors as much as possible), followed by a model reduction step, or 2) Identify directly a too simple model, dealing directly with the structural model errors.

In Appendix VI: A Separation Principle of [12] it is shown, using the Maximimum Likelihood Invariance principle, that the first strategy is the best choice to follow, if it is affordable, because it results in an asymptotically efficient estimate (smallest variance) that is consistent (retrieve the ’true’ value if the number of data goes to infinity). Moreover, it allows the user to separate the identification and model reduction step: use the maximum likelihood framework in the first step (resulting in efficient estimates with the lowest uncertainty), followed by a model reduction step using an application oriented cost function. This two step approach allows the user also to make a proper characterization of the reliability of the simplified model. For that reason this is the best strategy whenever it is affordable.

However, often this first option is not affordable, and then the only solution is to identify directly a simple model in a setting with structural model errors. In that case, the following user guidelines help to face that situation.

User guidelines: how to deal with structural model errors?

•

Experiment design: The experiment should come as close as possible to the future applications so that the structural model errors are guaranteed to be small under these conditions. For example, the best simplified model can be completely different for a Gaussian or a sine excitation. The same holds true for varying power spectra, amplitude ranges, and operating points.

•

Choice of the cost function: When structural model errors dominate, the maximum likelihood paradigm that proposes a cost function based on the disturbing noise properties is no longer the natural choice. The cost function should reflect the users needs and express where smaller structural model errors are needed and larger structural model errors can be tolerated. This can be done by using for example a relative error in the cost, or by proposing a user defined weighting function.

•

Frequency weighting: The frequency weighting of the errors should no longer reflect the disturbing noise variance when the structural model errors $y_{\epsilon}$ dominate the disturbing noise $v$ . Instead, the weighting should be chosen to make sure that the structural model errors remain small in the frequency band of interest.

•

Did we loose 50 years our time? From the discussions in this section, the reader could get the impression that the past efforts on the development of a system identification framework are in vane when structural model errors show up. This is certainly not true! The lessons learned from the classical system identification approach should not be forgotten, the clearly structured picture that is provided in the classical text books like [6, 7, 8] is still valid at full power and provides still a road-map how to organize the system identification process.

Improper data handling can over-emphasize small noise disturbances, making the identification process again vulnerable to noise disturbances that are far below the structural model error level. For that reason the user is still strongly advised to make a clear split between the experiment design, the model class selection, the choice of the cost function, and the choice of the numerical procedures used to minimize the cost function. The consistency analysis developed in the structural model error free framework still provides insight about the convergence of the algorithms in the presence of structural model errors. The major open issue that is still unsolved today is how to generate reliable uncertainty bounds in the presence of structural model errors.

•

Uncertainty bounds: The covariance matrix that is generated by the structural model error free system identification framework is wrong in the presence of structural model errors. This is still like that if the noise model is tuned to include also the structural model errors. The covariance matrix underestimates the true value because it does not account for the dependency between the structural model errors and the input. This is well in line with the rule of thumb: do not pay any attention to the model’s uncertainty bounds if the validation test fails. At this moment there is no theoretical framework available that can provide more reliable bounds. For that reason the user is advised to repeat the experiment for different excitation signals (for example different realizations of a random excitation) and to look directly to the variability of the results under these conditions (see Figure S21) [17].

Sidebar

– Linear Models of Nonlinear Systems

Nonlinear models are clearly much more versitile and complicated than linear models. A general linear model can be represented and fully characterized by a Transfer function $G(s)$ which is the Laplace transform of the model’s impulse response $g(\tau)$ :

[TABLE]

(In discrete time the transfer function $G(z)$ is the $Z$ -transform of the impulse response.)

Due to the simplicity of linear models, it is tempting and common to work linear approximations of nonlinear systems. Possibly one can work with several linear models to capture different aspects of the nonlinear system. Two ways of defining linear approximations will be used:

Linearization around an Equilibrium

Consider a general nonlinear state space model:

[TABLE]

Suppose the input is constant: $u(t)=u^{*}$ , and that there is a corresponding state equilibrium $x^{*};\quad f(x^{*},u*)=0$ . Define the deviations

[TABLE]

By expanding the nonlinear functions $f,h$ in Taylor series around $x^{*},u^{*},y^{*}$ and neglecting terms of higher order, we obtain a linear state space equation

[TABLE]

So in the vicinity of the equilibrium, the nonlinear system (S52) can be approximated by the linear transfer function $G(s)=C(sI-A)^{-1}B$ . This is a well known and commonly used linearization. Corresponding expressions apply in discrete time.

Stochastic Linearization

Suppose the nonlinear system (S52) is excited by an input $u$ with a spectrum $\Phi_{u}(\omega)$ , and the cross-spectrum between output and input $\Phi_{yu}(\omega)$ is well defined. The second order properties of the input and output signals (i.e the covariance functions and spectra) are thus well defined. For this input, define the linear model [135]:

[TABLE]

that has the same second order properties as the nonlinear system. By considering second order signal properties (for this input) the nonlinear system (S52) thus cannot be distinguished from the linear model $G_{BLA}$ , They are second order equivalents.

This also means that a linear model is estimated with standard linear identification methods (that only use the second order propertires of the signals) the estimate will converge to $G_{BLA}$ . For that reason, the linear second order equivalent is also called BLA: Best Linear Approximation. Note that the BLA of a nonlinear system will depend on the input signal spectrum!

Note also that the definition of spectra does not require a stochastic setting. It is sufficient that the following limita exist (in discrete time)

[TABLE]

for the input -output signals [135], and correspondingly for $\Phi_{u}$ .

More on The best linear approximation $G_{BLA}$

In this section, the best linear approximation model is further analyzed. The reader is referred to [15] for a more extensive introduction.

The best linear approximation $G_{BLA}$ in (S55) represented either by its impulse response $g_{BLA}(t)$ , or its FRF $G_{BLA}(\omega)$ is the solution of [8, 136, 122]

[TABLE]

with $q$ the shift operator for a discrete time model. Similar expressions can be given for continuous time models. All expected values $E\left\{\right\}$ are taken with respect to the random input $u(t)$ . In most applications, the DC-value of the input and output signal should be removed in order to obtain a model that is valid around a given setpoint.

The transfer function $G_{BLA}(k)$ depends on the characteristics of the input signal. Changing the power spectrum or the amplitude distribution (for example, replacing a Gaussian by a uniform distribution) of the excitation will change the BLA.

The nonlinear ’noise’ source $y_{s}(t)$ : The difference between the output of the nonlinear system and that of the BLA $y_{s}(t)=y(t)-G_{BLA}(q)u(t)$ is called the stochastic nonlinear contribution or nonlinear noise. Although this name might be misleading (the error is deterministic for a given input signal), it is still preferred to call it a stochastic contribution because it looks very similar to a noise disturbance for a random excitation [137, 8]. Because $y_{s}(t)$ is the residual of a least squares fit, it is uncorrelated with the input. However, in general, it is still dependent on the input. The properties of $G_{BLA}$ and $Y_{s}$ are well known for Gaussian and Rieman-equivalent [122] excitations.

A new paradigm: Combining both results, the output of the nonlinear system can be written as

[TABLE]

where $y_{s}(t)$ is uncorrelated put dependent on the input $u(t)$ .

Experimental illustration on the Duffing oscillator: In this example, measurements on the forced Duffing oscillator are shown (see [15] for more details). The FRF is measured for 4 different excitation levels and shown in Figure S23. For each excitation level, the FRF is averaged over 50 realizations of the input signal to obtain a smoother result. Two observations can be made: i) The resonance frequency shifts to the right for increasing excitation levels, and ii) the measurements become more noisy. Both effects are completely due to the nonlinear distortions.

The BLA of a static nonlinearity: a simple example: The best linear approximation of a static nonlinearity excited with a Gaussian excitation is a constant [129]. For example, the BLA of $y=u^{3}$ is

[TABLE]

Observe that

[TABLE]

is uncorrelated but dependent upon $u$ . This is a general valid observation.

Impact of process noise on the BLA: Process noise $w(t)$ coming into the system before the nonlinearity creates mixing terms with the input signal so that at the output of the system the process noise contributions are no longer independent of input. This will also affect the BLA, and a generalization of the framework is needed. The reader is referred to [138] for a full discussion, here the simple example (S59) is extended as an illustration. Assuming that the process noise $w(t)$ is independent of the input $u(t)$ , the BLA becomes

[TABLE]

This shows that the ’averaged’ behavior of a nonlinear system is strongly affected by the presence of process noise entering the system before the nonlinearity.

Summary

•

Best linear approximation $G_{BLA}$ : It is possible to identify a simplified representation, for example the linear model $G_{BLA}$ , for a nonlinear system.

•

Stochastic nonlinearities $y_{s}$ : For a random excitation, the structural nonlinear model errors, called stochastic nonlinearities $y_{s}$ , look like noise. $y_{s}(t)$ is uncorrelated but not independent of the input $u(t)$ . For an untrained user, it is very hard to distinguish process or measurement noise from $y_{s}$ (the nonlinear model errors). Different actions are needed to deal properly with both effects (see [15]).

•

$G_{BLA}$ depends on the nature of the input: The best simplified approximation ( $G_{BLA}$ ) depends on the nature of the excitation. Changing the power spectrum or input distribution can change $G_{BLA}$ .

•

Structure detection: The variations of $G_{BLA}$ for varying experimental conditions provide insight on the structure of the nonlinear system. For example, moving poles (resonances) are only possible for nonlinear closed loop systems. The reader is referred to [113] for a full overview.

•

Process noise: Unmeasured random inputs (process noise) coming into the system before the nonlinearity can affect the output in a systematic way. The BLA with respect to the known input depends on the process noise properties.

Sidebar

Process Noise in Nonlinear System Identification

Process noise $w(t)$ , as defined in (5), affects also the system’s internal signals and not only the measurements. In this section, the impact of process noise is studied in more detail. The process noise $w(t)=H_{w}(q)e_{w}(t)$ is considered to be a zero mean, white or colored, noise variable.

Process noise in linear systems: It is well known that for linear systems, the effects of process noise can be collected at the output as an additive noise term $H_{w}(q)w(t)$ , where it is combined with the measurement noise $v(t)=H_{v}(q)e(t)$ into one noise term:

[TABLE]

The disturbances $e_{w}(t),e_{v}(t)$ are mutually independent distributed, and also to be independent of the input $u(t)$ . Under these conditions, the combined effect of the process and the measurement noise result in a zero mean distributed output disturbance that is independent of the input.

Process noise in nonlinear systems: In nonlinear systems, process noise can have a structural impact on the identified models as was illustrated in (6) and (S61). Zero mean process noise can change for example the linear gain of a nonlinear system, and the apparent disturbances at the output can become nonstationary and depend upon the input as discussed in more detail later in this section.

Process noise: curse or blessing?

Curse: In most applications, process noise is very disturbing. It is more difficult to control a system in the presence of process noise. Also the system identification methods become much more involved as explained below. For that reason, process noise is mostly considered as a very annoying effect.

Blessing: However, in some applications process noise is used to reduce or linearize the averaged effect of abrupt nonlinearities, this is called dithering. Desired effects of dithering are augmenting the linearity of the open or closed loop system, increasing the robustness and asymptotic stability [130, 139, 140]. Dither can be used to reduce the effect of Coulomb friction, dead zones in hydraulic valves valves and hysteresis effects, but this can come with an increased wear due to the rapid motions [140]. Dithering is also employed in digital instrumentation and data acquisition systems in order to improve their measurement-related characteristics. A typical example is the use of dithering in an analog-to-digital converter (ADC) to improve the resolution, dynamic range, and spectral purity below the quantization level [141].

Detection of the presence of process noise: The dependence of the process noise output $y_{p}(t)$ on the input $u(t)$ can be used to detect it, even in the presence of measurement noise. In [142], periodic non-stationary input signals are used. Assuming that a periodic input results in a periodic output, the process and measurement noise $\tilde{v}(t)$ can be estimated as the non-periodic part of the output. Next the variance $\sigma_{\tilde{v}}(t)$ is estimated. Assuming that the measurement noise is stationary, the presence of the process noise is revealed by a time-varying variance as illustrated in Figure S24.

Impact of process noise on the system identification problem: The presence of process noise increases the complexity of the system identification problem significantly. The nonlinear operations change the distribution of the process noise $w(t)$ , and can turn a least squares problem formulation at the output of the system to be very inefficient and even strongly biased. For that reason, the output error framework, that assumes that all noise enters at the output of the system has to be abandoned. Instead, the procedure should start from the joint (Gaussian) distribution of $w(t),v(t)$ . The discussion in the following is focused on the identification of a Wiener system to set the ideas. Following the ideas in [144], the likelihood function can be easily generated using the intermediate nuissance variable $x(t)$ in Figure S25 from the following probability density function

[TABLE]

The calculation of this integral becomes very difficult because it is not possible to eliminate $x(t),t=1,\ldots,N$ analytically, with $N$ the number of data points. Today, dedicated numerical methods are developed to deal with these high dimensional integrals. The reader is referred to “Identifying nonlinear dynamical systems in the presence of process noise” for a first introduction.

Summary

•

Process noise: The process noise contributions to the output of a nonlinear system may no longer Gaussian distributed and independent of the input.

•

Detection: Applying nonstationary periodic excitations, it is possible to detect the presence of process noise.

•

Identification: If no process noise is detected, the simpler output error formulation can be used to identify the nonlinear model. If the process noise turns out to be large, more advanced identification approaches are needed to guarantee consistent and efficient estimates.

Sidebar

Identifying Nonlinear Dynamical Systems in the Presence of Process Noise

This section was written by Thomas B. Schön, Department of Information Technology, Uppsala University, Sweden.

A rather general formulation to be used when identifying nolinear dynamical systems is arguably provided by the nonlinear state-space model which represent a system with input signal $u_{t}$ and output signal $y_{t}$ in terms of a latent Markovian state $x_{t}$ ,

[TABLE]

Here the nonlinear functions $f(\cdot)$ and $h(\cdot)$ represent the dynamics and the measurements, respectively. The variables $w_{t}$ and $v_{t}$ describe the process noise and the measurement noise, respectively. Finally, the unknown static model parameters are denoted by $\theta$ and the initial state is given by $x_{0}\sim p(x_{0})$ for some distribution $p(\cdot)$ .

The problem to be solved is the identification of the unknown parameters $\theta$ in (S64) based on observed inputs $u^{T}=\{u_{t}\}_{t=1}^{T}$ and the corresponding outputs $y^{T}=\{y_{t}\}_{t=1}^{T}$ .

The maximum likelihood formulation amounts to

[TABLE]

where the nature of the intractability of the likelihood is revealed by

[TABLE]

To stress the point that the process noise enters this integral, the following alternative integral can be considered:

[TABLE]

where it is clear that the process noise enters the integral via the term $p(x_{t}\mid x_{t-1})=p_{w_{t}}(x_{t}-f(x_{t-1},u_{t-1};\theta))$ .

More specifically, the challenge lies in that the predictive state distribution $p(x_{t}\mid y^{t-1},\theta)$ cannot be explicitly computed, and approximations have to be made. The sequential Monte Carlo (SMC) methods [145] like particle filters and smoothers can be used to compute this distribution arbitrarily well. These methods were introduced in the beginning of the 1990s [146], but it is still a research area which is very much alive with new results emerging all the time. The idea behind these methods is to maintain an empirical distribution

[TABLE]

made up of samples $\{x_{t}^{i}\}_{i=1}^{N}$ (sometimes referred to as particles) and their corresponding weights $\{W_{t}^{i}\}_{i=1}^{N}$ . Here $\delta_{x_{t}^{i}}(x_{t})$ denotes the Dirac delta. The SMC methods describe how to update these weights over time in such a way that the estimate (S68) converge to the true underlying as the number of samples $N\rightarrow\infty$ . The theoretical basis underpinning these methods is by now quite extensive, an entry-points into this literature is [147, 145]. What is perhaps most relevant for our present discussion is that the SMC method is capable of producing unbiased estimators of the likelihood [147, 148]. The likelihood estimate is obtained by inserting (S68) into (S66).

Based on the above, noisy unbiased estimates of the cost function in the maximum likelihood problem can be computed. As with any optimization problem—especially when faced with a stochastic optimization problem as it is here—it is easier to solve if there are also gradients available. The SMC method can be used for this as well, see e.g. [149, 150]. Driven by deep learning, the state of the art when it comes to solving stochastic optimization problems is evolving quite rapidly at the moment, see e.g. [151].

As an alternative to the above solution, the expectation maximization (EM) method [152] can be used to solve (S65) in the presence of process noise, see e.g. [153, 154]. EM is an iterative algorithm that based on a current iterate $\theta_{k}$ computes an approximation of the so-called intermediate quantity

[TABLE]

which is then maximized to find the next iterate $\theta_{k+1}$ . This procedure is then repeated until convergence and it is guaranteed to stop at a stationary point of the likelihood surface. The SMC method is key here as well, in that it allows us to approximate the smoothing distribution $p(x^{T}\mid y^{T},\theta_{k})$ in (S69) arbitrarily well.

The Bayesian approach is of course also highly interesting for nonlinear system identification in general [155] and for the problem when there is process noise present in particular. The slight variation to the above is that now the unknown parameters are assumed to be random variables instead and rather than computing a point estimator (like (S65)) the posterior distribution $p(\theta\mid y^{T})$ is computed. The breakthrough came in 2010 with the introduction of the so-called particle Markov chain Monte Carlo (PMCMC) methods [156].

Tutorial overviews of how to identify nonlinear state-space models using SMC are provided by [157, 158]. Concrete system identification examples were there is significant process noise available are provided in for example [153, 159, 160].

Sidebar

Black Box Models Complexity:

Keeping the Exploding Number of Parameters under Control; Increased Structural Insight; Model Reduction

Black box modeling is a very flexible method that requires little or no physical insight of the user. This comes with the cost of an exploding number of model parameters for a growing complexity, leading to an increased risk of overfitting [6, 7, 8]. Regularization and data driven structure retrieval tools are developed to keep this number of growing parameters or their effect on the modeled output under control. Both approaches are discussed below.

Regularization

The most simple approach to obtain a simplified model is to set the ’least significant’ parameters equal to zero in a model pruning step using manual trial and error methods. Regularization methods replace manual tuning by automatic procedures [161, 162]. The basic idea is to add an additional term $R(\theta)$ to the cost function (S43), imposing an extra constraint on the parameters

[TABLE]

The choice of $R$ sets the behavior of the regularized solution, leading to sparse or smooth solutions.

Sparse models: Putting $R(\theta)=\lambda|\theta|$ leads to LASSO [163], a method that puts as many parameters as possible equal to zero which leads to sparse models.

Smooth models: A quadratic regularization $R(\theta)=\theta^{T}P^{-1}_{\theta}\theta$ leads to a milder regularization than LASSO. The regularization term pulls the estimates toward zero, resulting in a reduced output variance at a cost of an increased bias. An optimal bias/variance balance is made that minimizes the mean square error of the output as illustrated in Figure S26.

The success of this approach strongly depends on a proper choice of the regularization matrix $P$ that reflects the additional user knowledge or user desires, also called prior information, that is added on top of the data. It can be based on physical insight, or impose for example the user desire for a smooth solution. The reader is referred to [161]. Observe that in this approach the number of parameters is not reduced, but instead their freedom to vary independently is restricted, leading to a smaller “effective number of parameters”.

Imposing smooth and exponentially decaying solutions is illustrated for nonparametric Volterra models in the examples section on Black Box Volterra Model of the Brain [89].

Data driven structure retrieval

An alternative approach to reduce the number of model parameters is to impose more structure on the model. In black box modeling, the structural information should be retrieved from the data rather than using physical information. The explosive growth of the number of parameters in nonlinear black box modeling is due to the parameterization of the multivariate nonlinear function (8) that is present in every nonlinear model. Retrieving more efficient presentations of the function $F\in R^{n_{q}\times n_{p}}$ is a key to reduce the number of parameters, and keep the model flexibility under control. Recently, decoupling methods are developed that allow the multivariate function $F$ to be written as a combination of linear transformations and a well selected set of single-input single-output (SISO) nonlinear functions as shown in Figure S27. In “Decoupling of Multivariate Polynomials”, a tensor based decoupling approach is explained in more detail.

A decoupled representation offers major advantages:

•

The combinatorial grow of the number of parameters is reduced to a linear grow as a function of the complexity. For example, if $F$ is a multivariate polynomial of degree $d$ , the number of parameters drops from $n_{q}O(n_{p}^{d})$ to $n_{q}O(rd)$ , with $r$ the number of internal SISO branches.

•

The decoupled representation of $F$ is easier to interprete, giving more intuitive access to the nonlinear behavior of the system. For example, it is much simpler to plot a set of SISO nonlinear functions than to make a graphical representation of a high dimensional multivariate nonlinear function because only slices of the multivariate function can be shown.

•

Tuning the number of branches in the decoupled representation is not only a tool to reduce the flexibility of the model, it can also be used to make a user selected balance between structural model errors and model complexity which opens the road for model complexity reduction.

Applications: The decoupling approach can be applied on a variety of problems. In the “Extensive case study: The forced Duffing oscilator”, a detailed illustration of the decoupling approach is given. An early application of the decoupling strategy [164] proposes a tensor based decoupling method to decouple on a separate basis the Volterra kernels of different degree. Next, this idea was further refined and applied to the identification of parallel Wiener and parallel Wiener-Hammerstein block-oriented models [165, 166, 167]. The full decoupling method [168] that is described in this section is used in [169] to decouple a nonlinear state space model for the Bouc-Wen hysteresis problem. Recently, the ideas were further generalized to apply the decoupling idea to the pruning of NARX models [100]. Eventually, the decoupling ideas are also directly applied to the design of decoupled controllers [170].

Decoupling of Multivariate Polynomials

Representations that increase the structural insight, and reduce the number of model parameters are most welcome. The approach that is briefly described in this section is discussed in full detail in [168]. The starting point is a set of multivariate basis functions, for example polynomials, that needs to be unraveled into a simplified structure. The cross-links among input variables are ‘decoupled’ into single-variable functions by considering linear transformations at the input and output of the static non-linearity. These transformations reveal internal variables between which univariate relations hold. Consider the variables $\textbf{p},\textbf{q}$ and the multivariate vector function f with proper dimensions.

[TABLE]

The entry $i$ of g is a univariate function $\textbf{g}_{i}=g_{i}(x_{i})$ , $i=1,\ldots,r$ , with $\textbf{x}=V^{\top}\textbf{p}$ . In this way, the model can be given again a physical/intuitive interpretation while at the same time the number of parameters decreases. A graphical representation is given in Figure S27, each univariate function is called a branch, and $r$ is the number of branches that is used in the decoupled presentation. In [171] an exact decomposition method is proposed to obtain a decoupled representation.

Example: Exact decomposition of a multivarite polynomial Consider the polynomials $f_{1}(p_{1},p_{2})$ and $f_{2}(p_{1},p_{2})$ of total degree $d=3$ , given as

[TABLE]

The equations (S72)–(S73) can be represented under the following decoupled structure:

[TABLE]

revealing the internal univariate polynomials and the linear transformations at the input and output of the structure.

Uniqueness: It is shown that the decoupled models are not (always) unique [168]. Hence decoupling will not lead to ’the’ physical underlying representation, but nevertheless it will still provide an increased intuitive insight in the behavior of the system.

Exact and approximate decomposition

In an exact decoupling of an arbitrary multivariate function, the number of branches $r$ might become very large. A truncated decoupling, with a reduced number of branches can be used to approximate the multivariate function. This problem is studied in [172]. The approximation error is tuned using a weighted least squares criterion. The number of branches $r$ can be used as a handle to balance the complexity of the model against the level of the tolerated structural model errors. This is illustrated on the forced Duffing oscilator in Figure S12, and on the Bouc-Wen model in Figure S28 that are reported in detail in [169].

Relation with neural networks The structure of the decoupled representation in Figure S27 is very similar to that of a neural network as shown in Figure S29. In a neural network, the univariate functions belong all to the same family (for example sigmoids or hinge functions). It is known that neural nets are universal approximators [173, 48, 174] that converges as $O(1/\sqrt{r})$ . In the decoupling method, the univariate functions follow from the decoupling and are tuned to the specific problem. This additional degree of freedom will reduce the number of branches that is needed in the decoupled representation of the multivariate function.

Sidebar

Software Support

The algorithms needed for nonlinear model definition, estimation and analys are typically quite complex, and need sophisticated computer software support. Several program packages for such support is publically available, some for free download and some for license agreements. Here some such packages will be briefly described.

The System Identification Toolbox for use with Matlab

The System Identification Toolbox, [76] is a commercially available program package, distributed by MathWorks Inc. It contains implementations of several of the methods/models discussed in this article

•

idnlarx is a model object that handles the NARX model (22) with several possible nonlinearties.

•

idhw handles the Hammerstein-Wiener block-oriented model in Fig 5,

•

idnlgrey treats nonlinear state-space grey-box models (20) where $\textflorin$ and $h$ have to be programmed by the user either in MATLAB or C-code.

Static nonlinearities (S1) can be chosen for example as

•

pwlinear: (S2)

•

poly1d: (S5)

•

sigmoidnet: (S9)

•

treepartition: (S12)

•

customnet: (S4)

for use in idnlarx and idnlhw. All model validation and evaluation commands like, compare (for cross validation), resid (for (linear) residual analysis), sim, predict, forecast (for simulation and forecasting) are available analogous to linear models, and can be used also to compare linear and nonlinear models in the same command.

Nonparametric Noise Analysis

The freely available frequency-domain identification toolbox FDIDENT can be used to make the nonparametric noise and distortion analysis (“Nonparametric Noise and Distortion Analysis Using Periodic Excitations”) (http://home.mit.bme.hu/~kollar/fdident/). This toolbox also includes the tools to design the random-phase multisines (27) and to perform the nonparametric nonlinear analysis. In [58], all the procedures for the noise and distortion analysis that are presented in this article are discussed in full detail, and the related Matlab software can be freely downloaded from booksupport.wiley.com.

Nonlinear State Space Model

A package to identify a polynomial nonlinear state space model (see “Extensive Case Study: The Forced Duffing Oscilator”) is freely available at http://homepages.vub.ac.be/ jschouk/.

Nonlinear Benchmark Website

The website http://www.nonlinearbenchmark.org/ hosts many experimental data sets that are well documented. For each benchmark, a detailed description of the setup and the experiments is given. In addition, a list of references is provided to publications that process these data. The data for “Extensive Case Study: The Forced Duffing Oscilator” are available on the this website.

Acknowledgement

This work was supported in part by the Fund for Scientific Research (FWO-Vlaanderen), the Vrije Universiteit Brussel (VUB), and by the ERC advanced grant SNLSID, under contract 320378. The authors thank Julien Ertveldt (Vrije Universiteit Brussel) for the results on the wind tunnel experiments, including Figure S22; Dhammika Widanalage and Julian Stouv (Vrije Universiteit Brussels) for the wet-clutch results shown in Figure S19; Gaëtan Kerschen and Jean-Philippe Noël (University of Liège, and Bart Peeters (Siemens Product Life-cycle Management Belgium for the results on the ground vibration test shown in Figure 2; Torbjrön Wigren (University Uppsala), Maarten Schoukens (Vrije Universiteit Brussel), and Keith Worden and his team (University of Sheffield) for the results on the cascaded water tanks, including Figure 10; Oliver Nelles (University of Siegen) and his team for the High Pressure Fuel Supply System example, including Figures 18, 19, and 20; Martijn Vlaar and his team (Technical University Delft), and Georgios Birpoutsoukis (Vrije Universiteit Brussel) for the results on the brain modeling (Figure 26); Rishi Relan (Vrije Universiteit Brussel) for the results on the battery model in Figures 29 and 30; Erliang Zhang (Zhengzhou University) and Maarten Schoukens (Vrije Universiteit Brussel) for the results on process noise detection in Figure S24; Alireza Fakhrizadeh Esfahani (Vrije Universiteit Brussel) and Jean-Philippe Noël (University of Liège) for the results on the Bouc-Wen hysteresis problem (Figure S28).

Author Information

Johan Schoukens received both the Master’s degree in electrical engineering in 1980, and the PhD degree in engineering sciences in 1985 from the Vrije Universiteit Brussel (VUB), Brussels, Belgium. In 1991 he received the degree of Geaggregeerde voor het Hoger Onderwijs from the VUB, and in 2014 the degree of Doctor of Science from The University of Warwick. From 1981 to 2000, he was a researcher of the Belgian National Fund for Scientific Research (FWO-Vlaanderen) at the Electrical Engineering Department of the VUB, From 2000 to 2018 he was a full-time professor in electrical engineering, since 2018 he is emiritus professor at the department INDI of the VUB and at the Department of Electrical Engineering of the TU/e (The Netherlands). From 2009 to 2016, he was visiting professor at the Katholieke Universiteit Leuven (Belgium). His main research interests include system identification, signal processing, and measurement techniques. He has been a Fellow of IEEE since 1997. He was the recipient of the 2002 Andrew R. Chi Best Paper Award of the IEEE Transactions on Instrumentation and Measurement, the 2002 Society Distinguished Service Award from the IEEE Instrumentation and Measurement Society, and the 2007 Belgian Francqui Chair at the Université Libre de Bruxelles (Belgium). Since 2010, he is a member of Royal Flemish Academy of Belgium for Sciences and the Arts. In 2011 he received a Doctor Honoris Causa degree from the Budapest University of Technology and Economics (Hungary). Since 2013, he has been an honorary professor of the University of Warwick.

Lennart Ljung received his Ph.D. in Automatic Control from Lund Institute of Technology in 1974. Since 1976 he is Professor of the chair of Automatic Control In Linköping, Sweden. He has held visiting positions at Stanford and MIT and has written several books on System Identification and Estimation. He is an IEEE Fellow, an IFAC Fellow and an IFAC Advisor. He is a member of the Royal Swedish Academy of Sciences (KVA), a member of the Royal Swedish Academy of Engineering Sciences (IVA), a member of Academia Europaea, an Honorary Member of the Hungarian Academy of Engineering, an Honorary Professor of the Chinese Academy of Mathematics and Systems Science, and a Foreign Member of the US National Academy of Engineering (NAE). He has received honorary doctorates from the Baltic State Technical University in St. Petersburg, from Uppsala University, Sweden, from the Technical University of Troyes, France, from the Catholic University of Leuven, Belgium and from Helsinki University of Technology, Finland. In 2002 he received the Giorgio Quazza Medal from IFAC and in 2017 he received the Nathaniel B Nichols medal also from IFAC. In 2003 he received the Hendrik W. Bode Lecture Prize from the IEEE Control Systems Society, and he was the 2007 recipient of the IEEE Control Systems Award. He was honored with the Great Gold Medal from the Royal Swedish Academy of Engineering Sciences in 2019.

Bibliography174

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] L.A. Sadeh. From circuit theory to system theory. Proceedings of the IRE , pages 856–865, 1962.
2[2] K.J. str m and P. Eykhoff. System identification-a survey. Automatica , 7:123–167, 1971.
3[3] G.E.P. Box and G.M. Jenkins. Time Series Analysis, Forecasting and Control . Holden-Day, 1970.
4[4] K.J. Åström. Introduction to Stochastic Control Theory . Academic Press, 1970.
5[5] P. Eykhoff. System Identification and State Estimation . Wiley, 1974.
6[6] L. Ljung. System Identification: Theory for the User . Prentice Hall, Upper Saddle River, New Jersey, 1987 (2nd edition in 1999).
7[7] T. Söderström and P. Stoica. System Identification . Prentice Hall International (UK) Ltd, Hemel Hempstead, Hertfordshire, 1989.
8[8] R. Pintelon and J. Schoukens. System Identification: A Frequency Domain Approach . Wiley-IEEE Press, Hoboken, New Jersey, 2nd edition, 2012.