Modulated Neural ODEs
Ilze Amanda Auzina, Çağatay Yıldız, Sara Magliacane, Matthias Bethge, Efstratios Gavves

TL;DR
Modulated Neural ODEs introduce learned, time-invariant modulator variables to enhance the modeling of trajectory variations and improve generalization and forecasting in dynamic systems.
Contribution
The paper proposes a novel framework, MoNODEs, that separates static factors from dynamics in NODEs using learned modulator variables, enhancing existing models.
Findings
MoNODEs improve generalization to new dynamic parameters.
MoNODEs enhance far-horizon forecasting accuracy.
Modulator variables are informative of true factors of variation.
| Model | Reference | Latent dynamics | Time-invariant modulator | Temporal data |
|---|---|---|---|---|
| Neural ODE (NODE) | Chen et al. [2018] | ✓ | ✗ | ✓ |
| Augmented NODE (ANODE) | Dupont et al. [2019] | ✗ | ✗ | ✗ |
| Neural Controlled DE (NCDE) | Kidger et al. [2020] | ✓ | ✗ | ✗ |
| Second Order NODE (SONODE) | Norcliffe et al. [2020] | ✗ | ✗ | ✓ |
| Latent SONODE (LSONODE) | Yildiz et al. [2019] | ✓ | ✗ | ✓ |
| NODE Processes (NODEP) | Norcliffe et al. [2021] | ✓ | ✗ | ✓ |
| Heavy Ball NODE (HBNODE) | Xia et al. [2021] | ✗ | ✗ | ✓ |
| Modulated *NODE | this work | ✓ | ✓ | ✓ |
| Model | Sinusoidal data | Sinusoidal data | Predator-prey data | Predator-prey data |
|---|---|---|---|---|
| NODE | 0.13 (0.03) | 1.84 (0.70) | 0.85 (0.08) | 23.81 (2.29) |
| MoNODE (ours) | 0.04 (0.01) | 0.29 (0.11) | 0.74 (0.05) | 4.33 (0.19) |
| SONODE | 2.19 (0.15) | 3.05 (0.07) | 15.80 (0.75) | 40.92 (0.82) |
| MoSONODE (ours) | 0.05 (0.01) | 0.35 (0.10) | 1.46 (0.28) | 6.70 (0.83) |
| HBNODE | 0.16 (0.02) | 3.36 (0.33) | 0.88 (0.10) | 3346.62 (2119.24) |
| MoHBNODE (ours) | 0.05 (0.01) | 0.65 (0.30) | 0.94 (0.09) | 10.21 (1.43) |
| | NODE | MoNODE |
|---|---|---|
| Sine | 0.90 | 0.99 |
| PP | -1.35 | 0.39 |
| BB | -0.29 | 0.58 |
| Model | Bouncing Ball | Rot.MNIST | Mocap | Mocap-shift |
|---|---|---|---|---|
| NODE | 0.039 (0.003) | | | |
| MoNODE | 0.030 (0.001) | | | |
| | Validation MSE | | | |
|---|---|---|---|---|
| | 1 | 3 | 5 | 10 |
| HBNODE | 0.164 | 0.135 | 0.144 | 0.066 |
| Dataset | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Sinusoidal Data | 300 | 50 | 50 | 50 | 50 | 150 | 0.1 | 0.2 |
| Lotka-Volterra | 600 | 100 | 100 | 100 | 100 | 300 | 0.1 | 0.3 |
| Rotating MNIST | 2000 | 100 | 100 | 16 | 45 | 45 | 0.1 | 0.0 |
| Bouncing ball | 1000 | 100 | 100 | 20 | 20 | 40 | 0.1 | 0 |
| dataset | model | | | | | | lr | # parameters |
|---|---|---|---|---|---|---|---|---|
| sin | NODE | 10 | - | 8 | - | - | 0.002 | 24666 |
| | MoNODE | 3 | 10 | 4 | 4 | - | 0.002 | 24598 |
| | SONODE | 10 | - | 2 | - | - | 0.002 | 21802 |
| | MoSONODE | 10 | 10 | 2 | 4 | - | 0.002 | 23346 |
| | HBNODE | 10 | - | 8 | - | - | 0.002 | 24404 |
| | MoHBNODE | 10 | 10 | 4 | 4 | - | 0.002 | 24938 |
| Prey-predator | NODE | 40 | - | 16 | - | - | 0.002 | 28022 |
| | MoNODE | 8 | 40 | 8 | 8 | - | 0.002 | 26976 |
| | SONODE | 40 | - | 4 | - | - | 0.002 | 29204 |
| | MoSONODE | 40 | 40 | 4 | 8 | - | 0.002 | 31382 |
| | HBNODE | 40 | - | 16 | - | - | 0.002 | 26586 |
| | MoHBNODE | 10 | 40 | 8 | 8 | - | 0.002 | 26744 |
| Rotating MNIST | NODE | 5 | - | 32 | - | - | 0.001 | 513561 |
| | MoNODE | 5 | 15 | 16 | 16 | - | 0.001 | 558681 |
| mocap | NODE | 75 | - | 24 | - | - | 0.002 | 45112 |
| | MoNODE | 75 | 75 | 8 | 8 | 8 | 0.002 | 51360 |
| mocap-shift | NODE | 75 | - | 24 | - | - | 0.002 | 45112 |
| | MoNODE | 75 | 75 | 8 | 8 | 8 | 0.002 | 51360 |
| Bouncing ball | NODE | 5 | - | 212 | - | - | 0.001 | 162343 |
| | MoNODE | 5 | 5 | 8 | 4 | - | 0.001 | 150479 |
| Dataset | Model | Position Encoder | Velocity Encoder | Modulator P. Network | Differential Function | ODE Solver | Decoder |
|---|---|---|---|---|---|---|---|
| sin | NODE | RNN | - | - | MLP | rk4 | MLP |
| prey-predator | MoNODE | RNN | - | RNN | MLP | rk4 | MLP |
| mocap | SONODE | - | MLP | - | MLP | rk4 | - |
| mocap-shift | MoSONODE | - | MLP | RNN | MLP | rk4 | - |
| rotating mnist | NODE | CNN | - | - | MLP | rk4 | CNN |
| | MoNODE | CNN | - | RNN | MLP | rk4 | CNN |
| Bouncing ball | NODE | CNN | CNN | - | MLP | rk4 | CNN |
| | MoNODE | CNN | CNN | CNN | MLP | rk4 | CNN |
| model | seq. length | euler | rk4 | dopri5 |
|---|---|---|---|---|
| NODE | | 0.13 (0.03) | 0.13 (0.03) | 0.19 (0.09) |
| MoNODE | | 0.07 (0.02) | 0.04 (0.01) | 0.05 (0.01) |
| NODE | | 2.48 (1.26) | 1.84 (0.70) | 2.11 (0.56) |
| MoNODE | | 0.42 (0.21) | 0.29 (0.11) | 0.31 (0.04) |
| Data | Model | Velocity Encoder | | | | Test MSE (std) | Test MSE (std) |
|---|---|---|---|---|---|---|---|
| Sinusoidal Data | MoNODE | - | 3 | 10 | 10 | 0.04 (0.01) | 0.29 (0.11) |
| | | - | 10 | 10 | 10 | 0.05 (0.01) | 0.37 (0.12) |
| | SONODE | MLP | 5 | - | 10 | 2.18 (0.06) | 3.05 (0.02) |
| | | MLP | 10 | - | 4 | 1.81 (0.01) | 3.05 (0.07) |
| | | RNN | 10 | - | 10 | 2.19 (0.15) | 3.05 (0.07) |
| | MoSONODE | MLP | 5 | 10 | 10 | 0.04 (0.00) | 0.29 (0.04) |
| | | MLP | 10 | 10 | 4 | 0.04 (0.00) | 0.32 (0.06) |
| | | RNN | 10 | 10 | 10 | 0.05 (0.01) | 0.35 (0.10) |
| Predator-Prey | MoNODE | - | 8 | 40 | 10 | 0.74 (0.05) | 4.33 (0.19) |
| | | - | 40 | 40 | 10 | 0.57 (0.04) | 23.05 (1.49) |
| | SONODE | MLP | 10 | - | 10 | 15.801 (0.748) | 40.921 (0.816) |
| | | MLP | 40 | - | 2 | 5.130 (0.221) | 39.260 (3.367) |
| | | RNN | 40 | - | 10 | 5.101 (0.339) | 41.890 (8.444) |
| | MoSONODE | MLP | 10 | 40 | 10 | 1.459 (0.284) | 6.695 (0.828) |
| | | MLP | 40 | 40 | 2 | 1.093 (0.059) | 6.342 (0.982) |
| | | RNN | 40 | 40 | 10 | 1.294 (0.322) | 8.639 (1.468) |
| | HBNODE | RNN | 40 | - | 10 | 0.879 (0.096) | 3346.625 (2119.239) |
| | | RNN | 10 | - | 10 | 13.803 (0.138) | 2833.051 (3254.703) |
| | MoHBNODE | RNN | 40 | 40 | 10 | 0.870 (0.088) | 225.012 (274.485) |
| | | RNN | 10 | 40 | 10 | 0.943 (0.092) | 10.205 (1.429) |
| model | | | # parameters | Test MSE (std) | Test MSE (std) |
|---|---|---|---|---|---|
| NODE | 16 | - | 461161 | 0.020 | 0.098 |
| MoNODE | 6 | 10 | 513637 | 0.031 | 0.034 |
| | 8 | 8 | 516089 | 0.031 | 0.032 |
| NODE | 24 | - | 487361 | 0.014 | 0.062 |
| MoNODE | 8 | 16 | 532481 | 0.030 | 0.031 |
| | 12 | 12 | 537385 | 0.035 | 0.038 |
| NODE | 32 | - | 513561 | 0.013 | 0.042 |
| MoNODE | 8 | 24 | 548873 | 0.035 | 0.037 |
| | 12 | 20 | 553777 | 0.033 | 0.033 |
| | 16 | 16 | 558681 | 0.031 | 0.032 |
| model | | | | mocap | mocap-shift |
|---|---|---|---|---|---|
| NODE | 6 | - | - | | |
| | 12 | - | - | | |
| | 24 | - | - | | |
| MoNODE | 3 | 3 | - | | |
| | 6 | 6 | - | | |
| | 12 | 12 | - | | |
| | 3 | - | 3 | | |
| | 6 | - | 6 | | |
| | 12 | - | 12 | | |
| | 2 | 2 | 2 | | |
| | 4 | 4 | 4 | | |
| | 8 | 8 | 8 | | |
Modulated Neural ODEs
Ilze Amanda Auzina
University of Amsterdam
Çağatay Yıldız
University of Tübingen
Tübingen AI Center
Sara Magliacane
University of Amsterdam
MIT-IBM Watson AI Lab
Matthias Bethge
University of Tübingen
Tübingen AI Center
Efstratios Gavves
University of Amsterdam
Abstract
Neural ordinary differential equations (NODEs) have proven useful for learning non-linear dynamics of arbitrary trajectories. However, current NODE methods capture variations across trajectories only via the initial state value or by auto-regressive encoder updates. In this work, we introduce Modulated Neural ODEs (MoNODEs), a novel framework that sets apart dynamics states from underlying static factors of variation and improves the existing NODE methods. In particular, we introduce time-invariant modulator variables that are learned from the data. We incorporate our proposed framework into four existing NODE variants. We test MoNODE on oscillating systems, videos and human walking trajectories, where each trajectory has trajectory-specific modulation. Our framework consistently improves the ability of existing models to generalize to new dynamic parameterizations and to perform far-horizon forecasting. In addition, we verify that the proposed modulator variables are informative of the true unknown factors of variation as measured by $R^2$ scores.
1 Introduction
Differential equations are the de facto standard for learning dynamics of biological [Hirsch et al., 2012] and physical [Tenenbaum and Pollard, 1985] systems. When the observed phenomenon is deterministic, the dynamics are typically expressed in terms of ordinary differential equations (ODEs). Traditionally, ODEs have been built from a mechanistic perspective, in which states of the observed system and the governing differential equation with its parameters are specified by domain experts. However, when the parametric form is unknown and only observations are available, neural network surrogates can be used to model the unknown differential function; the resulting models are called neural ODEs (NODEs) [Chen et al., 2018]. Since their introduction, an abundance of research has used NODE-type models for learning differential equation systems and extended them by introducing a recurrent neural network (RNN) encoder [Rubanova et al., 2019, Kanaa et al., 2021], a gated recurrent unit [De Brouwer et al., 2019, Park et al., 2021], or a second-order system [Yildiz et al., 2019, Norcliffe et al., 2020, Xia et al., 2021].
Despite these recent developments, all of the above methods have a common limitation: any static differences in the observations can only be captured in a time-evolving ODE state. This modelling approach is not suitable when the observations contain underlying factors of variation that are fixed in time, yet can (i) affect the dynamics or (ii) affect the appearance of the observations. As a concrete example, consider human walking trajectories. The overall motion, walking, is shared across all subjects. However, every subject might exhibit person-specific factors of variation, which in turn could either affect the motion exhibited, e.g. length of legs, or could be a characteristic of a given subject, e.g. color of a shirt. A modelling approach that is able to (i) distinguish the dynamic variables (e.g. position, velocity) from the static variables (e.g. height, clothing color), and (ii) use the static variables to modulate the motion or appearance, is advantageous because it leads to improved generalization across people. As an empirical confirmation, we show experimentally that the existing NODE models fail to separate the dynamic factors from the static factors. This inevitably leads to overfitting and thus negatively affects the model's generalization to new dynamics, as well as far-horizon forecasting.
As a remedy, this work introduces a modulated neural ODE (MoNODE) framework, which separates the dynamic variables from the time-invariant variables. We call these time-invariant variables modulator variables and we distinguish between two types: (i) static modulators that modulate the appearance; and (ii) dynamics modulators that modulate the time-evolution of the latent dynamical system (for a schematic overview see Fig. 1). In particular, MoNODE adds a modulator prediction network on top of a NODE, which allows the modulator variables to be computed from data. We empirically confirm that our modular framework boosts existing NODE models by achieving improved future predictions and improved generalization to new dynamics. In addition, we verify that our modulator variables are more informative of the true unknown factors of variation by obtaining higher $R^2$ scores than NODE. As a result, the latent ODE state of MoNODE has a sequence-to-sequence correspondence with the true observations. Our contributions are as follows:
- We extend neural ODE models by introducing modulator variables that allow the model to preserve core dynamics while being adaptive to modulating factors.
- Our modulator framework can easily be integrated into existing NODE models [Chen et al., 2018, Yildiz et al., 2019, Norcliffe et al., 2020, Xia et al., 2021].
- As we show in our experiments on sinusoidal waves, predator-prey dynamics, bouncing ball videos, rotating images, and motion capture data, our modulator framework consistently leads to improved long-term predictions as measured by lower test mean squared error (an average of 55.25% improvement across all experiments).
- Lastly, we verify that our modulator variables are more informative of the true unknown factors of variation than NODE, as we show in terms of $R^2$ scores.
While introduced in the context of neural ODEs, we believe that the presented framework can benefit also stochastic and/or discrete dynamical systems, as the concept of dynamic or mapping modulation is innate for most dynamical systems. We conclude this work by discussing the implications of our method on learning object-centric representations and its potential Bayesian extensions. Our official implementation can be found at https://github.com/IlzeAmandaA/MoNODE.
2 Background
We first introduce the basic concepts for Neural ODEs, following the notation by Chen et al. [2018].
Ordinary differential equations
Multivariate ordinary differential equations are defined as
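$$\dot{\mathbf{x}}(t) = \frac{d\mathbf{x}(t)}{dt} = \mathbf{f}(\mathbf{x}(t)),$$

where $t$ denotes time, the vector $\mathbf{x}(t) \in \mathbb{R}^{d}$ captures the state of the system at time $t$, and $\dot{\mathbf{x}}(t)$ is the time derivative of the state $\mathbf{x}(t)$. In this work, we focus on autonomous ODE systems, implying a vector-valued (time) differential function $\mathbf{f}: \mathbb{R}^{d} \to \mathbb{R}^{d}$ that does not explicitly depend on time. The ODE state solution $\mathbf{x}(t_1)$ is computed by integrating the differential function starting from an initial value $\mathbf{x}(t_0)$:

$$\mathbf{x}(t_1) = \mathbf{x}(t_0) + \int_{t_0}^{t_1} \mathbf{f}(\mathbf{x}(\tau))\, d\tau.$$

For non-linear differential functions $\mathbf{f}$, the integral generally does not have a closed-form solution and hence is approximated by numerical solvers [Tenenbaum and Pollard, 1985]. Due to the deterministic nature of the differential function, the ODE state solution $\mathbf{x}(t_1)$ is completely determined by the corresponding initial value $\mathbf{x}(t_0)$ if the function $\mathbf{f}$ is known.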
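As an illustration of how a numerical solver approximates the integral above, the following minimal sketch implements the classical fourth-order Runge-Kutta scheme in plain Python. The function names and the harmonic-oscillator test system are illustrative assumptions, not part of the paper's implementation (which relies on the torchdiffeq package).

```python
import math

def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for the autonomous ODE dx/dt = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def odeint_rk4(f, x0, t0, t1, n_steps=100):
    """Approximate x(t1) = x(t0) + integral of f(x(tau)) from t0 to t1."""
    h = (t1 - t0) / n_steps
    x = list(x0)
    for _ in range(n_steps):
        x = rk4_step(f, x, h)
    return x

def harmonic(x):
    # Illustrative system: x = [position, velocity], dx/dt = [velocity, -position].
    return [x[1], -x[0]]

# The exact solution from x(0) = [1, 0] is [cos(t), -sin(t)].
x_end = odeint_rk4(harmonic, [1.0, 0.0], 0.0, math.pi / 2)
```

Because the differential function is deterministic, calling `odeint_rk4` twice with the same initial value returns the same state, which mirrors the statement that the solution is fully determined by the initial value.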
2.1 Latent neural ODEs
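Chen et al. [2018] proposed neural ODEs for modeling sequential data $\mathbf{x}_{1:T}$, where $\mathbf{x}_i \in \mathbb{R}^{D}$ is the $D$-dimensional observation at time $t_i$, and $T$ is the sequence length. We assume known observation times $t_1, \ldots, t_T$. Being a latent variable model, NODE infers a latent trajectory $\mathbf{z}_{1:T}$ for an input trajectory $\mathbf{x}_{1:T}$. The generative model relies on random initial values, their continuous-time transformations, and finally an observation mapping from latent to data space:

$$
\begin{aligned}
\mathbf{z}_1 &\sim p(\mathbf{z}_1), \\
\mathbf{z}_i &= \mathbf{z}_1 + \int_{t_1}^{t_i} \mathbf{f}_{\theta}(\mathbf{z}(\tau))\, d\tau, \\
\mathbf{x}_i &\sim p_{\xi}(\mathbf{x}_i \mid \mathbf{z}_i).
\end{aligned}
$$

Here, the time differential $\mathbf{f}_{\theta}$ is a neural network with parameters $\theta$ (hence the name “neural ODEs”). Similar to variational auto-encoders [Kingma and Welling, 2013, Rezende et al., 2014], the “decoding” of the observations is performed by another non-linear neural network with a suitable architecture and parameters $\xi$.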
3 MoNODE: Modulated Neural ODEs
We begin with a brief description of the dynamical systems of our interest. Without loss of generality, we consider a dataset of $N$ trajectories with a fixed trajectory length $T$, where the $n$'th trajectory is denoted by $\mathbf{x}^{n}_{1:T}$. First, we make the common assumption that the data trajectories are generated by a single dynamical system (e.g., a swinging pendulum) while the parameters that modulate the dynamics (e.g., pendulum length) vary across the trajectories. Second, we focus on the more general setup in which the observations and the dynamics might lie in different spaces, for example, video recordings of a pendulum. A future video prediction would require a mapping from the dynamics space to the observation space. As the recordings might exhibit different lighting conditions or backgrounds, the mapping typically involves static features that modulate the mappings, e.g., a parameter specifying the background.
3.1 Our generative model
The generative model of NODE [Chen et al., 2018] involves a latent initial value $\mathbf{z}_1$ for each observed trajectory as well as a differential function and a decoder with global parameters $\theta$ and $\xi$. Hence, by construction, NODE can attribute discrepancies across trajectories only to the initial value, as the remaining functions are modeled globally. Subsequent works [Rubanova et al., 2019, Dupont et al., 2019, De Brouwer et al., 2019, Xia et al., 2021, Iakovlev et al., 2023] combine the latent space of NODE with additional variables; however, they likewise do not account for static, trajectory-specific factors that modulate either the dynamics or the observation mapping. The main claim of this work is that explicitly modeling the above-mentioned modulating variables results in better extrapolation and generalization abilities. To show this, we introduce the following generative model:

$$
\begin{aligned}
\mathbf{d} &\sim p(\mathbf{d}), \qquad \mathbf{s} \sim p(\mathbf{s}), \qquad \mathbf{z}_1 \sim p(\mathbf{z}_1), \\
\mathbf{z}_i &= \mathbf{z}_1 + \int_{t_1}^{t_i} \mathbf{f}_{\theta}(\mathbf{z}(\tau); \mathbf{d})\, d\tau, \\
\mathbf{x}_i &\sim p_{\xi}(\mathbf{x}_i \mid \mathbf{z}_i; \mathbf{s}).
\end{aligned}
$$

We broadly refer to $\mathbf{d}$ and $\mathbf{s}$ as dynamics and static modulators, and thus name our framework Modulated Neural ODE (MoNODE). We assume that each observed sequence $\mathbf{x}^{n}_{1:T}$ has its own modulators $\mathbf{d}^{n}$ and $\mathbf{s}^{n}$. As opposed to the ODE state $\mathbf{z}(t)$, the modulators are time-invariant.
We note that for simplicity, we describe our framework in the context of the initial neural ODE model [Chen et al., 2018]. However, our framework can be readily adapted to other neural ODE models such as second-order and heavy-ball NODEs, as we demonstrate experimentally. For a schematic overview, please see Fig. 1. Next, we discuss how to learn the modulator variables along with the global dynamics, encoder, and decoder.
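To make the role of the two modulators concrete, here is a minimal sketch of the generative process with toy stand-ins for the neural networks. The oscillator dynamics, the additive decoder, and all names (`f_theta`, `decoder`, `generate`) are assumptions chosen for illustration only; in the paper both functions are neural networks.

```python
import math

def f_theta(z, d):
    """Toy differential function of [z; d]: an oscillator whose angular frequency is d[0]."""
    return [z[1], -(d[0] ** 2) * z[0]]

def decoder(z, s):
    """Toy observation mapping of [z; s]: the position shifted by a static offset s[0]."""
    return z[0] + s[0]

def rk4_step(z, d, h):
    """One fourth-order Runge-Kutta step with the dynamics modulator d held fixed."""
    shift = lambda a, k, c: [ai + c * ki for ai, ki in zip(a, k)]
    k1 = f_theta(z, d)
    k2 = f_theta(shift(z, k1, 0.5 * h), d)
    k3 = f_theta(shift(z, k2, 0.5 * h), d)
    k4 = f_theta(shift(z, k3, h), d)
    return [zi + h / 6.0 * (a + 2.0 * b + 2.0 * c + e)
            for zi, a, b, c, e in zip(z, k1, k2, k3, k4)]

def generate(z1, d, s, n_steps, h=0.01):
    """Roll out the latent ODE from z1 and decode every state with the static modulator s."""
    z, xs = list(z1), []
    for _ in range(n_steps):
        xs.append(decoder(z, s))
        z = rk4_step(z, d, h)
    return xs

z1 = [0.0, 1.0]  # identical initial value for all three trajectories
x_a = generate(z1, d=[1.0], s=[0.0], n_steps=200)
x_b = generate(z1, d=[2.0], s=[0.0], n_steps=200)  # different dynamics modulator
x_c = generate(z1, d=[1.0], s=[5.0], n_steps=200)  # different static modulator
```

All three rollouts start from the same initial value, yet `x_b` evolves differently over time and `x_c` differs only in appearance. A model that can vary nothing but the initial value would have to encode both kinds of difference in the ODE state.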
3.2 Learning latent modulator variables
A straightforward approach to obtain time-invariant modulator variables is to define them globally and independently of each other and of the input sequences. While optimizing for such global variables works well in practice [Blei et al., 2017], it does not specify how to compute variables for an unobserved trajectory. As the focus of the present work is on improved generalization to unseen dynamic parametrizations, we estimate the modulator variables via amortized inference based on two encoder networks, one per modulator, which we detail in the following.
(i) Static modulator
To learn the unknown static modulator $\mathbf{s}^{n}$ that captures the time-invariant characteristics of the individual observations $\mathbf{x}^{n}_{1:T}$, we compute the average over the observation embeddings provided by a modulator prediction network $\mathbf{g}_{\psi}$ (e.g., a convolutional neural network) with parameters $\psi$:

$$\mathbf{s}^{n} = \frac{1}{T} \sum_{i=1}^{T} \mathbf{g}_{\psi}(\mathbf{x}^{n}_{i}).$$

By construction, $\mathbf{s}^{n}$ is time-invariant (or more rigorously, invariant to time-dependent effects) as we average over time. In turn, the decoder takes as input the concatenation of the latent ODE state $\mathbf{z}^{n}_{i}$ and the static modulator $\mathbf{s}^{n}$, and maps the joint latent representation to the observation space (similarly to Franceschi et al. [2020]):

$$\mathbf{x}^{n}_{i} \sim p_{\xi}\big(\mathbf{x}^{n}_{i} \mid [\mathbf{z}^{n}_{i}, \mathbf{s}^{n}]\big).$$

Note that the estimated static modulator $\mathbf{s}^{n}$ is fed as input for all time points within a trajectory.
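The averaging step can be sketched in a few lines. The per-frame embedding below is a hypothetical hand-written stand-in for the modulator prediction network (a CNN in the paper), and the function names are illustrative assumptions.

```python
def embed(frame):
    """Hypothetical per-frame embedding: mean and range of the pixel values."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]

def static_modulator(frames):
    """Average the per-frame embeddings over time to obtain a time-invariant code."""
    embeddings = [embed(x) for x in frames]
    n_frames = len(embeddings)
    return [sum(e[k] for e in embeddings) / n_frames
            for k in range(len(embeddings[0]))]

def decoder_input(z_t, s):
    """Concatenate the latent ODE state at one time point with the static modulator."""
    return list(z_t) + list(s)

frames = [[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
s = static_modulator(frames)
s_reversed = static_modulator(frames[::-1])  # same frames, reversed in time
joint = decoder_input([0.5], s)              # the same s is reused at every time point
```

Reversing the frame order leaves the modulator unchanged, which is the sense in which averaging over time makes the code invariant to time-dependent effects.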
(ii) Dynamics modulator
Unlike the static modulator, the dynamics modulator can only be deduced from multiple time points. For example, the dynamics of a pendulum depend on its length. To compute the length of the pendulum one must compute the acceleration, for which multiple position and velocity measurements are needed. Thereby, the dynamics modulators $\mathbf{d}^{n}$ are computed from subsequences of length $T_e$ from a given trajectory. To achieve time-invariance we likewise average over time:

$$\mathbf{d}^{n} = \frac{1}{T - T_e + 1} \sum_{i=1}^{T - T_e + 1} \mathbf{h}_{\phi}\big(\mathbf{x}^{n}_{i:i+T_e-1}\big),$$

where $\mathbf{h}_{\phi}$ is a modulator prediction network (e.g., a recurrent neural network) with parameters $\phi$. The differential function takes as input the concatenation of the latent ODE state $\mathbf{z}^{n}(t)$ and the estimated dynamics modulator $\mathbf{d}^{n}$. Consequently, we redefine the input space of the differential function to cover both the ODE state and the dynamics modulator, implying the following time differential:

$$\frac{d\mathbf{z}^{n}(t)}{dt} = \mathbf{f}_{\theta}\big(\mathbf{z}^{n}(t), \mathbf{d}^{n}\big).$$

The resulting ODE system resembles in a way the augmented neural ODE (ANODE) [Dupont et al., 2019]. However, their appended variable dimensions are constant and serve a practical purpose of breaking down the diffeomorphism constraints of NODE, while ours models time-invariant variables. We treat $T_e$ as a hyperparameter and choose it by cross-validation.
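The following sketch illustrates why a window of several observations is needed and how window-level estimates are averaged. The window encoder here is a hypothetical closed-form frequency estimator that is exact only for noise-free sinusoids; it stands in for the recurrent modulator prediction network, and all names are assumptions for illustration.

```python
import math

def window_feature(window, dt):
    """Hypothetical window encoder: least-squares angular-frequency estimate.

    For x_i = a*sin(w*t_i) the identity x[i-1] + x[i+1] = 2*cos(w*dt)*x[i] holds,
    so the frequency is recoverable from neighbouring points but not from one point.
    """
    inner = range(1, len(window) - 1)
    num = sum(window[i] * (window[i - 1] + window[i + 1]) for i in inner)
    den = sum(2.0 * window[i] ** 2 for i in inner)
    cos_wdt = max(-1.0, min(1.0, num / den))
    return math.acos(cos_wdt) / dt

def dynamics_modulator(x, window_len, dt):
    """Average the window encodings over all sliding windows of the trajectory."""
    windows = [x[i:i + window_len] for i in range(len(x) - window_len + 1)]
    features = [window_feature(w, dt) for w in windows]
    return sum(features) / len(features)

def modulated_f(z, d):
    """Toy differential function of the pair (z, d): an oscillator with frequency d."""
    return [z[1], -(d ** 2) * z[0]]

dt, omega, amplitude = 0.1, 2.0, 1.5
x = [amplitude * math.sin(omega * i * dt) for i in range(50)]
d_hat = dynamics_modulator(x, window_len=10, dt=dt)
dz = modulated_f([1.0, 0.0], d_hat)
```

The estimate `d_hat` does not depend on the amplitude or on where a window starts, so averaging over windows yields a single time-invariant value that is then passed to the differential function at every solver step.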
Optimization objective
The maximization objective of MoNODE is analogous to the evidence lower bound (ELBO) used for NODE [Chen et al., 2018], where we place a prior distribution on the unknown latent initial value and approximate it by amortized inference. Similar to previous works [Chen et al., 2018, Yildiz et al., 2019], the MoNODE encoder for the initial value takes a sequence of fixed length as input, where the length is a hyperparameter. We empirically observe that our framework is not sensitive to this length. The optimization of the two modulator prediction networks is implicit in that they are trained jointly with the other modules while maximizing the ELBO.
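For orientation, a generic ELBO of this kind can be written as below. This is a sketch of the standard variational form under the assumptions stated in the text (a prior on the initial value only, modulators computed deterministically by their prediction networks); the symbols $q$, $\mathbf{d}$, $\mathbf{s}$, $\theta$ and $\xi$ are notation chosen here, and the exact objective used by the authors may differ in details such as weighting.

```latex
\mathcal{L}
  = \mathbb{E}_{q(\mathbf{z}_1 \mid \mathbf{x}_{1:T})}
      \Big[ \sum_{i=1}^{T} \log p_{\xi}\big(\mathbf{x}_i \mid \mathbf{z}_i ; \mathbf{s}\big) \Big]
    - \mathrm{KL}\big[\, q(\mathbf{z}_1 \mid \mathbf{x}_{1:T}) \,\big\|\, p(\mathbf{z}_1) \,\big],
\qquad
\mathbf{z}_i = \mathbf{z}_1 + \int_{t_1}^{t_i} \mathbf{f}_{\theta}\big(\mathbf{z}(\tau) ; \mathbf{d}\big)\, d\tau .
```

The first term rewards accurate reconstruction along the whole trajectory, and the gradients of that term are what train the modulator prediction networks, since $\mathbf{d}$ and $\mathbf{s}$ enter only through the dynamics and the decoder.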
4 Related work
Neural ODEs
Since the neural ODE breakthrough [Chen et al., 2018], there has been a growing interest in continuous-time dynamic modeling. Such attempts include combining recurrent neural nets with neural ODE dynamics [Rubanova et al., 2019, De Brouwer et al., 2019], where latent trajectories are updated upon observations, as well as upon Hamiltonian [Zhong et al., 2019], Lagrangian [Lutter et al., 2019], second-order [Yildiz et al., 2019], or graph neural network based dynamics [Poli et al., 2019]. While our method MoNODE has been introduced in the context of latent neural ODEs, it can be directly utilized within these frameworks as well.
Augmented dynamics
Dupont et al. [2019] augment data-space neural ODEs with additional latent variables and test their method on classification problems. Norcliffe et al. [2021] extend neural ODEs to stochastic processes by means of stochastic latent variables, leading to NODE Processes (NODEP). By construction, NODEP embeddings are invariant to the shuffling of the observations. To the best of our knowledge, we are the first to explicitly enforce time-invariant modulator variables.
Learning time-invariant variables
The idea of averaging for invariant function estimation was used in [Kondor, 2008, van der Wilk et al., 2018, Franceschi et al., 2020]. Only the latter proposes using such variables in the context of discrete-time stochastic video prediction. Although relevant, their model involves two sets of dynamic latent variables, coupled with an LSTM and is limited to mapping modulation.
5 Experiments
To investigate the effect of our proposed dynamics and static modulators, we structure the experiments as follows. First, we investigate the effect of the dynamics modulator on classical dynamical systems, namely sinusoidal waves, predator-prey trajectories, and bouncing ball videos (section 5.1), where the parameterisation of the dynamics differs across trajectories. Second, to confirm the utility of the static modulator, we run an experiment on rotating MNIST digits (section 5.2), where the static content is the digit itself. Lastly, we experiment on real data with both modulator variables present for predicting human walking trajectories (section 5.4). In all experiments, we test whether our framework improves the performance of the base model on generalization to new trajectories and on long-horizon forecasting.
Implementation details
We implement all models in PyTorch [Paszke et al., 2017]. The encoder, decoder, differential function, and modulator prediction networks are all jointly optimized with the Adam optimizer [Kingma and Ba, 2014]. For solving the ODE system we use the torchdiffeq package [Chen, 2018]. We use the fourth-order Runge-Kutta (rk4) numerical solver to compute ODE state solutions (see App. D for ablation results for different solvers). For the complete details on data generation and training setup, we refer to App. B. Further, we report the architectures, number of parameters, and details on hyperparameters for each method in App. C Table 7.
Compared methods
We test our framework on the following models: (i) Latent neural ODE model [Chen et al., 2018] (NODE), (ii) Second-order NODE model [Norcliffe et al., 2020] (SONODE), (iii) Latent second-order NODE model [Yildiz et al., 2019] (LSONODE), (iv) current state-of-the-art, second-order heavy ball NODE [Xia et al., 2021] (HBNODE).
In order for SONODE and HBNODE to have a comparable performance with NODE we adjust the original implementation by the authors by changing the encoder architecture, while keeping the core of the models, the differential function, unchanged. For further details and discussion, see App. A. We do not compare against ANODE [Dupont et al., 2019] as the methodology is presented in the context of density estimation and classification. Furthermore, we performed preliminary tests with NODEP [Norcliffe et al., 2021]; however, the model predictions fall back to the prior in datasets with dynamics modulators. Hence, we did not include any results with this model in the paper as the base model did not have sufficiently good performance. Finally, we chose not to compare against Kidger et al. [2020] as their Riemann–Stieltjes integral relies on smooth interpolations between data points while we particularly focus on the extrapolation performance for which the ground truth data is not available. For an overview of the methods discussed see Table 1.
5.1 Dynamics modulator variable
To investigate the benefit of the dynamics modulator variable, we test our framework on three dynamical systems: sine wave, prey-predator (PP) trajectories, and bouncing ball (BB) videos. In contrast to earlier works [Norcliffe et al., 2020, Rubanova et al., 2019], every trajectory has a different parameterization of the differential equation. Intuitively, the goal of the modulator prediction network is to learn this parameterisation and pass it as an input to the dynamics function, which is modelled by a neural network.
5.1.1 Sinusoidal data
The training data consists of 300 oscillating trajectories of length 50. The amplitude and the frequency of each trajectory are each sampled from a uniform distribution. Validation and test data each consist of 50 trajectories with sequence lengths 50 and 150, respectively. We add noise to the data following the implementation of Rubanova et al. [2019].
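A data generator of this kind might look as follows. This is a sketch only: the amplitude and frequency ranges, the time step, and the Gaussian noise level are placeholder assumptions rather than the paper's settings, and the function name is hypothetical.

```python
import math
import random

def make_sine_dataset(n_traj, length, dt=0.1, amp_range=(1.0, 3.0),
                      freq_range=(0.5, 1.0), noise_std=0.1, seed=0):
    """Sample one amplitude and one frequency per trajectory, then add observation noise.

    All default values are illustrative placeholders, not the paper's settings.
    """
    rng = random.Random(seed)
    trajectories, factors = [], []
    for _ in range(n_traj):
        amp = rng.uniform(*amp_range)    # trajectory-specific amplitude
        freq = rng.uniform(*freq_range)  # trajectory-specific frequency
        traj = [amp * math.sin(freq * i * dt) + rng.gauss(0.0, noise_std)
                for i in range(length)]
        trajectories.append(traj)
        factors.append((amp, freq))
    return trajectories, factors

train_x, train_fov = make_sine_dataset(n_traj=5, length=50)
```

Returning the sampled `(amp, freq)` pairs alongside the trajectories keeps the true factors of variation available for later analysis, even though the model itself never observes them.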
The obtained test MSE demonstrates that our framework improves all aforementioned methods, namely, NODE, SONODE, and the state-of-the-art HBNODE (see Fig. 2, Fig. 11, and Fig. 12). In particular, our framework improves generalization to new dynamics and far-horizon forecasting as reported in Table 2 columns 2 and 3. In addition, modulating the motion via the dynamics modulator variable leads to interpretable latent ODE state trajectories, see Fig. 2. More specifically, the obtained latent trajectories have qualitatively a comparable radius topology to the observed amplitudes in the data space. By contrast, the latent space of NODE does not have such structure. Similar results are obtained for MoHBNODE (App. D, Fig. 12).
Training and inference times
To showcase that our proposed framework is easy to train, we plot the validation MSE versus wall-clock time during training for sinusoidal data (see App. Fig. 10). As is apparent from the figure, our framework is easier to train than all baseline methods. We further compute the inference time cost for the sinusoidal data experiment, where the test data consists of 50 trajectories of length 150. We record the time it takes NODE and MoNODE to predict future states while conditioned on the initial 10 time points, repeating the experiment ten times.
5.1.2 Predator-prey (PP) data
Next, we test our framework on the predator-prey benchmark [Rubanova et al., 2019, Norcliffe et al., 2021], governed by a pair of first-order nonlinear differential equations (also known as the Lotka-Volterra system, Eq. 28). The training data consists of 600 trajectories of length 100. For every trajectory the four parameters of the differential equation are sampled, so each trajectory specifies a different interaction between the two populations. Validation and test data each consist of 100 trajectories with sequence lengths 100 and 300, respectively. Similarly to the sinusoidal data, we add noise to the data following Rubanova et al. [2019].
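For readers unfamiliar with the system, the sketch below simulates the textbook form of the Lotka-Volterra equations with a fourth-order Runge-Kutta integrator. The parameter values, initial populations, and step size are arbitrary illustrative choices, and the paper's Eq. 28 (not reproduced in this section) may use a different parameterisation.

```python
import math

def lotka_volterra(state, alpha, beta, delta, gamma):
    """Textbook predator-prey dynamics: x is the prey, y the predator population."""
    x, y = state
    return [alpha * x - beta * x * y, delta * x * y - gamma * y]

def simulate(state, params, n_steps, h=0.01):
    """Integrate with fourth-order Runge-Kutta and return the whole trajectory."""
    shift = lambda s, k, c: [si + c * ki for si, ki in zip(s, k)]
    traj = [list(state)]
    for _ in range(n_steps):
        s = traj[-1]
        k1 = lotka_volterra(s, *params)
        k2 = lotka_volterra(shift(s, k1, 0.5 * h), *params)
        k3 = lotka_volterra(shift(s, k2, 0.5 * h), *params)
        k4 = lotka_volterra(shift(s, k3, h), *params)
        traj.append([si + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                     for si, a, b, c, d in zip(s, k1, k2, k3, k4)])
    return traj

def invariant(state, alpha, beta, delta, gamma):
    """Quantity conserved along exact solutions; useful as a correctness check."""
    x, y = state
    return delta * x - gamma * math.log(x) + beta * y - alpha * math.log(y)

params = (1.0, 0.5, 0.5, 2.0)  # (alpha, beta, delta, gamma): illustrative values only
traj = simulate([2.0, 1.0], params, n_steps=1000)
drift = abs(invariant(traj[-1], *params) - invariant(traj[0], *params))
```

Each choice of the four parameters defines a different closed orbit, so a model that sees only the initial populations cannot tell which orbit a trajectory will follow. That is the information the dynamics modulator is meant to supply.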
The results show that our modulator framework improves the test accuracy of the existing methods both for generalization to new dynamics and for forecasting, see Table 2. Moreover, examining the latent ODE state embeddings reveals that our framework results in more interpretable latent space embeddings also for PP data, see Fig. 3 for HBNODE and App. D Fig. 13 for NODE. In particular, the latent spaces of MoHBNODE and MoNODE capture the same amplitude relationship across trajectories as in observation space. For visualisation of the SONODE, MoSONODE, NODE, and MoNODE trajectories see App. D, Fig. 14, Fig. 13.
5.1.3 Bouncing ball (BB) with friction
To investigate the performance of our framework on video sequences we test it on a bouncing ball dataset, a benchmark often used in temporal generative modeling [Sutskever et al., 2008, Gan et al., 2015, Yildiz et al., 2019]. For data generation, we modify the original implementation of Sutskever et al. [2008] by adding friction to every data trajectory, where a friction coefficient is sampled for each trajectory. The friction slows down the ball by a constant factor and is to be inferred by the dynamics modulator. We use 1000 training sequences of length 20, and 100 validation and 100 test trajectories with lengths 20 and 40, respectively. Our framework improves predictive capability for video sequences as shown in Table 4. As visible in Fig. 15, the standard NODE model fails to predict the position of the object at further time points, while MoNODE corrects this error. For the second-order variants, LSONODE and MoLSONODE, we again observe that our framework improves the MSE.
5.1.4 Informativeness metric
Next, we quantify how much information latent representations carry about the unknown factors of variation (FoVs) [Eastwood and Williams, 2018], which are the parameters of the sinusoidal, PP, and BB dynamics. As described in [Schott et al., 2021], we compute $R^2$ scores by regressing from latent variables to FoVs. The regression inputs for MoNODE are the dynamics modulators, while for NODE they are the latent trajectories. Note that $R^2 = 1$ corresponds to perfect regression and $R^2 = 0$ indicates random guessing. Our framework obtains better $R^2$ scores on all benchmarks (Table 3), implying better generalization capability of our framework. Therefore, as stated in [Eastwood and Williams, 2018], our MoNODE is better capable of disentangling underlying factors of variation compared to NODE.
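The informativeness score can be illustrated with a one-dimensional least-squares regression. This is a simplified sketch: the regressor used in the paper may be different, and the toy latent codes and factor values below are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def informativeness(train_z, train_fov, test_z, test_fov):
    """Fit on training pairs, score on held-out pairs."""
    slope, intercept = fit_line(train_z, train_fov)
    return r2_score(test_fov, [slope * z + intercept for z in test_z])

# Toy case 1: the latent code is an affine function of the factor of variation.
r2_good = informativeness([3.0, 5.0, 7.0], [1.0, 2.0, 3.0], [4.0, 6.0], [1.5, 2.5])
# Toy case 2: the latent code carries no stable information about the factor.
r2_bad = informativeness([0.0, 1.0, 2.0, 3.0], [2.0, 1.0, 2.0, 1.0],
                         [0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 1.0, 2.0])
```

An $R^2$ of zero matches a predictor that always outputs the mean of the held-out targets, so negative values such as those reported for NODE in Table 3 arise when the fitted regression does worse than that baseline.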
5.2 Static modulator variable
To investigate the benefit of the static modulator variable, we test our framework on the Rotating MNIST dataset, where the dynamics are fixed and shared across all trajectories while the content varies. The goal of the modulator prediction network is to learn the static features of each trajectory and pass them as an input to the decoder network. The data is generated following the implementation by [Casale et al., 2018], where the total number of rotation angles is 16. We include all ten digits and the initial rotation angle is sampled from all possible angles.
The training data consists of 2000 trajectories of length 16, which corresponds to one cycle of rotation. Validation and test data each consist of 100 trajectories with sequence length 45. At test time, the model receives the initial time frames as input and predicts the full horizon. We repeat each experiment 3 times and report the mean and standard deviation of the MSEs computed on the forecasting horizon. For further training details see App. B, while for a complete overview of hyperparameters see App. C Table 7.
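The evaluation protocol described above can be sketched as follows. The helper name, the toy frames, and the use of the sample standard deviation are assumptions; the text does not state which standard deviation estimator is reported.

```python
from statistics import mean, stdev

def forecast_mse(pred, target, n_cond):
    """MSE over the forecasting horizon only, skipping the n_cond conditioning frames."""
    squared_errors = [(p - t) ** 2
                      for p_frame, t_frame in zip(pred[n_cond:], target[n_cond:])
                      for p, t in zip(p_frame, t_frame)]
    return sum(squared_errors) / len(squared_errors)

# Toy target of four two-pixel frames; the first two frames are given to the model.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
runs = [
    [[0.0, 0.0], [9.0, 9.0], [2.5, 2.5], [3.5, 3.5]],  # error inside the conditioning window is ignored
    [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
]
scores = [forecast_mse(run, target, n_cond=2) for run in runs]
mse_mean = mean(scores)
mse_std = stdev(scores)  # sample standard deviation across repeated runs (assumed)
```

Restricting the error to frames after the conditioning window means the score measures extrapolation rather than reconstruction of frames the model has already seen.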
The obtained test MSE confirms that the static modulator variable improves forecasting quality (), see Table 4 and App. D Fig. 16 for a qualitative comparison. In addition, the latent ODE states of MoNODE form a circular rotation pattern resembling the observed dynamics in data space, while for NODE no correspondence is observed, see App. D Fig. 16. Moreover, as shown in Fig. 4, the latent space of MoNODE captures the relative distances between the initial rotation angles, while that of NODE does not. The TSNE embeddings of the static modulator indicate a clustering per digit shape, see App. D Fig. 17. For additional results and discussion, see App. D.
5.3 Are modulators interchangeable?
To confirm the role of each modulator variable, we perform two additional ablations with the MoNODE framework: (a) sinusoidal data with the static modulator instead of the dynamics modulator, and (b) rotating MNIST with the dynamics modulator instead of the static modulator. We report the test MSE across three different initialization runs with standard deviation. For sinusoidal data with MoNODE + static modulator, the test MSE performance drops to from (MoNODE + dynamics modulator). For rotating MNIST, MoNODE + dynamics modulator performance drops to from (MoNODE + static modulator). In addition, we examine the latent embeddings of the dynamics modulator for rotating MNIST. Whereas for the static modulator we previously observed clusters corresponding to a digit's class, no such topology is present in the latent space of the dynamics modulator (see App. D Fig. 17). Taken together with Table 3, the results confirm that the modulator variables are correlated with the true underlying factors of variation and play their corresponding roles.
5.4 Real world application: modulator variables
Next, we evaluate our framework on a subset of the CMU Mocap dataset, which consists of 56 walking sequences from 6 different subjects. We pre-process the data as described by Wang et al. [2007], resulting in -dimensional data sequences. We consider two data splits in which (i) the training and test subjects are the same, and (ii) one subject is reserved for testing. We refer to the datasets as Mocap and Mocap-shift (see App. B for details). At test time, the model receives the first observations as input and predicts the full horizon ( time points). We repeat each experiment five times and report the mean and standard deviation of the MSEs computed on the full horizon. As shown in Table 4 and Fig. 5, our framework improves upon NODE. See Table 12 for ablations with different latent dimensionalities and with only one modulator variable (static or dynamics) present.
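To make the second split concrete, the sketch below groups sequence files by subject and reserves one subject for testing. It assumes that the subject identifier is the prefix before the underscore in the file names listed in App. B. The helper names, the short file list, and the choice of subject "16" as the held-out subject are purely illustrative, and the validation split is omitted.

```python
from collections import defaultdict

def subject_of(filename):
    """Subject identifier, assumed to be the prefix before the underscore."""
    return filename.split("_")[0]

def group_by_subject(filenames):
    """Map each subject identifier to its list of sequence files."""
    groups = defaultdict(list)
    for name in filenames:
        groups[subject_of(name)].append(name)
    return dict(groups)

def held_out_split(filenames, test_subject):
    """Reserve every sequence of one subject for testing."""
    train = [f for f in filenames if subject_of(f) != test_subject]
    test = [f for f in filenames if subject_of(f) == test_subject]
    return train, test

# A short excerpt of the file list, for illustration only.
files = ["07_01.amc", "07_02.amc", "08_02.amc",
         "16_15.amc", "16_16.amc", "35_01.amc"]

groups = group_by_subject(files)
# Hypothetical choice of held-out subject.
train_files, test_files = held_out_split(files, test_subject="16")
```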
6 Discussion
The present work introduces a novel modulator framework for NODE models that separates time-evolving ODE states from modulator variables. In particular, we introduce two types of modulating variables: (i) a dynamics modulator that can modulate the dynamics function and (ii) a static modulator that can modulate the observation mapping function. Our empirical results confirm that our framework improves generalization to new dynamics and far-horizon forecasting. Moreover, our modulator variables better capture the true unknown factors of variation as measured by score, and, as a result, the latent ODE states have an equivalent correspondence to the true observations.
Limitations and future work
The dynamical systems explored in the current work are limited to deterministic periodic systems that have different underlying factors of variation. The presented work introduces a framework that builds upon a base NODE model; hence, the performance of Mo*NODE is largely affected by the base model's performance. The current formulation cannot account for epistemic uncertainty and does not generalize to out-of-distribution modulators, because we maintain point estimates for the modulators. A straightforward extension would be to apply our framework to Gaussian process-based ODEs [Hegde et al., 2022] or to stochastic dynamical systems via an auxiliary variable that models the noise, similarly to Franceschi et al. [2020]. Likewise, rather than maintaining point estimates, the time-invariant modulator parameters could be inferred via marginal likelihood as in van der Wilk et al. [2018], Schwöbel et al. [2022], leading to a theoretically grounded Bayesian framework. Furthermore, the separation of the dynamic factors and modulating factors could be explicitly enforced via an additional self-supervised contrastive loss term [Grill et al., 2020] or by more recent advances in representation learning [Bengio et al., 2013]. Lastly, the concept of a dynamics modulator could also be extended to object-centric dynamical modeling [Kabra et al., 2021], which would allow accounting for per-object dynamics modulations while using a single dynamics model.
Acknowledgements
The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. Çağatay Yıldız is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-Number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 950086).
Appendix A Model Encoder Adjustments
SONODE
In the original implementation by Norcliffe et al. [2020], the initial velocity is computed by a 3-layer MLP with ELU activation functions, and only the initial time point, , is passed as input. For the initial position, the first observation time point is passed, e.g. . The resulting model fails to fit the sinusoidal data, see Fig. 6 for performance on validation trajectories.
As can be seen in Fig. 6, the model only fits the first data point correctly. Therefore, we extend the number of input time points passed to the model to in order to compute the initial velocity. This results in notable performance improvements, see Fig. 7.
Consequently, we increase the number of input frames used for SONODE and also investigate replacing the MLP architecture with an RNN. We report the resulting test MSEs in App. D Table 10. For the parameters used in the experiments, see Table 7.
HBNODE
In the original implementation by Xia et al. [2021], the encoder is a 3-layer MLP with tanh activation functions that autoregressively takes every ground truth data point as input, e.g. for and predicts the latent representation , which is also fed as input to the subsequent time points . The differential function is modeled by a linear layer projection. As the present work focuses on model forecasting capabilities, we adjust the original HBNODE to be compatible with problem set-ups where the ground truth data is not available. That is, once the model has reached the time point at which no more ground truth data is available (), the model only uses its own latent state predictions to compute the next latent state . As the authors of HBNODE claim that the model is designed to better fit long-term dependencies, we initially reduce the number of increments from 10 to 3 (see App. B for training details). In Fig. 8 we show the qualitative performance of HBNODE on validation trajectories. Even though the model perfectly fits the ground truth data within , it fails to extrapolate to future time frames. We investigate the cause of this issue by, first, changing the differential function from a linear projection to a 3-layer MLP with tanh activations, and, second, replacing the original autoregressive encoder with an RNN encoder that learns the initial position .
Increasing the modelling capacity of the differential function did not resolve the observed forecasting issue. This implies that the observed over-fitting of the model might be caused by its autoregressive encoder, which takes every ground truth time point as input. By replacing the autoregressive encoder with an RNN encoder, we obtain significant forecasting improvements, see Fig. 9. Therefore, in all our reported experiments we replace the autoregressive encoder with an RNN and perform additional tests on the number of data increments as well as the number of input frames used to compute the initial state . In our experiments, we find that brings the best model performance (see Table 5), and we subsequently test two numbers of input frames, 10 and 40. We report the obtained test MSE for each set-up across 3 model runs with different seeds in App. D Table 10, and for the final model parameter settings used in the experiments, see App. C.
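The forecasting-compatible adjustment of the autoregressive encoder described above amounts to a simple control flow: the latent state is updated with ground truth inputs while they exist, and afterwards only from the model's own previous latent state. The sketch below shows only this switch. The scalar functions `update_with_data` and `advance_without_data` are toy stand-ins chosen for illustration and are not the HBNODE architecture.

```python
def update_with_data(z, x):
    """Toy latent update that consumes a ground truth observation x."""
    return 0.5 * z + 0.5 * x

def advance_without_data(z):
    """Toy latent update that relies only on the previous latent state."""
    return 0.5 * z

def rollout(observations, n_total, z0=0.0):
    """Roll the latent state forward for n_total steps.

    Ground truth is used while available; afterwards the model
    continues from its own latent predictions.
    """
    z = z0
    latents = []
    for t in range(n_total):
        if t < len(observations):
            z = update_with_data(z, observations[t])
        else:
            z = advance_without_data(z)
        latents.append(z)
    return latents

# Two observed time points, followed by two forecast steps.
latents = rollout(observations=[1.0, 1.0], n_total=4)
```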
Appendix B Experiment details
All models for sine, PP and rotating MNIST have been trained on a single GPU (GTX 1080 Ti) with 10 CPUs and 30G memory.
Sine data
In the same vein as Rubanova et al. [2019], Norcliffe et al. [2021], we first define a set of time points . Then each sequence is generated as follows:
[TABLE]
where denotes the uniform distribution and is set to . We use to generate training trajectories of length , validation trajectories of length , and test trajectories of length . Test trajectories are longer to test model forecasting accuracy.
During training we set batch size to 10. We incrementally increase the sequence length of training trajectories starting from until . We perform this in 10 increments for NODE, MoNODE, HBNODE and MoHBNODE, and in 4 increments for SONODE and MoSONODE. All models are trained for 600 epochs. The model with lowest validation MSE is used for evaluation on test trajectories.
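The incremental training scheme above can be sketched as a per-epoch schedule of sequence lengths. This sketch assumes that the increments are spaced linearly between the initial and final length and that each stage lasts the same number of epochs; the text specifies neither. The function name `length_schedule` and the start and final lengths (10 and 50) are hypothetical placeholders, while the 10 increments and 600 epochs follow the sine-data setting.

```python
def length_schedule(start_len, final_len, n_increments, n_epochs):
    """Return the training sequence length to use at each epoch.

    Assumptions: equally long stages and linearly spaced lengths.
    """
    lengths = []
    for epoch in range(n_epochs):
        # Index of the current stage, from 0 to n_increments - 1.
        stage = min(epoch * n_increments // n_epochs, n_increments - 1)
        if n_increments == 1:
            length = final_len
        else:
            step = (final_len - start_len) / (n_increments - 1)
            length = start_len + round(stage * step)
        lengths.append(length)
    return lengths

# Hypothetical start and final lengths; 10 increments over 600 epochs.
schedule = length_schedule(start_len=10, final_len=50,
                           n_increments=10, n_epochs=600)
```

With fewer increments (for example, 4 for SONODE), the same function yields longer stages with larger jumps in sequence length.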
Predator-Prey data
To generate Predator-Prey trajectories we define the following generative process:
[TABLE]
where denotes the uniform distribution and is set to . We use to generate training trajectories of length , validation trajectories of length , and test trajectories of length . Test trajectories are longer to test model forecasting accuracy.
During training we set the batch size to 20. We incrementally increase the sequence length of training trajectories starting from until . We perform this in 10 increments for NODE, MoNODE, HBNODE and MoHBNODE, and in 2 increments for SONODE and MoSONODE. All models are trained for 1500 epochs. The model with lowest validation MSE is used for evaluation on test trajectories.
Rotating MNIST
For the data generation we base our code upon the image rotation implementation provided by [Solin et al., 2021]. We set the total number of rotation angles to and sample the initial rotation angle from all the possible angles . We generate training trajectories with length and validation and test trajectories with length .
During training we set the batch size to 25 and the learning rate to 0.002. We train all models for 400 epochs and, for evaluation, use the trained model with the lowest validation MSE.
Bouncing ball with friction
For data generation, we use the script provided by Sutskever et al. [2008]. As a small modification, each observed sequence has a friction constant drawn from a uniform distribution . We set the initial velocities to be unit vectors with random directions.
Motion capture
The walking sequences we consider are 07_01.amc, 08_03.amc, 35_02.amc, 35_12.amc, 35_34.amc, 39_08.amc, 07_02.amc, 16_15.amc, 35_03.amc, 35_13.amc, 38_01.amc, 39_09.amc, 07_03.amc, 16_16.amc, 35_04.amc, 35_14.amc, 38_02.amc, 39_10.amc, 07_06.amc, 16_21.amc, 35_05.amc, 35_15.amc, 39_01.amc, 39_12.amc, 07_07.amc, 16_22.amc, 35_06.amc, 35_28.amc, 39_02.amc, 39_13.amc, 07_08.amc, 16_31.amc, 35_07.amc, 35_29.amc, 39_03.amc, 39_14.amc, 07_09.amc, 16_32.amc, 35_08.amc, 35_30.amc, 39_04.amc, 07_10.amc, 16_47.amc, 35_09.amc, 35_31.amc, 39_05.amc, 07_11.amc, 16_58.amc, 35_10.amc, 35_32.amc, 39_06.amc, 08_02.amc, 35_01.amc, 35_11.amc, 35_33.amc, 39_07.amc. The numbers of training, validation and test samples in the Mocap and Mocap-shift splits are 46-5-5 and 43-5-8, respectively. Since the sequences are already dense, we skip every other data point. For ease of implementation, we take the last 300 time points, leading to sequences of length . We experiment with different latent dimensionalities and report the findings in Table 12.
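The two subsampling steps above can be sketched as plain slicing. The text does not state the order in which the steps are applied, so the hypothetical helper below exposes it as a flag; the two orders produce sequences of different length. Keeping the even-indexed frames when skipping every other data point is also an assumption.

```python
def preprocess_sequence(frames, subsample_first, stride=2, keep_last=300):
    """Skip every other frame and keep the last 300 time points.

    The order of the two steps is an assumption controlled by subsample_first.
    """
    if subsample_first:
        return frames[::stride][-keep_last:]
    return frames[-keep_last:][::stride]

# Toy sequence of 1000 frame indices standing in for pose vectors.
frames = list(range(1000))

subsampled_then_cropped = preprocess_sequence(frames, subsample_first=True)
cropped_then_subsampled = preprocess_sequence(frames, subsample_first=False)
```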
Appendix C Architecture and Hyperparameter Details
Appendix D Additional Results
References
- Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
- Blei et al. [2017] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.
- Casale et al. [2018] F. P. Casale, A. Dalca, L. Saglietti, J. Listgarten, and N. Fusi. Gaussian process prior variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.
- Chen et al. [2018] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
- Chen [2018] R. T. Q. Chen. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.
- De Brouwer et al. [2019] E. De Brouwer, J. Simm, A. Arany, and Y. Moreau. GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. Advances in Neural Information Processing Systems, 32, 2019.
- Dupont et al. [2019] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, 2019.
- Eastwood and Williams [2018] C. Eastwood and C. K. Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
