On Learning Hamiltonian Systems from Data

Tom Bertalan; Felix Dietrich; Igor Mezić; Ioannis G. Kevrekidis

arXiv:1907.12715·physics.comp-ph·February 5, 2020


TL;DR

This paper introduces a data-driven method to learn Hamiltonian systems from observations using neural networks and Gaussian processes, enabling the extraction of phase space and energy functions without assuming specific forms.

Contribution

It presents a novel neural network and Gaussian process framework for identifying Hamiltonian systems directly from data, capturing conserved quantities and system dynamics.

Findings

1. Successfully extracted phase space and Hamiltonian from pendulum videos

2. Demonstrated the approach on multiple illustrative examples

3. No prior assumptions on Hamiltonian form required


Tables

Table 1: Comparison of loss between training and held-out validation data. As the sizes of our generated datasets are always much larger than the numbers of trainable parameters, overfitting is not an issue, and the validation loss is always indistinguishable from the training loss. The first row includes the results from the Gaussian process approximation (a non-parametric method) using all training data for inference. Only the hyper-parameters $\epsilon$ and $\sigma$ needed to be adjusted. For the first row, the loss reported is the MSE of the Gaussian Process regression. For other rows, the loss is computed by either Eqn. (12) or Eqn. (14) as appropriate.

Section	# Parameters	# Training Points	# Validation Points	Training Loss	Validation Loss
§III.1	non-parametric	625	200	$2.2\cdot 10^{-5}$	$3.5\cdot 10^{-5}$
§III.2	337	20000	200	$6.8\cdot 10^{-4}$	$7.4\cdot 10^{-4}$
§IV.2	345	20000	1000	$5.5\cdot 10^{-6}$	$5.0\cdot 10^{-6}$
§IV.3	511	19200	200	$1.4\cdot 10^{-4}$	$1.8\cdot 10^{-4}$
§IV.4	755	17489	200	0.51	0.54

Equations

$$\dot{q}(t)$$

$$\dot{p}(t)$$

$$\frac{\mathrm{d}}{\mathrm{d}t}H(q,p)=\frac{\partial H}{\partial q}(q,p)\cdot\dot{q}+\frac{\partial H}{\partial p}(q,p)\cdot\dot{p}=0.$$

$$\underbrace{\begin{bmatrix}0&I\\-I&0\end{bmatrix}}_{\omega}\cdot\nabla H(q,p)-\nu(q,p)=0,$$

$$H(q,p)=\frac{p^{2}}{2}+(1-\cos(q)).$$

$$k(x,x^{\prime})=\exp\left(-\|x-x^{\prime}\|^{2}/\epsilon^{2}\right),$$

$$\mathbb{E}\left[\hat{H}(y)\mid X,H(X)\right]=k(y,X)^{T}k(X,X^{\prime})^{-1}H(X),$$

$$\begin{array}{rcl}\left.\frac{\partial}{\partial y}\hat{H}(y)\right|_{y=x_{i}}&\approx&g(x_{i}),\\ \iff\ \left.\frac{\partial}{\partial y}k(y,X)^{T}\right|_{y=x_{i}}k(X,X^{\prime})^{-1}H(X)&\approx&g(x_{i}).\end{array}$$

$$\frac{\partial}{\partial x}k(x_{i},Y):=\left.\frac{\partial}{\partial x}k(x,x^{\prime})\right|_{x=x_{i},\,x^{\prime}\in Y}$$

$$\underbrace{\begin{bmatrix}\frac{\partial}{\partial x}k(x_{1},Y)^{T}k(Y,Y^{\prime})^{-1}\\ \frac{\partial}{\partial x}k(x_{2},Y)^{T}k(Y,Y^{\prime})^{-1}\\ \vdots\\ \frac{\partial}{\partial x}k(x_{N},Y)^{T}k(Y,Y^{\prime})^{-1}\\ k(x_{0},Y)^{T}k(Y,Y^{\prime})^{-1}\end{bmatrix}}_{\in\mathbb{R}^{(2nN+1)\times M}}\cdot\underbrace{\begin{bmatrix}H(Y)\end{bmatrix}}_{\in\mathbb{R}^{M}}=\underbrace{\begin{bmatrix}g(X)\\ H_{0}\end{bmatrix}}_{\in\mathbb{R}^{2nN+1}}.$$

$$x_{l}=\sigma_{l}(x_{l-1}\cdot W_{l}+b_{l}),\quad l=1,\dots,L+1$$

$$f(q,p,\dot{q},\dot{p};\,w)=\sum_{k=1}^{4}c_{k}f_{k},$$

$$\begin{array}{rclrcl}f_{1}&=&\left(\frac{\partial\hat{H}}{\partial p}-\dot{q}\right)^{2},&f_{2}&=&\left(\frac{\partial\hat{H}}{\partial q}+\dot{p}\right)^{2},\\ f_{3}&=&\left(\hat{H}(q_{0},p_{0})-H_{0}\right)^{2},&f_{4}&=&\left(\frac{\partial\hat{H}}{\partial q}\dot{q}+\frac{\partial\hat{H}}{\partial p}\dot{p}\right)^{2},\end{array}$$

$$\begin{array}{rcl}f(\hat{q},\hat{p},\dot{\hat{q}},\dot{\hat{p}};\,w)&=&\sum_{k}c_{k}f_{k},\\ f_{1}&=&\left(\frac{\partial\hat{H}}{\partial p}-\dot{\hat{q}}\right)^{2},\\ f_{2}&=&\left(\frac{\partial\hat{H}}{\partial q}+\dot{\hat{p}}\right)^{2},\\ f_{3}&=&\left(\hat{H}(\hat{\mathbf{\theta}}^{-1}(x_{0},y_{0}))-H_{0}\right)^{2},\\ f_{4}&=&\left(\dot{\hat{H}}\right)^{2}=\left(\frac{\partial\hat{H}}{\partial\hat{q}}\dot{\hat{q}}+\frac{\partial\hat{H}}{\partial\hat{p}}\dot{\hat{p}}\right)^{2},\\ f_{5}&=&\|\hat{\mathbf{\theta}}(\hat{\mathbf{\theta}}^{-1}(x,y))-[x,y]\|^{2},\\ f_{6}&=&\left(\det(D\hat{\mathbf{\theta}}^{-1})\right)^{-2},\end{array}$$

$$\dot{\hat{q}}$$

$$\dot{\hat{p}}$$

$$\partial\hat{H}/\partial q$$

$$\partial\hat{H}/\partial p$$

$$\begin{array}{rclrcl}a&=&q/20,&b&=&p/10,\\ x&=&a+(b+a^{2})^{2},&y&=&b+a^{2},\end{array}$$

$$\begin{array}{rcl}p(\tau)&=&p(0)+\tau\cdot\left.\dot{p}\right|_{q(0),p(0)}\\ q(\tau)&=&q(0)+\tau\cdot\left.\dot{q}\right|_{q(0),p(\tau)}\end{array}$$


Full text

On Learning Hamiltonian Systems from Data

Tom Bertalan

Department of Mechanical Engineering, The Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

Felix Dietrich

Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland 21211, USA

Igor Mezić

Department of Mechanical Engineering, The University of California Santa Barbara, Santa Barbara, California 93106, USA

Ioannis G. Kevrekidis

Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland 21211, USA


Abstract

Concise, accurate descriptions of physical systems through their conserved quantities abound in the natural sciences. In data science, however, current research often focuses on regression problems, without routinely incorporating additional assumptions about the system that generated the data. Here, we propose to explore a particular type of underlying structure in the data: Hamiltonian systems, where an “energy” is conserved. Given a collection of observations of such a Hamiltonian system over time, we extract phase space coordinates and a Hamiltonian function of them that acts as the generator of the system dynamics. The approach employs an autoencoder neural network component to estimate the transformation from observations to the phase space of a Hamiltonian system. An additional neural network component is used to approximate the Hamiltonian function on this constructed space, and the two components are trained jointly. As an alternative approach, we also demonstrate the use of Gaussian processes for the estimation of such a Hamiltonian. After two illustrative examples, we extract an underlying phase space as well as the generating Hamiltonian from a collection of movies of a pendulum. The approach is fully data-driven, and does not assume a particular form of the Hamiltonian function.

Hamiltonian systems, neural networks, Gaussian processes

Neural network-based methods for modeling dynamical systems are again becoming widely used, and methods that explicitly learn the physical laws underlying continuous observations in time constitute a growing subfield. Our work contributes to this thread of research by incorporating additional information into the learned model; namely, the knowledge that the data arise as observations of an underlying Hamiltonian system.

We use machine learning to extract models of systems whose dynamics conserve a particular quantity (the Hamiltonian). We train several neural networks to approximate the total energy function for a pendulum, in both its natural action-angle form and also as seen through several distorting observation functions of increasing complexity. A key component of the approach is the use of automatic differentiation of the neural network in formulating the loss function that is minimized during training.

Our method requires data evaluating the first and second time derivatives of observations across the regions of interest in state space or, alternatively, sufficient information (such as a sequence of delayed measurements) to estimate these. We include examples in which the observation function is nonlinear and high-dimensional.

I Introduction

Current data science exploration of dynamics often focuses on regression or classification problems, without routinely incorporating additional assumptions about the nature of the system that generated the data. This has started to change recently, with approaches to extract generic dynamical systems by schmidt-2009 , specifying the variables and possible expressions for the formulas beforehand. In particular, treating the central object to be modeled as the discrete-time flow map $\Phi_{\tau}(\mathbf{x})=\mathbf{x}+\int_{0}^{\tau}\mathbf{f}(\mathbf{x}(t))\,\mathrm{d}t$ but learning $\Phi$ directly as a black box may result in qualitative differences from the true system Rico-Martinez1992a . Using the associated differential equation $\mathrm{d}\mathbf{x}/\mathrm{d}t=\mathbf{f}(\mathbf{x})$ instead allows us to exploit established numerical integration schemes to help approximate the flow map, and can be accomplished with a neural network by structuring the loss function similarly to such classical numerical integration schemes. In addition to our older work along these lines Gonzalez-Garcia1998b ; Rico-Martinez1994 ; Rico-Martinez1994b ; Rico-Martinez1993 ; Rico-Martinez1995 ; Rico-Martinez1992a ; rico-martinez_noninvertibility_1993 ; Rico-Martinez2000d , more recent work has revived this approach, focusing on treating the layers of a deep neural network as iterates of a dynamical system, where “learning” consists of discovering the right attractors e_proposal_2017 ; chang_multi-level_2018 ; lu_beyond_2018 . In particular, interest has focused on how residual networks he_deep_2016 and highway networks srivastava_highway_2015 can be interpreted as iterative solvers greff_highway_2017 or as iterated dynamical systems chen_neural_2018 . In the latter paper (a NeurIPS 2018 best paper awardee), the authors choose not to unroll the iteration in time explicitly, but instead use continuous-time numerical integration. While the focus was on the dynamics-of-layers concept, time series learning was also performed.

The Koopman operator has also been employed in combination with neural networks to extract conservation laws and special group structures kaiser-2018 ; lusch_deep_2018 . Symmetries in relation to conserved quantities are a well-studied problem in physics hamilton_general_1834 ; noether-1971 ; livio-2012 ; kondor-2018b . A recent thread of research consists of learning physics models from observation data mottaghi_newtonian_2016 , including modeling discrete-time data as observations of a continuous-time dynamical system raissi_inferring_2017 ; raissi_hidden_2018 .

It is informative to study physical systems through their conserved quantities, such as total energy, which can be encoded in a Hamiltonian function hamilton_general_1834 ; almeida1992 . The measure-preserving property of Hamiltonian systems has recently been exploited for transport of densities in Markov-chain Monte-Carlo methods brooks_mcmc_2011 , and variational autoencoders caterini_hamiltonian_2018 ; rezende_variational_2015 . For our purposes, a natural progression of the thread of computational modeling of physics from observations is to represent the Hamiltonian function directly.

Concurrently with this submission, two papers independently addressing similar issues appeared as preprints greydanus_hamiltonian_2019 ; toth_hamiltonian_2019 . In the first of these greydanus_hamiltonian_2019 , a loss function very similar to parts of our Eqn. (12) was used. The second paper focuses on the transformation of densities generated through the Hamiltonian. It is possible to draw some analogies between the second preprint and the old (non-Hamiltonian) work we mentioned above Gonzalez-Garcia1998b ; Rico-Martinez1994 ; Rico-Martinez1994b ; Rico-Martinez1993 ; Rico-Martinez1995 ; Rico-Martinez1992a ; rico-martinez_noninvertibility_1993 ; Rico-Martinez2000d , as this newer work also used rollout with a time step templated on a classical numerical integration method (here, symplectic Euler and leapfrog). Both papers also use the pendulum as an example, emphasizing conditions where the system can be well approximated by a linear system: a thin annulus around the stable steady state at $(q,p)=(0,0)$ where the trajectories are nearly circular.

The remainder of the paper is structured as follows:

1. We derive data-driven approximations (through two approaches: Gaussian processes and neural networks) of a Hamiltonian function on a given phase space, from time series data. The Hamiltonian functions we consider do not need to be separable as a sum $H(q,p)=T(p)+V(q)$ , and in our illustrative example we always work in the fully nonlinear regime of the pendulum.

2. We build data-driven reconstructions of a phase space from (a) linear and (b) nonlinear, non-symplectic transformations of the original Hamiltonian phase space. The reconstruction then leads to a symplectomorphic copy of the original Hamiltonian system.

3. We construct a completely data-driven pipeline combining (a) the construction of an appropriate phase space and (b) the approximation of a Hamiltonian function on this new phase space, from nonlinear, high-dimensional observations (e.g. from movies/sequences of movie snapshots).

II General description

A Hamiltonian system on Euclidean space $E=\mathbb{R}^{2n}$ , $n\in\mathbb{N}$ is determined through a function $H:E\to\mathbb{R}$ that defines the equations

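$$\dot{q}(t)=\frac{\partial H}{\partial p}(q(t),p(t)),\tag{1}$$

$$\dot{p}(t)=-\frac{\partial H}{\partial q}(q(t),p(t)),\tag{2}$$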

where $\dot{(\quad)}:=\mathrm{d}/\mathrm{d}t$ , and $q(t),p(t)\in\mathbb{R}^{n}$ are interpreted as “position” and “momentum” coordinates in the “phase space” $E$ . In many mechanical systems, and in all examples we discuss in this paper, the interpretation of the coordinates $q,p$ is reflected in the dynamics through $\dot{q}=p$ , i.e. $H(q,p)=\frac{1}{2}p^{2}+h(q)$ for some function $h:\mathbb{R}^{n}\to\mathbb{R}$ . In general, the equations (1, 2) imply that the Hamiltonian is constant along trajectories $(q(t),p(t))$ , because

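$$\frac{\mathrm{d}}{\mathrm{d}t}H(q,p)=\frac{\partial H}{\partial q}(q,p)\cdot\dot{q}+\frac{\partial H}{\partial p}(q,p)\cdot\dot{p}=0.\tag{3}$$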
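To spell out why the time derivative of $H$ vanishes, one can substitute Hamilton's equations $\dot{q}=\partial H/\partial p$ and $\dot{p}=-\partial H/\partial q$ into the chain rule. The two terms then cancel (written here for $n=1$ ; for general $n$ the products become dot products):

$$\frac{\partial H}{\partial q}\cdot\dot{q}+\frac{\partial H}{\partial p}\cdot\dot{p}=\frac{\partial H}{\partial q}\cdot\frac{\partial H}{\partial p}-\frac{\partial H}{\partial p}\cdot\frac{\partial H}{\partial q}=0.$$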

Equations (1, 2) can be restated as a partial differential equation for $H$ at every $(q,p)\in E$ :

$$\underbrace{\begin{bmatrix}0&I\\-I&0\end{bmatrix}}_{\omega}\cdot\nabla H(q,p)-\nu(q,p)=0,\tag{4}$$

where $I\in\mathbb{R}^{n\times n}$ is the identity matrix and $\nu$ is the vector field on $E$ (the left hand side of (1, 2)), which only depends on the state $(q,p)$ . The symplectic form on the given Euclidean space takes the form of the matrix $\omega$ .

In the next section, we discuss how to approximate the function $H$ from given data points $D=\{(q_{i},\dot{q}_{i},\ddot{q}_{i})\}_{i=1}^{N}$ . This involves solving the partial differential Eqn. (4) for $H$ . Since these equations determine $H$ only up to an additive constant, we assume that we also know the value $H_{0}=H(q_{0},p_{0})$ of $H$ at a single point $(q_{0},p_{0})$ in phase space. This is not a major restriction for the approach, because $H_{0}$ as well as $(q_{0},p_{0})$ can be chosen arbitrarily.
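When only sampled positions are available, the tuples $(q_{i},\dot{q}_{i},\ddot{q}_{i})$ have to be estimated first. The following is a minimal sketch of one way to do this from three equally spaced samples, assuming central finite differences (a choice made here for illustration, not necessarily the estimator used in the paper). The function name `central_differences` and the test signal $q(t)=\sin(t)$ are hypothetical.

```python
import math


def central_differences(q_prev, q_now, q_next, dt):
    """Estimate (q, q_dot, q_ddot) at the middle of three equally spaced samples."""
    q_dot = (q_next - q_prev) / (2.0 * dt)
    q_ddot = (q_next - 2.0 * q_now + q_prev) / (dt * dt)
    return q_now, q_dot, q_ddot


# Hypothetical test signal q(t) = sin(t), chosen because its derivatives are known.
dt = 1e-3
t = 0.7
q, q_dot, q_ddot = central_differences(
    math.sin(t - dt), math.sin(t), math.sin(t + dt), dt
)

# When q_dot = p, as in the examples of this paper, the phase-space data follow directly.
p, p_dot = q_dot, q_ddot
```

With $\dot{q}=p$ , the estimated pair $(\dot{q},\ddot{q})$ doubles as $(p,\dot{p})$ , which is the information the gradient of $H$ is matched against.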

III Example: the nonlinear pendulum

As an example, consider the case $n=1$ , and the Hamiltonian

$$H(q,p)=\frac{p^{2}}{2}+(1-\cos(q)).\tag{5}$$

This Hamiltonian forms the basis for the differential equations of the nonlinear pendulum, $\ddot{q}=-\sin(q)$ , or, in first-order form, $\dot{q}=\partial H(q,p)/\partial p=p$ and $\dot{p}=-\partial H(q,p)/\partial q=-\sin(q)$ . In this section, we numerically solve PDE (4) by approximating the solution $H$ using two approaches: Gaussian Processes rasmussen-2005 (§III.1) and neural networks (§III.2).
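As a quick illustration of these equations, the sketch below integrates the pendulum vector field and checks that $H$ stays nearly constant along the trajectory. The classical fourth-order Runge-Kutta integrator, the step size, and the initial condition are assumptions made for this example only.

```python
import math


def hamiltonian(q, p):
    """Pendulum Hamiltonian H(q, p) = p^2 / 2 + (1 - cos q)."""
    return 0.5 * p * p + (1.0 - math.cos(q))


def vector_field(q, p):
    """Hamilton's equations: q_dot = dH/dp = p, p_dot = -dH/dq = -sin(q)."""
    return p, -math.sin(q)


def rk4_step(q, p, dt):
    """One classical Runge-Kutta step (an illustrative integrator choice)."""
    k1q, k1p = vector_field(q, p)
    k2q, k2p = vector_field(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = vector_field(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = vector_field(q + dt * k3q, p + dt * k3p)
    q_new = q + dt * (k1q + 2.0 * k2q + 2.0 * k3q + k4q) / 6.0
    p_new = p + dt * (k1p + 2.0 * k2p + 2.0 * k3p + k4p) / 6.0
    return q_new, p_new


# Hypothetical initial condition, far from the nearly linear regime around (0, 0).
q, p = 2.0, 0.5
h_start = hamiltonian(q, p)
for _ in range(1000):
    q, p = rk4_step(q, p, 0.01)
energy_drift = abs(hamiltonian(q, p) - h_start)
```

The remaining drift in `energy_drift` comes from the integrator's discretization error, not from the dynamics, which conserve $H$ exactly.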

III.1 Approximation using Gaussian processes

We model the solution $H$ as a Gaussian Process $\hat{H}$ with a Gaussian covariance kernel,

$$k(x,x^{\prime})=\exp\left(-\|x-x^{\prime}\|^{2}/\epsilon^{2}\right),\tag{6}$$

where $x$ and $x^{\prime}$ are points in the phase space, i.e. $x=(q,p)$ , $x^{\prime}=(q^{\prime},p^{\prime})$ and $\epsilon\in\mathbb{R}^{+}$ is the kernel bandwidth parameter (we chose $\epsilon=2$ in this paper). Given a collection $X$ of $N$ points in the phase space, as well as the function values $H(X)$ at all points in $X$ , the conditional expectation of the Gaussian Process $\hat{H}$ at a new point $y$ is

$$\mathbb{E}\left[\hat{H}(y)\mid X,H(X)\right]=k(y,X)^{T}k(X,X^{\prime})^{-1}H(X),\tag{7}$$

where we write $\left[k(X,X^{\prime})\right]_{i,j}:=k(x_{i},x_{j})$ for the kernel matrix evaluated over all $x$ values in the given data set $X$ . In Eqn. (7), the dimensions of the symbols are $y\in\mathbb{R}^{2n}$ , $k(X,X^{\prime})\in\mathbb{R}^{N\times N}$ , $k(y,X)\in\mathbb{R}^{N}$ , and $H(X):=(H(x_{1}),H(x_{2}),\dots,H(x_{N}))\in\mathbb{R}^{N}$ . All vectors are column vectors. Estimates of the solution $H$ to the PDE at new points depend on the value of $H$ over the entire data set. We do not know the values of $H$ , but differentiating Eqn. (7) allows us to set up a system of equations to estimate $H$ at an arbitrary number $M$ of new points $y$ close to the points $x\in X$ , by using the information about the derivatives of $H$ given by the time derivatives $\dot{q}$ in Eqn. (1) and (the negative of) $\dot{p}$ in Eqn. (2). Let $g(x_{i})\in\mathbb{R}^{2n}$ be the gradient of $H$ at a point $x_{i}$ in the phase space, which we know from data $(\dot{q},\dot{p})$ . Then,

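$$\begin{array}{rcl}\left.\frac{\partial}{\partial y}\hat{H}(y)\right|_{y=x_{i}}&\approx&g(x_{i}),\\ \iff\ \left.\frac{\partial}{\partial y}k(y,X)^{T}\right|_{y=x_{i}}k(X,X^{\prime})^{-1}H(X)&\approx&g(x_{i}).\end{array}\tag{8}$$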

Together with an arbitrary pinning term at a point $x_{0}$ , the list of known derivatives leads to a linear system of $2nN+1$ equations, where we write

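$$\frac{\partial}{\partial x}k(x_{i},Y):=\left.\frac{\partial}{\partial x}k(x,x^{\prime})\right|_{x=x_{i},\,x^{\prime}\in Y}\tag{9}$$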

for the derivative of the given kernel function $k$ with respect to its first argument, evaluated at a given point $x_{i}$ in the first argument and all points in the dataset $Y$ in the second argument. For each $x_{i}$ , we thus have a (column) vector of $M$ derivative evaluations, which is transposed and stacked into a large matrix to form the full linear system

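$$\underbrace{\begin{bmatrix}\frac{\partial}{\partial x}k(x_{1},Y)^{T}k(Y,Y^{\prime})^{-1}\\ \frac{\partial}{\partial x}k(x_{2},Y)^{T}k(Y,Y^{\prime})^{-1}\\ \vdots\\ \frac{\partial}{\partial x}k(x_{N},Y)^{T}k(Y,Y^{\prime})^{-1}\\ k(x_{0},Y)^{T}k(Y,Y^{\prime})^{-1}\end{bmatrix}}_{\in\mathbb{R}^{(2nN+1)\times M}}\cdot\underbrace{\begin{bmatrix}H(Y)\end{bmatrix}}_{\in\mathbb{R}^{M}}=\underbrace{\begin{bmatrix}g(X)\\ H_{0}\end{bmatrix}}_{\in\mathbb{R}^{2nN+1}}.\tag{10}$$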

We reiterate: $X$ is a data set of $N$ points where we know the derivatives of $H$ through $g(x_{i})=(\frac{\partial H}{\partial q}(x_{i}),\frac{\partial H}{\partial p}(x_{i}))^{T}=(-\dot{p}_{i},\dot{q}_{i})^{T}\in\mathbb{R}^{2n}$ . We evaluate on a fine grid $Y$ of $M$ points (such that $k(Y,Y^{\prime})\in\mathbb{R}^{M\times M}$ , $\frac{\partial}{\partial x}k(x_{i},Y)\in\mathbb{R}^{2n\times M}$ ) and have information $g(X)\in\mathbb{R}^{2nN}$ on a relatively small set of $N$ points called $X$ (black dots in Fig. 1). The derivative of the Gaussian Process can be stated using the derivative of the kernel $k$ with respect to the first argument. The matrix inverse for $k(Y,Y^{\prime})$ is approximated by the inverse of $(k(Y,Y^{\prime})+\sigma^{2}I)$ , that is, through Tikhonov regularization with parameter $\sigma=10^{-5}$ , as is standard for Gaussian Process regression. Solving this system of equations for $H(Y)$ yields the approximation for the solution to the PDE. Fig. 1 shows the result, and Table 1 lists the mean squared error on the 625 training data points and on an independently drawn set of 200 validation points in the same domain, where no derivatives were available. See raissi_inferring_2017 for a more detailed discussion of the solution of PDEs with Gaussian Processes.
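The linear system above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' code: the domain $[-2,2]^{2}$ , the grid sizes ( $N=225$ , $M=81$ , chosen so that the least-squares problem is comfortably overdetermined, unlike the finer grid $Y$ described in the text), and the pinning point $(0,0)$ are all hypothetical. For numerical stability the sketch solves for $\alpha=(k(Y,Y^{\prime})+\sigma^{2}I)^{-1}H(Y)$ by least squares instead of forming the inverse explicitly, and then recovers $H(Y)$ from $\alpha$ .

```python
import numpy as np

EPS = 2.0      # kernel bandwidth epsilon, as stated in the text
SIGMA = 1e-5   # Tikhonov parameter sigma, as stated in the text


def kernel(A, B):
    """Gaussian kernel matrix with entries exp(-|a_i - b_j|^2 / EPS^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / EPS**2)


def kernel_grad(A, B):
    """Derivative of k(a_i, b_j) w.r.t. the first argument, shape (len(A), 2, len(B))."""
    diff = A[:, None, :] - B[None, :, :]                      # (N, M, 2)
    return (-2.0 / EPS**2) * diff.transpose(0, 2, 1) * kernel(A, B)[:, None, :]


def grid(lo, hi, k):
    """k-by-k grid of phase-space points (q, p), one point per row."""
    axis = np.linspace(lo, hi, k)
    Q, P = np.meshgrid(axis, axis)
    return np.column_stack([Q.ravel(), P.ravel()])


def true_H(Z):
    """Pendulum Hamiltonian, used only to validate the estimate."""
    return 0.5 * Z[:, 1] ** 2 + (1.0 - np.cos(Z[:, 0]))


X = grid(-2.0, 2.0, 15)          # N = 225 points with known gradients (hypothetical size)
Y = grid(-2.0, 2.0, 9)           # M = 81 points where H is estimated (hypothetical size)
x0, H0 = np.array([[0.0, 0.0]]), 0.0   # pinning point and value, H(0, 0) = 0

# Gradient data g(x_i) = (dH/dq, dH/dp) = (-p_dot, q_dot) = (sin q, p).
g = np.column_stack([np.sin(X[:, 0]), X[:, 1]])

# 2N derivative rows followed by one pinning row.
D = np.vstack([kernel_grad(X, Y).reshape(-1, Y.shape[0]), kernel(x0, Y)])
rhs = np.concatenate([g.ravel(), [H0]])

# Assumption: solve for alpha = K_reg^{-1} H(Y) rather than inverting K_reg.
alpha, *_ = np.linalg.lstsq(D, rhs, rcond=None)
K_reg = kernel(Y, Y) + SIGMA**2 * np.eye(Y.shape[0])
H_Y = K_reg @ alpha              # estimate of H on the grid Y

# Validate at random points where no derivative information was provided.
rng = np.random.default_rng(0)
V = rng.uniform(-2.0, 2.0, size=(200, 2))
mse = float(np.mean((kernel(V, Y) @ alpha - true_H(V)) ** 2))
```

Here `kernel(V, Y) @ alpha` plays the role of the conditional expectation in Eqn. (7) evaluated at the validation points, with the regularized inverse absorbed into `alpha`.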

III.2 Approximation using an artificial neural network

Another possibility for learning the form of $H$ using data is to represent the function with an artificial neural network (ANN) Goodfellow-et-al-2016 . We write

$$x_{l}=\sigma_{l}(x_{l-1}\cdot W_{l}+b_{l}),\quad l=1,\dots,L+1\tag{11}$$

where the activation function $\sigma_{l}$ is nonlinear (except where otherwise indicated, we used $\tanh$ ) for $l=1,\ldots,L$ (if $L\geq 1$ ) and the identity for $l=L+1$ . The learnable parameters of this ANN are $\{(\mathbf{W}_{l},\mathbf{b}_{l})\}_{l=1,\ldots,L+1}$ , and we gather all such learnable parameters from the multiple layers that may be used in one experiment into a parameter vector $w$ . If there are no hidden layers ( $L=0$ ), then we learn an affine transformation $\mathbf{x}_{1}=\mathbf{x}_{0}\cdot\mathbf{W}+\mathbf{b}$ . This format provides a surrogate function $\hat{H}(q,p)=x_{L+1}$ , where the input $x_{0}$ is the row vector $[q,p]$ . (Treating inputs as row vectors and using right-multiplication for weight matrices is convenient, as a whole batch of $N$ inputs can be presented as an $N$ -by- $2$ array.) For all the experiments shown here, this network for computing $\hat{H}$ has two hidden layers of width 16.
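The layer recursion can be illustrated with a small pure-Python forward pass. This is a sketch only: the random initialization and the helper names `init_layers`, `forward`, and `count_parameters` are hypothetical, and the paper's networks were built in TensorFlow. With input $[q,p]$ , two hidden layers of width 16, and a scalar output, the parameter count comes out to 337, which agrees with the §III.2 row of Table 1.

```python
import math
import random


def init_layers(widths, seed=0):
    """Random (W_l, b_l) pairs; each W_l is stored with shape (inputs, outputs)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        scale = 1.0 / math.sqrt(n_in)  # illustrative initialization, not from the paper
        W = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers


def forward(x, layers):
    """Apply x_l = sigma_l(x_{l-1} . W_l + b_l): tanh on hidden layers, identity on the last."""
    for index, (W, b) in enumerate(layers):
        z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
        is_last = index == len(layers) - 1
        x = z if is_last else [math.tanh(v) for v in z]
    return x


def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)


layers = init_layers([2, 16, 16, 1])    # row-vector input [q, p], L = 2 hidden layers
h_hat = forward([0.5, -1.0], layers)[0]  # surrogate value H_hat(q, p) = x_{L+1}
n_params = count_parameters(layers)      # (2*16 + 16) + (16*16 + 16) + (16 + 1) = 337
```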

Similarly, in the case(s) we need to learn additional transformations $\hat{\mathbf{\theta}}$ and $\hat{\mathbf{\theta}}^{-1}$ (see §IV), they are also learned using such networks.

We collect training data by sampling a number of initial conditions in the rectangle $(q,p)\in[-2\pi,2\pi]\times[-6,6]$ , then simulate short trajectories from each to their final $(q,p)$ points. For each of these, we additionally evaluate $(\dot{q},\dot{p})$ . Shuffling over simulations once per epoch, and dividing this dataset into batches, we then perform batchwise stochastic gradient descent to learn the parameters $w$ using an Adam optimizer on the objective function defined below.
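The data collection just described might look roughly as follows. The number of trajectories, the trajectory length, the step size, the Runge-Kutta integrator, and the batch size are all assumptions made for this sketch, since the text does not fix them here.

```python
import math
import random


def vector_field(q, p):
    """Pendulum dynamics: q_dot = p, p_dot = -sin(q)."""
    return p, -math.sin(q)


def rk4_step(q, p, dt):
    """One classical Runge-Kutta step (illustrative integrator choice)."""
    k1 = vector_field(q, p)
    k2 = vector_field(q + 0.5 * dt * k1[0], p + 0.5 * dt * k1[1])
    k3 = vector_field(q + 0.5 * dt * k2[0], p + 0.5 * dt * k2[1])
    k4 = vector_field(q + dt * k3[0], p + dt * k3[1])
    q += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
    p += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return q, p


def make_dataset(n_trajectories, steps=10, dt=0.01, seed=0):
    """Sample initial conditions, simulate briefly, and store (q, p, q_dot, p_dot)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_trajectories):
        q = rng.uniform(-2.0 * math.pi, 2.0 * math.pi)
        p = rng.uniform(-6.0, 6.0)
        for _ in range(steps):
            q, p = rk4_step(q, p, dt)
        q_dot, p_dot = vector_field(q, p)  # derivatives at the final point
        data.append((q, p, q_dot, p_dot))
    return data


dataset = make_dataset(1000)            # hypothetical dataset size
random.Random(1).shuffle(dataset)       # reshuffle once per epoch
batch_size = 50                         # hypothetical batch size
batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
```

Each element of `batches` would then be passed to one optimizer step on the objective function.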

For this paper, all neural networks were constructed and trained using TensorFlow, and the gradients necessary for evaluating the Hamiltonian loss terms in Eqn. (12) were computed using TensorFlow’s core automatic differentiation capabilities.

This objective function comprises a scalar function evaluated on each data 4-tuple $d=(q,p,\dot{q},\dot{p})$ in the batch, and then averaged over the batch. This scalar function is written as

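$$f(q,p,\dot{q},\dot{p};\,w)=\sum_{k=1}^{4}c_{k}f_{k},\tag{12}$$

$$\begin{array}{rclrcl}f_{1}&=&\left(\frac{\partial\hat{H}}{\partial p}-\dot{q}\right)^{2},&f_{2}&=&\left(\frac{\partial\hat{H}}{\partial q}+\dot{p}\right)^{2},\\ f_{3}&=&\left(\hat{H}(q_{0},p_{0})-H_{0}\right)^{2},&f_{4}&=&\left(\frac{\partial\hat{H}}{\partial q}\dot{q}+\frac{\partial\hat{H}}{\partial p}\dot{p}\right)^{2},\end{array}$$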

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$ , the loss-term weights $c_{k}$ are chosen to emphasize certain properties of the problem thus posed, and the partial derivatives $\frac{\partial\hat{H}}{\partial p}$ and $\frac{\partial\hat{H}}{\partial q}$ are computed explicitly through automatic differentiation. Except for $c_{2}$ , all $c_{k}$ values are set to either $1$ or $0$ depending on whether the associated loss term is to be included or excluded. Because of the square term in Eqn. (5), we set $c_{2}$ arbitrarily to $10$ if nonzero, so the loss is not dominated by $f_{1}$ . An alternative might be to set $c_{1}$ to $1/10$ .
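To make the four loss terms concrete, the sketch below evaluates them for a candidate $\hat{H}$ on a small batch. Central finite differences stand in for automatic differentiation here, which is a deliberate simplification, and the weights $(c_{1},c_{2},c_{3},c_{4})=(1,10,1,0)$ are one choice consistent with the text, not a confirmed default. The batch and the harmonic-oscillator guess are hypothetical.

```python
import math


def true_H(q, p):
    """Pendulum Hamiltonian, used as a candidate that should give (almost) zero loss."""
    return 0.5 * p * p + (1.0 - math.cos(q))


def gradient(H, q, p, h=1e-5):
    """Central finite differences; a stand-in for automatic differentiation."""
    dH_dq = (H(q + h, p) - H(q - h, p)) / (2.0 * h)
    dH_dp = (H(q, p + h) - H(q, p - h)) / (2.0 * h)
    return dH_dq, dH_dp


def loss(H_hat, batch, c=(1.0, 10.0, 1.0, 0.0), pin=(0.0, 0.0, 0.0)):
    """Batch average of sum_k c_k f_k, with pin = (q0, p0, H0)."""
    q0, p0, H0 = pin
    f3 = (H_hat(q0, p0) - H0) ** 2                       # pinning term
    total = 0.0
    for q, p, q_dot, p_dot in batch:
        dH_dq, dH_dp = gradient(H_hat, q, p)
        f1 = (dH_dp - q_dot) ** 2                        # q_dot = dH/dp
        f2 = (dH_dq + p_dot) ** 2                        # p_dot = -dH/dq
        f4 = (dH_dq * q_dot + dH_dp * p_dot) ** 2        # dH/dt = 0
        total += c[0] * f1 + c[1] * f2 + c[2] * f3 + c[3] * f4
    return total / len(batch)


# Hypothetical batch of 4-tuples (q, p, q_dot, p_dot) from the pendulum.
batch = [(q, p, p, -math.sin(q)) for q in (-2.0, 0.3, 1.5) for p in (-1.0, 0.0, 2.0)]

loss_true = loss(true_H, batch)
loss_wrong = loss(lambda q, p: 0.5 * p * p + 0.5 * q * q, batch)  # linearized guess
```

The linearized guess matches $\partial H/\partial p$ exactly, so its loss comes entirely from $f_{2}$ , which grows with $|q|$ as the pendulum leaves the nearly linear regime.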

Since equations (1) and (2) together imply (3), any one of the three terms $f_{1}$ , $f_{2}$ , and $f_{4}$ could be dropped as redundant; therefore, we can set $c_{4}$ to zero, but monitor $\dot{\hat{H}}$ as a useful sanity check on the accuracy of the learned solution. In Fig. 1, we show the results of this process with our default nonzero values for $c_{k}$ .

As an ablation study, we explored the effect (not shown here) of removing the first, second, and fourth terms. By construction, the true $H_{t}(q,p)$ function is zero for all $(q,p)$ . Note that this is only ever achieved to any degree in the central box in the figure, where data was densely sampled. Removing $f_{4}$ made no visible difference in the quality of our $\hat{H}_{t}\approx 0$ approximation, which was expected due to the redundancy in the set of equations (1), (2), and (3). However, removing either $f_{1}$ or $f_{2}$ gives poor results across the figure, despite the apparent redundancy of these terms with $f_{4}$ . This might be due to not balancing the contributions of the $\dot{p}$ and $\dot{q}$ terms, for which we attempted to compensate by unequal weighting values $c_{1}$ and $c_{2}$ .

IV Estimating Hamiltonian structure from observations

We now consider a set of observation functions $\mathbf{\theta}:E\to\mathbb{R}^{M}$ , $\mathbf{\theta}=(y_{1},\dots,y_{M})$ , with $M\geq\dim E=2n$ , such that $\mathbf{\theta}$ is a diffeomorphism between the phase space $E$ and its image $\mathbf{\theta}(E)$ . In this setting, the notion of a symplectomorphism is important almeida1992 . In general, a symplectomorphism is a diffeomorphism that leaves the symplectic structure on a manifold invariant. In our setting, a symplectomorphism of $E=Q\times P$ maps to a deformed space $\hat{E}=\hat{Q}\times\hat{P}$ where the system dynamics in the new variables $\hat{q}\in\hat{Q}$ , $\hat{p}\in\hat{P}$ is again Hamiltonian, and conjugate to the original Hamiltonian dynamics. Not every diffeomorphism is a symplectomorphism, and we do not assume that $\mathbf{\theta}$ is a symplectomorphism. A constant scaling of the coordinates is also not possible to distinguish from a scaling in the Hamiltonian function itself, so the recovered system will be a symplectomorphic copy of the original, scaled by an arbitrary constant.

In the setting of this section, we do not assume access to $E$, $H$, or the explicit form of $\mathbf{\theta}$. Only a collection of points $\mathbf{\theta}_{i}$ and time derivatives $\frac{d}{dt}\mathbf{\theta}_{i}$ in the image $\mathbf{\theta}(E)$ is available. We describe an approach to approximate a new map $\hat{\mathbf{\theta}}^{-1}:\mathbf{\theta}(E)\to\hat{E}$ into a symplectomorphic copy of $E$ through an autoencoder [7], such that the transformed system in $\hat{E}$ is conjugate to the original Hamiltonian system in $E$. Upon convergence, and if we had access to $\mathbf{\theta}$, the map $\hat{\mathbf{\theta}}^{-1}\circ\mathbf{\theta}\equiv S:E\to\hat{E}$ would approximate a symplectomorphism, and $\mathbf{\theta}^{-1}\circ\hat{\mathbf{\theta}}\equiv S^{-1}$ would be its inverse. During the estimation of $\hat{\mathbf{\theta}}^{-1}$, we simultaneously approximate the new Hamiltonian function $\hat{H}:\hat{E}\to\mathbb{R}$. Fig. 2 visualizes the general approach, where only the information $(x,y)_{i}$ and $\frac{d}{dt}(x,y)_{i}$ is available to the procedure, while $\hat{\mathbf{\theta}}^{-1}$ and a Hamiltonian $\hat{H}$ on $\hat{E}$ are constructed numerically.

An interesting and important feature common to the next three examples is that one cannot expect to systematically recover the original $(q,p)$ values from the given observation data $(x,y)$ . Only symplectic transformations $(\hat{q},\hat{p})$ can be recovered, which are enough to define a Hamiltonian. Once the coordinates $(\hat{q},\hat{p})$ are fixed, the Hamiltonian function in these coordinates is unique up to an additive constant.

Table I contains training and validation loss of the networks for all experiments. Additionally, we trained the network from §IV.4 with only 331 images, and did observe a significantly higher validation loss (not shown), consistent with overfitting.

IV.1 A composite loss function for the joint learning of a transformation and a Hamiltonian

The following loss function is used to train an autoencoder component together with a Hamiltonian function approximation network component:

[TABLE]

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$ and the learned transformations $\hat{\mathbf{\theta}}$ and $\hat{\mathbf{\theta}}^{-1}$, and the time derivatives in the space $\hat{E}$ are computed as $\dot{\hat{q}}=\dot{x}\frac{\partial\hat{q}}{\partial x}+\dot{y}\frac{\partial\hat{q}}{\partial y}$ and $\dot{\hat{p}}=\dot{x}\frac{\partial\hat{p}}{\partial x}+\dot{y}\frac{\partial\hat{p}}{\partial y}$, using the Jacobian $D\hat{\mathbf{\theta}}^{-1}=\left[\begin{array}{cc}\frac{\partial\hat{q}}{\partial x}&\frac{\partial\hat{q}}{\partial y}\\ \frac{\partial\hat{p}}{\partial x}&\frac{\partial\hat{p}}{\partial y}\end{array}\right]$ of the transformation $\hat{\mathbf{\theta}}^{-1}$ (computed pointwise with automatic differentiation).

When we learn an (especially nonlinear) transformation $\hat{\mathbf{\theta}}^{-1}$, in addition to the Hamiltonian $\hat{H}(\hat{q},\hat{p})$, including the $f_{4}$ term in the composite loss can have a detrimental effect on the learned transformation. There exists an easily-encountered naive local minimum in which $\hat{\mathbf{\theta}}^{-1}$ maps all of the sampled values from $\mathbf{\theta}(E)\ni(x,y)$ to a single point in $\hat{E}$, and the Hamiltonian learned is merely the constant function at the pinning value, $\hat{H}(\hat{q},\hat{p})=\hat{H}_{0}$. In this state (or an approximation of this state), all of $\frac{\partial H}{\partial\hat{q}}$, $\frac{\partial H}{\partial\hat{p}}$, and the elements of $D\hat{\mathbf{\theta}}^{-1}$ are zero, so the loss (with terms $f_{1}$, $f_{2}$, $f_{3}$, and optionally $f_{4}$) is zero (resp., small). A related failure is that in which the input in $\mathbf{\theta}(E)$ is collapsed by $\hat{\mathbf{\theta}}^{-1}$ to a line or curve in $\hat{E}$.
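The chain rule for the latent time derivatives and the collapsed local minimum can be illustrated with a minimal sketch. It assumes that the $f_{1}$ and $f_{2}$ terms are squared residuals of Hamilton's equations (their exact form is not shown in this section), and it uses central finite differences as a stand-in for the automatic differentiation used in the paper. All function names are hypothetical.

```python
import math

def jacobian_fd(encoder, x, y, h=1e-6):
    """Central-difference Jacobian [[dq/dx, dq/dy], [dp/dx, dp/dy]].

    Stand-in for automatic differentiation of the encoder theta_hat^{-1}.
    """
    qxp, pxp = encoder(x + h, y)
    qxm, pxm = encoder(x - h, y)
    qyp, pyp = encoder(x, y + h)
    qym, pym = encoder(x, y - h)
    return [[(qxp - qxm) / (2 * h), (qyp - qym) / (2 * h)],
            [(pxp - pxm) / (2 * h), (pyp - pym) / (2 * h)]]

def latent_rates(jac, xdot, ydot):
    """Chain rule: (q_hat_dot, p_hat_dot) = D(theta_hat^{-1}) . (xdot, ydot)."""
    return (jac[0][0] * xdot + jac[0][1] * ydot,
            jac[1][0] * xdot + jac[1][1] * ydot)

def hamilton_residuals(dH_dq, dH_dp, qdot, pdot):
    """Assumed residuals: (dH/dp - q_dot)^2 and (dH/dq + p_dot)^2."""
    return (dH_dp - qdot) ** 2, (dH_dq + pdot) ** 2

def identity_encoder(x, y):
    """Healthy case: the observations already are (q, p)."""
    return x, y

def collapsed_encoder(x, y):
    """Failure case: every observation is mapped to one point."""
    return 0.7, 0.7

x, y = 0.3, -0.2
xdot, ydot = y, -math.sin(x)  # pendulum flow, H = p^2/2 + (1 - cos q)

# Healthy case, using the true gradient of H: dH/dq = sin q, dH/dp = p.
qd, pd = latent_rates(jacobian_fd(identity_encoder, x, y), xdot, ydot)
good = hamilton_residuals(math.sin(x), y, qd, pd)

# Collapsed case: the Jacobian vanishes and a constant H_hat has zero gradient.
qd_c, pd_c = latent_rates(jacobian_fd(collapsed_encoder, x, y), xdot, ydot)
bad = hamilton_residuals(0.0, 0.0, qd_c, pd_c)

print("healthy residuals:", good)
print("collapsed residuals:", bad)
```

Both cases give residuals at or near zero, which is why these terms alone cannot distinguish a correct solution from the collapsed one.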

To alleviate both of these problems we added a new loss component $f_{6}$ . That is, we require that the learned transformation not collapse the input. It is sufficient for the corresponding weighting factor $c_{6}$ to be a very small nonzero value (e.g. $10^{-6}$ ). The addition of $f_{6}$ to our loss helps us to avoid falling early in training into the unrecoverable local minimum described above, and also helps keep the scale of the transformed variables $\hat{q}$ and $\hat{p}$ macroscopic.
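A rough sketch of such an anti-collapse term is given below. The exact expression for $f_{6}$ is not shown in this section; §IV.4 describes it as proportional to the reciprocal of the determinant of the transformation's Jacobian, so the squared reciprocal and the small regularizer `eps` used here are assumptions, as are the function names.

```python
def det2(jac):
    """Determinant of a 2x2 Jacobian given as nested lists."""
    return jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]

def anti_collapse_term(jac, eps=1e-12):
    """Assumed form of f6: squared reciprocal of det(D theta_hat^{-1})."""
    d = det2(jac)
    return 1.0 / (d * d + eps)

c6 = 1e-6  # small nonzero weight, as quoted in the text

jacobians = {
    "healthy": [[1.0, 0.0], [0.0, 1.0]],
    "near_line": [[1.0, 0.0], [1.0, 1e-4]],     # image is almost a line
    "near_point": [[1e-4, 0.0], [0.0, 1e-4]],   # image is almost a point
}
weighted = {name: c6 * anti_collapse_term(jac) for name, jac in jacobians.items()}
for name, value in weighted.items():
    print(f"{name}: {value:.3e}")
```

With this form, the weighted term is negligible for a well-conditioned map and grows by many orders of magnitude as the map approaches a line or a point, which is consistent with a very small $c_{6}$ being sufficient.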

IV.2 Example: linear transformation of the pendulum

We generate data from a rectangular region $x\in[-1,1]$, $y\in[-1,1]$, then transform the region linearly with $\mathbf{\theta}^{-1}(x,y)=A^{-1}[x,y]^{T}=[q,p]^{T}$. The matrix $A^{-1}$ is the inverse of $A=R\cdot\Lambda$, a scaling followed by a rotation, where $\Lambda=\left[\begin{array}{cc}\lambda_{1}&0\\ 0&\lambda_{2}\end{array}\right]$, $\lambda_{1}=1$, $\lambda_{2}=64$, $R=\left[\begin{array}{cc}\cos\rho&-\sin\rho\\ \sin\rho&\cos\rho\end{array}\right]$, and $\rho=5^{\circ}$. Our observation data $\mathbf{\theta}(E)$ is thus given by $[x,y]^{T}=A\cdot[q,p]^{T}$. Using the true Hamiltonian $H(q,p)=p^{2}/2+(1-\cos q)$, we additionally compute true values for $dq/dt$ and $dp/dt$, and then use $A$ to propagate these to $x$ and $y$ via $\frac{\mathrm{d}x}{\mathrm{d}t}=\frac{\partial x}{\partial q}\frac{\mathrm{d}q}{\mathrm{d}t}+\frac{\partial x}{\partial p}\frac{\mathrm{d}p}{\mathrm{d}t}$, and similarly for $y$, where the partial derivatives are computed analytically (here, just the elements of $A$ itself).
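The data generation just described can be sketched as follows. The values of $\rho$, $\lambda_{1}$, $\lambda_{2}$ and the Hamiltonian come from the text; the sample count, the random seed, and the helper names are illustrative assumptions.

```python
import math
import random

rho = math.radians(5.0)
lam1, lam2 = 1.0, 64.0
c, s = math.cos(rho), math.sin(rho)

# A = R . Lambda (scaling, then rotation), written out for the 2x2 case.
A = [[c * lam1, -s * lam2],
     [s * lam1, c * lam2]]
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det_A, -A[0][1] / det_A],
         [-A[1][0] / det_A, A[0][0] / det_A]]

def apply(m, v):
    """Apply a 2x2 matrix to a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

def sample(rng):
    """One training sample: observation (x, y) and its time derivative."""
    x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    q, p = apply(A_inv, (x, y))
    # Hamilton's equations for H = p^2/2 + (1 - cos q).
    qdot, pdot = p, -math.sin(q)
    # The map is linear, so the Jacobian of (q, p) -> (x, y) is A itself.
    xdot, ydot = apply(A, (qdot, pdot))
    return (x, y), (xdot, ydot)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
data = [sample(rng) for _ in range(1000)]
print(data[0])
```

Only the pairs $(x,y)$ and $(\dot{x},\dot{y})$ would be passed to the network; $q$, $p$, and $A$ stay hidden from it.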

Our network is then presented with observation data $x,y$ and its corresponding time-derivatives. Its task is to learn $\hat{A}$ and $\hat{A}^{-1}$, which convert to and from variables $\hat{q},\hat{p}$ (symplectomorphic to the original $q,p$); and a Hamiltonian $\hat{H}$ in this new space. When evaluating the loss, the time derivatives of $\hat{q}$ and $\hat{p}$ are likewise computed via automatic differentiation using the chain rule through the learned transformation, e.g. as $\frac{\mathrm{d}\hat{q}}{\mathrm{d}t}=\frac{\partial\hat{q}}{\partial x}\frac{\mathrm{d}x}{\mathrm{d}t}+\frac{\partial\hat{q}}{\partial y}\frac{\mathrm{d}y}{\mathrm{d}t}$. Note also that $[\hat{q},\hat{p}]^{T}=\hat{A}^{-1}\cdot A\cdot[q,p]^{T}$, so if the original space $E$ could be found, $\hat{A}$ would satisfy $\hat{A}\cdot A^{-1}=I$. This cannot be expected given only the data in $\mathbf{\theta}(E)$; we can only be sure that $\hat{A}\cdot A^{-1}$ approximates a symplectomorphism of the original $E$.

We could learn $\hat{\mathbf{\theta}}^{-1}$ from a general class of nonlinear functions, such as a small $\tanh$ neural network, but here we simply learn $\hat{A}$ and $\hat{A}^{-1}$ as linear transformations (that is, we have a linear “neural network”, where $L=0$ in Eqn. (11), and $\mathbf{b}_{1}$ is fixed as $\mathbf{0}$). As we include the reconstruction error of this autoencoder in our loss function, $\hat{A}^{-1}$ is constrained to be the inverse of $\hat{A}$ to a precision no worse than the $f_{1}$ and $f_{2}$ terms in Eqn. (14), after all three are scaled by their corresponding $c_{k}$ values. In fact, for the linear case, initially the autoencoder’s contribution to the loss is significantly lower than the Hamiltonian components (see Fig. 3), but, as training proceeds and the $f_{1}$ and $f_{2}$ terms are improved (at the initial expense of raising the autoencoder loss), larger reductions in loss are possible by optimizing $\hat{H}$ rather than $\hat{\mathbf{\theta}}$, so $f_{5}$ is decreased as quickly as (the larger of) $f_{1}$ or $f_{2}$. That is, the autoencoder portion of the loss falls quickly to the level where it no longer contributes to the total loss given its weighting in the loss sum.

We find that the learned symplectomorphism $S(q,p)=\hat{A}^{-1}\cdot A\cdot[q,p]^{T}$, depicted in its $q$ portion in Fig. 3, preserves $q$ unmixed with $p$ in one or the other of its two discovered coordinates. This is because both (a) $(q,p)\mapsto(p,-q)$ as well as (b) $(q,p)\mapsto(q,p+f(q))$ for any smooth function $f$ are symplectomorphisms. They are special because $H(q,p)=\hat{H}(\hat{q}(q,p),\hat{p}(q,p))$, i.e. they even preserve the Hamiltonian formulation. For the map (a), the transformation of the Hamiltonian can be seen from the following derivation.

[TABLE]

Here, the first equality of (15, 16) follows from the map and the last equality of (15, 16) follows from the requirement that $\hat{q},\hat{p}$ follow Hamiltonian dynamics with respect to the new Hamiltonian $\hat{H}$. Equations (15, 16) then show that the new Hamiltonian is the same as the old one (modulo an additive constant) when considered as a map on the old coordinates.
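The claim that maps (a) and (b) preserve the Hamiltonian formulation can also be checked numerically. The sketch below is an illustration and not part of the paper: it pulls the pendulum Hamiltonian back through each map, $\hat{H}=H\circ S^{-1}$, and measures how strongly the transformed velocities violate Hamilton's equations for $\hat{H}$. The choice $f(q)=\sin q$, the test point, the finite-difference gradients, and the non-symplectic scaling used for contrast are all assumptions.

```python
import math

def H(q, p):
    """Pendulum Hamiltonian used throughout the text."""
    return 0.5 * p * p + (1.0 - math.cos(q))

def grad(f, a, b, h=1e-6):
    """Central-difference gradient of f(a, b)."""
    return ((f(a + h, b) - f(a - h, b)) / (2 * h),
            (f(a, b + h) - f(a, b - h)) / (2 * h))

def violation(forward, inverse, jac, q, p):
    """Largest violation of Hamilton's equations in the new coordinates."""
    dH_dq, dH_dp = grad(H, q, p)
    qdot, pdot = dH_dp, -dH_dq
    # Push the velocity through the Jacobian of the map S.
    j = jac(q, p)
    qh_dot = j[0][0] * qdot + j[0][1] * pdot
    ph_dot = j[1][0] * qdot + j[1][1] * pdot

    def H_hat(qh, ph):
        # Pulled-back Hamiltonian H_hat = H o S^{-1}.
        return H(*inverse(qh, ph))

    qh, ph = forward(q, p)
    dHh_dq, dHh_dp = grad(H_hat, qh, ph)
    return max(abs(qh_dot - dHh_dp), abs(ph_dot + dHh_dq))

maps = {
    "a": (lambda q, p: (p, -q),
          lambda qh, ph: (-ph, qh),
          lambda q, p: [[0.0, 1.0], [-1.0, 0.0]]),
    "b": (lambda q, p: (q, p + math.sin(q)),
          lambda qh, ph: (qh, ph - math.sin(qh)),
          lambda q, p: [[1.0, 0.0], [math.cos(q), 1.0]]),
    "scaling": (lambda q, p: (q, 2.0 * p),  # not symplectic
                lambda qh, ph: (qh, 0.5 * ph),
                lambda q, p: [[1.0, 0.0], [0.0, 2.0]]),
}
results = {name: violation(f, g, j, 0.4, 0.9) for name, (f, g, j) in maps.items()}
for name, value in results.items():
    print(f"{name}: {value:.2e}")
```

For maps (a) and (b) the violation is at the level of finite-difference error, whereas the plain scaling of $p$ leaves a violation of order one, so $H\circ S^{-1}$ does not generate the transformed dynamics in that case.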

IV.3 Example: nonlinear transformation of the pendulum

In addition to the linear $\mathbf{\theta}$ of §IV.2, we show comparable results for a nonlinear transformation $\mathbf{\theta}$ and learned $\hat{\mathbf{\theta}}$. Specifically, we transform the data through $(x,y)=\mathbf{\theta}(q,p)$ where

[TABLE]

the inverse of which is given by $q=20(x-y^{2})$ and $p=10(y-x^{2}+2xy^{2}-y^{4})$. We use the analytical Jacobian of this $\mathbf{\theta}$ to compute the necessary $\dot{x}$ and $\dot{y}$ for input to our network.
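A minimal sketch of this transformation is given below. The inverse is taken from the expressions quoted above; the forward map is obtained here by solving that inverse for $(x,y)$, using $y-x^{2}+2xy^{2}-y^{4}=y-(x-y^{2})^{2}$, and is therefore a derived assumption rather than a copy of the paper's equation. Function names are hypothetical.

```python
import math

def theta_inv(x, y):
    """Inverse map quoted in the text: (x, y) -> (q, p)."""
    q = 20.0 * (x - y ** 2)
    p = 10.0 * (y - x ** 2 + 2.0 * x * y ** 2 - y ** 4)
    return q, p

def theta(q, p):
    """Forward map derived by solving theta_inv for (x, y)."""
    u = q / 20.0
    y = p / 10.0 + u ** 2
    x = u + y ** 2
    return x, y

def observed_rates(q, p):
    """(xdot, ydot) from the analytical Jacobian of theta and the pendulum flow."""
    qdot, pdot = p, -math.sin(q)
    u = q / 20.0
    y = p / 10.0 + u ** 2
    # Partial derivatives of y = p/10 + (q/20)^2 and x = q/20 + y^2.
    dy_dq, dy_dp = u / 10.0, 0.1
    dx_dq, dx_dp = 0.05 + 2.0 * y * dy_dq, 2.0 * y * dy_dp
    return dx_dq * qdot + dx_dp * pdot, dy_dq * qdot + dy_dp * pdot

q0, p0 = 1.2, -0.7
x0, y0 = theta(q0, p0)
print("observation:", (x0, y0))
print("round trip:", theta_inv(x0, y0))
print("rates:", observed_rates(q0, p0))
```

The round trip through `theta` and `theta_inv` returns the starting $(q,p)$ up to rounding, which gives a quick consistency check on the derived forward map.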

We proceed as before, except that we no longer restrict the form of the learned $\hat{\mathbf{\theta}}$ and $\hat{\mathbf{\theta}}^{-1}$ to linear transformations, but instead allow small multi-layer perceptrons of a form similar to that used for $\hat{H}$.

The resulting induced symplectomorphism is again one which appears to preserve an approximately monotonically increasing or decreasing $q$ in either $\hat{q}$ or $\hat{p}$ . This can be seen in Fig. 4.

IV.4 Example: constructing a Hamiltonian system from nonlinear, high-dimensional observations of $q,p$

As a further demonstration of the method, we use a graphical rendering of the moving pendulum example from before as the transformation $\mathbf{\theta}$ from the intrinsic state $(q,p)$ to an image ${\mathbf{x}}$ as our high-dimensional observable. We use a symplectic semi-implicit Euler’s method

[TABLE]

to generate $q(t_{i}),p(t_{i})$ trajectories for various initial conditions, and then a simple graphical renderer to display these as images (see Fig. 5). When rendering our video frames, we drag a tail of decaying images behind the moving pendulum head, so that information about both position $q$ and velocity $p$ is present in each rendered frame. This is done by iterating over each $q(t_{i}),p(t_{i})$ trajectory, and, for each $t_{i}$ , (1) multiplying the entire current image by a cooling factor of $\sim 0.96$ , (2) adding a constant heating amount to the image in a circle of fixed radius centered around the current $\cos(q),\sin(q)$ point, and (3) clipping the image per-pixel to lie within $[0,1]$ . Samples of the resulting images are visible in Fig. 5.
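The integration and rendering steps can be sketched as below. The update order inside the semi-implicit Euler step is an assumption (the standard momentum-first variant), as are the step size, image size, head radius, heating amount, and the scaling of $(\cos q,\sin q)$ to pixel coordinates; only the cooling factor of roughly 0.96 and the three rendering steps are taken from the text.

```python
import math

def symplectic_euler(q, p, h, steps):
    """Semi-implicit Euler for H = p^2/2 + (1 - cos q); update order assumed."""
    traj = []
    for _ in range(steps):
        p = p - h * math.sin(q)  # update momentum with the old position
        q = q + h * p            # update position with the new momentum
        traj.append((q, p))
    return traj

def render(traj, size=32, radius=2.5, cooling=0.96, heat=0.5):
    """Return the last frame, with a decaying tail behind the pendulum head."""
    img = [[0.0] * size for _ in range(size)]
    half = (size - 1) / 2.0
    for q, _p in traj:
        # Head position (cos q, sin q), scaled into pixel coordinates.
        cx = half + 0.8 * half * math.cos(q)
        cy = half + 0.8 * half * math.sin(q)
        for i in range(size):
            for j in range(size):
                v = img[i][j] * cooling                      # (1) cool
                if (j - cx) ** 2 + (i - cy) ** 2 <= radius ** 2:
                    v += heat                                # (2) heat the head
                img[i][j] = min(1.0, max(0.0, v))            # (3) clip to [0, 1]
    return img

traj = symplectic_euler(q=1.0, p=0.0, h=0.05, steps=400)
frame = render(traj)
print("brightest pixel:", max(max(row) for row in frame))
```

Because older head positions are cooled at every step, a faster pendulum leaves a longer visible trail, which is how the momentum enters a single frame.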

Though we do not use $p(t_{i})$ directly in this procedure, its value is observed in the length of the resulting tail dragged behind the moving pendulum head. We create trajectories long enough that the effect of the initial formation of the tail (during which its length is not necessarily a good indicator of $p$) is no longer visible, and then use only the final two observations from these trajectories.

In order to make the approach agnostic to the data, we do not want to assume that the space $\hat{E}$ is periodic, so instead, we use a four-dimensional phase space with elements $\hat{\mathbf{z}}=[\hat{q}_{1},\hat{q}_{2},\hat{p}_{1},\hat{p}_{2}]=[\hat{\mathbf{q}},\hat{\mathbf{p}}]$ and consider the splitting into $(\hat{q}_{1},\hat{q}_{2})$ and $(\hat{p}_{1},\hat{p}_{2})$ during training. In the space of input images, the manifold does not fill up four-dimensional space, but a cylinder, which is mapped to the four-dimensional encoding layer by the autoencoder.

In addition, to simplify the learning problem, we learn $\hat{\mathbf{\theta}}^{-1}$ as the combination of a projection onto the first twenty principal components of the training dataset followed by a dense autoencoder, reserving learning $\hat{\mathbf{\theta}}^{-1}$ as an end-to-end convolutional autoencoder for future work. The encoding provides $\hat{\mathbf{z}}$ and, as before, we learn $\hat{\mathbf{\theta}}^{-1}$ in tandem with $\hat{H}$, where now the conditions of Eqn. (14) are upgraded to vector equivalents to accommodate $\hat{\mathbf{z}}$.

In §IV.1, we added a loss term proportional to the reciprocal of the determinant of the transformation’s Jacobian in order to avoid transformations that collapsed the phase space. Here, this was not such an issue; of course, some collapsing of the high-dimensional representation is obviously required. Instead, a common mode of failure turned out to be learning constant $\hat{H}$ functions, which automatically satisfy the Hamiltonian requirements (a constant is naturally a conserved quantity). To avoid this, we considered several possible ways to promote a non-flat $\hat{H}$ function, ultimately settling on (a) adding a term that encouraged the standard deviation of $\hat{H}$ values to be nonzero, and (b) minimizing not just the mean squared error in our $f_{1}$ and $f_{2}$ terms, but also the max squared error, to avoid trivial or bi-level $\hat{H}(q,p)$ functions. The spherical Gaussian prior used in training variational autoencoders [34] also avoids, as a byproduct, learning constant functions.
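The two remedies (a) and (b) can be sketched as a single scalar loss. The functional form of the flatness penalty (a reciprocal of the standard deviation), its weight `c_flat`, and the regularizer `eps` are assumptions made for illustration; the text only states that a nonzero standard deviation is encouraged and that the maximum squared error is minimized alongside the mean.

```python
import math

def std(values):
    """Population standard deviation of a list of floats."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def hamiltonian_fit_loss(residuals, h_values, c_flat=1e-3, eps=1e-8):
    """Mean plus max squared residual, plus a penalty that grows as H_hat flattens."""
    sq = [r * r for r in residuals]
    mean_sq = sum(sq) / len(sq)
    max_sq = max(sq)
    flat_penalty = c_flat / (std(h_values) + eps)  # assumed functional form
    return mean_sq + max_sq + flat_penalty

n = 100
# Constant H_hat: Hamilton residuals vanish, but the flatness penalty is huge.
loss_constant = hamiltonian_fit_loss([0.0] * n, [0.3] * n)
# Bi-level H_hat: residuals vanish except at the jump, caught by the max term.
loss_bilevel = hamiltonian_fit_loss([0.0] * (n - 1) + [1.0],
                                    [0.0] * (n // 2) + [1.0] * (n // 2))
# Smoothly varying H_hat with uniformly small residuals.
loss_smooth = hamiltonian_fit_loss([0.01] * n, [i / n for i in range(n)])

print(loss_constant, loss_bilevel, loss_smooth)
```

In this toy setting, the mean squared error alone would rank the bi-level function (mean 0.01) close to the smooth one; adding the maximum term separates them clearly.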

The result, shown in Fig. 5, was a pulled-back $\hat{H}(q,p)$ function that at least in broad strokes resembles the truth, and that satisfies $\mathrm{d}\hat{H}/\mathrm{d}t\approx 0$ (typically about $10^{-2}$).

V Conclusions

We described an approach to approximate Hamiltonian systems from observation data. It is a completely data-driven pipeline to (a) construct an appropriate phase space and (b) approximate a Hamiltonian function on the new phase space, from nonlinear, possibly high-dimensional observations (here, movies).

When only transformations of the original Hamiltonian phase space can be observed, it is only possible to recover a symplectic copy of the original phase space, with additional freedom in a constant scaling of the coordinates, and an additive constant to the Hamiltonian function. If no additional information about the original space is available, this is a fundamental limitation, and not particular to our approach. It is not necessary that there is an “original phase space” at all, and so the resulting symplectic phase space is by no means unique. The choice of a single such space has to rely on other factors, such as, possibly, interpretability by humans, or simplicity of the equations.

The approach may be extended to time-dependent Hamiltonian functions. This would allow us to cope with certain dissipative systems [16]. An even broader extension may allow transformations to arbitrary normal forms as the “target vector field”, and thus would not be constrained to Hamiltonian systems. In the general case, it will become important to explore whether the transformation we approximate remains bounded over our data, or whether it starts showing signs of approaching a singularity, suggesting that the problem may not be solvable.

Acknowledgements.

This work was funded by the US Army Research Office (ARO) through a Multidisciplinary University Research Initiative (MURI) and by the Defense Advanced Research Projects Agency (DARPA) through their Physics of Artificial Intelligence (PAI) program.

References

[1] A. M. Almeida. Hamiltonian systems: Chaos and quantization. Cambridge University Press, 1992.
[2] A. L. Caterini, A. Doucet, and D. Sejdinovic. Hamiltonian Variational Auto-Encoder. In Advances in Neural Information Processing Systems, volume 32, page 11, 2018.
[3] B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level Residual Networks from Dynamical Systems View. In Proceedings of the International Conference on Machine Learning, February 2018.
[4] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018.
[5] Weinan E. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[6] R. González-García, R. Rico-Martínez, and I. G. Kevrekidis. Identification of distributed parameter systems: A neural net based approach. Computers & Chemical Engineering, 22(98):S965–S968, 1998.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[8] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. In Proceedings of the International Conference on Learning Representations, 2017.
[9] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian Neural Networks. arXiv:1906.01563 [cs], June 2019.
[10] W. R. Hamilton. On a general method in dynamics. Philosophical Transactions of the Royal Society, II, pages 247–308, 1834.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 7, pages 171–180, 2016.
[12] E. Kaiser, J. N. Kutz, and S. L. Brunton. Discovering conservation laws from data for control. In 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018.
[13] M. Livio. Why symmetry matters. Nature, 490(7421):472–473, 2012.
[14] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. In Proceedings of the 2018 International Conference on Machine Learning, page 10, 2018.
[15] B. Lusch, J. N. Kutz, and S. L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950, November 2018.
[16] K. T. McDonald. Hamiltonian with z as the Independent Variable. Technical report.
[17] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi. Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3521–3529, Las Vegas, NV, USA, 2016. IEEE.
[18] R. Neal. MCMC Using Hamiltonian Dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, volume 20116022. Chapman and Hall/CRC, May 2011.
[19] E. Noether. Invariant variation problems. Transport Theory and Statistical Physics, 1(3):186–207, 1971.
[20] R. Kondor, Z. Lin, and S. Trivedi. Clebsch-Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10117–10126. Curran Associates, Inc., 2018.
[21] M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.
[22] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics, 335:736–746, 2017.
[23] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[24] D. J. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, page 9, 2015.
[25] R. Rico-Martínez and I. G. Kevrekidis. Noninvertibility in Neural Networks. In Proceedings of the 1993 IEEE International Conference on Neural Networks, pages 382–386, 1993.
[26] R. Rico-Martínez, R. A. Adomaitis, and I. G. Kevrekidis. Noninvertibility in neural networks. Computers & Chemical Engineering, 24(11):2417–2433, 2000.
[27] R. Rico-Martínez, J. S. Anderson, and I. G. Kevrekidis. Continuous-time nonlinear signal processing: a neural network based approach for gray box identification. In Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 1994.
[28] R. Rico-Martínez and I. G. Kevrekidis. Nonlinear system identification using neural networks: dynamics and instabilities. In A. B. Bulsari, editor, Neural Networks for Chemical Engineers, pages 409–442. Elsevier, 1995.
[29] R. Rico-Martínez, I. G. Kevrekidis, and R. A. Adomaitis. Noninvertible Dynamics in Neural Network Models. Proceedings of the Twenty-Eighth Annual Conference on Information Sciences and Systems, pages 965–969, 1994.
[30] R. Rico-Martínez, I. G. Kevrekidis, M. C. Kube, and J. L. Hudson. Discrete- vs continuous-time nonlinear signal processing attractors, transitions and parallel implementation issues. In American Control Conference, pages 1475–1479, 1993.
[31] R. Rico-Martínez, K. Krischer, I. G. Kevrekidis, M. C. Kube, and J. L. Hudson. Discrete- vs continuous-time nonlinear signal processing of Cu Electrodissolution Data. Chemical Engineering Communications, 118(1):25–48, 1992. ISBN: 0098644920.
[32] M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
[33] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway Networks. In Proceedings of the International Conference on Machine Learning, 2015.
[34] P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins. Hamiltonian Generative Networks. arXiv:1909.13789 [cs, stat], September 2019.
