Divergence of predictive model output as indication of phase transitions

Frank Sch\"afer; Niels L\"orch

arXiv:1812.00895·cond-mat.stat-mech·June 12, 2019

Divergence of predictive model output as indication of phase transitions

Frank Sch\"afer, Niels L\"orch

PDF

1 Repo

TL;DR

This paper proposes a novel method using divergence in predictive model outputs, like neural networks, to detect phase transitions in physical systems without prior phase information.

Contribution

It introduces a model-based divergence approach for identifying phase boundaries applicable to systems with arbitrary parameter dimensions.

Findings

01

Effective detection of phase transitions in Ising and Kuramoto-Hopf models.

02

Method does not require prior knowledge of phases.

03

Applicable to high-dimensional parameter spaces.

Abstract

We introduce a new method to identify phase boundaries in physical systems. It is based on training a predictive model such as a neural network to infer a physical system's parameters from its state. The deviation of the inferred parameters from the underlying correct parameters will be most susceptible and diverge maximally in the vicinity of phase boundaries. Therefore, peaks in the divergence of the model's predictions are used as indication of phase transitions. Our method is applicable for phase diagrams of arbitrary parameter dimension and without prior information about the phases. Application to both the two-dimensional Ising model and the dissipative Kuramoto-Hopf model show promising results.

Equations15

L = ⟨(δ p)^{2} ⟩,

L = ⟨(δ p)^{2} ⟩,

div (δ p) = n \sum \frac{\partial δ p _{n}}{\partial p _{n}}

div (δ p) = n \sum \frac{\partial δ p _{n}}{\partial p _{n}}

W (p_{pred} = x_{1} ∣ p_{0} = x_{2}) > W (p_{pred} = x_{3} ∣ p_{0} = x_{2})

W (p_{pred} = x_{1} ∣ p_{0} = x_{2}) > W (p_{pred} = x_{3} ∣ p_{0} = x_{2})

\Leftrightarrow ∣ y (x_{1}) - y (x_{2}) ∣ < ∣ y (x_{3}) - y (x_{2}) ∣.

H_{Ising}^{iso} = - J j = 1 \sum N k = 1 \sum N (σ_{j, k} σ_{j, k + 1} + σ_{j + 1, k} σ_{j, k}),

H_{Ising}^{iso} = - J j = 1 \sum N k = 1 \sum N (σ_{j, k} σ_{j, k + 1} + σ_{j + 1, k} σ_{j, k}),

H_{Ising}^{aniso} = - j = 1 \sum N k = 1 \sum N (J_{x} σ_{j, k} σ_{j, k + 1} + J_{y} σ_{j + 1, k} σ_{j, k}),

H_{Ising}^{aniso} = - j = 1 \sum N k = 1 \sum N (J_{x} σ_{j, k} σ_{j, k + 1} + J_{y} σ_{j + 1, k} σ_{j, k}),

\overset{φ}{˙}_{i}

\overset{φ}{˙}_{i}

+ V_{2} ⎩ ⎨ ⎧ ⟨ j, i ⟩ \sum ⟨ k, j ⟩ \sum [sin (2 φ_{j} - φ_{k} - φ_{i}) - sin (φ_{k} - φ_{i})]

+ ⟨ j, i ⟩ \sum ⟨ k, i ⟩ \sum sin (φ_{k} + φ_{j} - 2 φ_{i}) ⎭ ⎬ ⎫,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

frankschae/Vector-field-divergence-of-predictive-model-output-as-indication-of-phase-transitions
pytorch

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Divergence of predictive model output as indication of phase transitions

Frank Schäfer

Department of Physics, University of Basel, Klingelbergstrasse 82, CH-4056 Basel, Switzerland

Niels Lörch

[email protected]

Department of Physics, University of Basel, Klingelbergstrasse 82, CH-4056 Basel, Switzerland

(March 7, 2024)

Abstract

We introduce a new method to identify phase boundaries in physical systems. It is based on training a predictive model such as a neural network to infer a physical system’s parameters from its state. The deviation of the inferred parameters from the underlying correct parameters will be most susceptible and diverge maximally in the vicinity of phase boundaries. Therefore, peaks in the divergence of the model’s predictions are used as indication of phase transitions. Our method is applicable for phase diagrams of arbitrary parameter dimension and without prior information about the phases. Application to both the two-dimensional Ising model and the dissipative Kuramoto-Hopf model show promising results.

I Introduction

Recent developments in machine learning demonstrating the surprising predictive power of deep neural networks Schmidhuber (2015); LeCun et al. (2015); Goodfellow et al. (2016) have led to great renewed interest in the field and its potential to revolutionize both industry and academic disciplines. Breakthrough achievements responsible for this optimism include unprecedented accuracy in image classification Krizhevsky et al. (2012), super-human skills at strategy games like go and chess Silver et al. (2016), or even creation of art Goodfellow et al. (2014).

This technology is being transferred to physics and has been applied e.g. in high-energy physics to interpret of experimental data at the LHC Kasieczka et al. (2017) and in astronomy to recover features in images of galaxies Schawinski et al. (2017). In particular, in quantum and statistical mechanics, there has been great interest to integrate machine learning with quantum technology Schuld et al. (2015); Biamonte et al. (2017), with successful design and control of experiments Melnikov et al. (2018); Fösel et al. (2018); Bukov et al. (2018), optimization of numerical algorithms Huang and Wang (2017); Liu et al. (2017) and efficient representation of physical systems with Boltzmann machines Carleo and Troyer (2017); Koch-Janusz and Ringel (2018).

Feed-forward neural networks have been proven highly efficient in labeling different phases of matter Carrasquilla and Melko (2017), even without knowing the correct labels beforehand van Nieuwenburg et al. (2017), manifesting a kind of unsupervised learning for the discovery of phase transitions Wang (2016). Further interesting techniques for this purpose include methods building on features of other predictive models such as support vector machines Liu et al. (2018) or on unsupervised clustering algorithms Wang (2016); Wetzel (2017); Hu et al. (2017); Ch’ng et al. (2018), e.g. principal component analysis (PCA), t-SNE or k-means clustering.

The schemes presented in Refs. van Nieuwenburg et al. (2017, 2018) predict phase transitions by retraining a neural network on a system with tentative phase labels many times, to accept the particular partition where the network achieves highest labeling accuracy as the prediction for the correct separation of phases. Refs. Broecker et al. (2017); Liu and van Nieuwenburg (2018) generalize this approach to two-dimensional phase diagrams, where this procedure becomes more costly. Ref. Huembeli et al. (2018) can find phase boundaries between two different phases, circumventing the cost of network retraining, by leveraging adversarial domain adaption Ganin et al. (2016) to make use of prior knowledge of correct phase labels in some well-understood area of phase space.

Here, we introduce a new method that can naturally predict arbitrary-dimensional phase diagrams without any prior knowledge of phase labels. It is economical in computational resources, as it only requires one training procedure, where any suitable predictive model is taught to infer from the state of a physical system its system parameters. The deviation of the inferred parameters from the underlying correct parameters can then be used to predict phase transitions: As the model’s predictions will tend to be most susceptible to the change of system parameters in the vicinity of phase boundaries, this is where the divergence of predictions will peak and thereby indicate phase transitions. As the method does not require any prior knowledge of the labels or even number of different phases, our algorithm constitutes an unsupervised learning scheme, while employing a supervised subroutine learning the labeled system parameters.

We will demonstrate the method on two test cases. Firstly, to start with a well-understood system, we apply it to the Ising model on a two-dimensional lattice, with the potential variation of anisotropic coupling in horizontal and vertical direction. As the second test case we choose to study an example showing a dissipative non-equilibrium phase transition. As these are generally harder to describe theoretically, it is especially important to develop new methods for a better understanding. In particular, we consider the Kuramoto-Hopf model, an effective model that was recently established Lauter et al. (2015) to describe the effective dynamics of weakly coupled nonlinear self-oscillators on a two-dimensional lattice, as may be experimentally implemented with optomechanical systems.

The structure of this article is as follows: In the subsequent section, we introduce our divergence-based learning scheme for uncovering phase transitions and explain its mechanism in theory. We will then review the two physical systems used as test cases and apply the scheme to them. Finally we discuss the potential and limitations of the method and give an outlook for potential future developments.

II Rationale for the Learning scheme

Our method requires to uniformly sample instances of the state $S$ of a physical system as a function of continuous system parameters $p_{0}$ . For example, $S(p_{0})$ could be a spin configuration of the Ising model sampled as a function of temperature and coupling strengths as the parameters.

As illustrated in Fig. 1(a), the ansatz of our model is as follows: A predictive model $m:S\to p_{\mathrm{pred}}$ , for example a neural network, is trained to produce predictions $p_{\mathrm{pred}}$ of the system parameters that minimize the expectation value of the loss function

[TABLE]

where $\delta p=p_{\mathrm{pred}}-p_{0}$ denotes the difference between the predictions and the correct labels 111While other choices of $L$ are possible, here we chose the generic Euclidian norm (mean squared error) of the difference..

For sufficiently precise models, the divergence

[TABLE]

will then generally have local maxima at the parameter values, where the system state is most susceptible to a change of system parameters. This susceptibility suggests that the system undergoes a phase transition at those parameters. We note that, as $\mathrm{div}({p_{0}})$ is constant, we could equivalently use the maxima of $\mathrm{div}({p_{\mathrm{pred}}})=\mathrm{div}({\delta p})+\mathrm{div}({p_{0}})$ .

The working of this scheme can be intuitively understood by the following considerations. The best strategy to minimize the mean squared errors within a patch of parameter space that looks indistinguishable to the predictive model, is to place the prediction at the center of mass of that patch. Even if such a patch can be resolved by the model to some extent, there will still be a bias at the edges of the patch towards the center, as long as the model’s precision and confidence are not absolute. Therefore, two neighboring patches that are very well distinguishable from each other will lead to diverging predictions in the vicinity of the border separating these patches and therefore the maximum of the divergence described above. This argument is illustrated in Fig. 1(b) for the limiting case of low resolution, where the predictive model is incapable of resolving the parameters within a phase, but is able to discriminate between states of different phases.

In the opposite limit of a predictive model with extremely high resolution, we quantify the expected value of $\delta p$ after making a few idealized assumptions. While this quantification helps to better understand our scheme, one should also keep in mind that it describes a particular idealized situation.

We assume that the phase transitions of the considered system can be described with an order parameter $y(p)$ , which in the following will be a real number for simplicity. We further assume that the predictive model essentially estimates this order parameter $y(p)$ to reconstruct the physical system’s parameters, i.e. make the prediction $p_{\mathrm{pred}}$ . As illustrated in Fig. 1(c), let us consider three neighboring points $x_{1}$ , $x_{2}$ and $x_{3}$ on an equidistant grid in one parameter dimension. As the sampling in parameter space is uniform, the probability $W$ of each of these points is the same, i.e. $W(p_{0}=x_{1})=W(p_{0}=x_{2})=W(p_{0}=x_{3})$ . If the resolution of the model is on the order of the distance between these points, we neglect the probabilities of the model to err by more than one point on the given grid so that $\sum_{n=1}^{3}W(p_{\mathrm{pred}}=x_{n}|p_{0}=x_{2})=1$ for the probability conditioned on $p_{0}=x_{2}$ .

The probabilistic sampling of the finite dimensional physical system gives rise to randomness, furthermore the predictive model will typically have random errors. Therefore, the estimation $y_{E}(S)$ of the order parameter as inferred by the model for different samples at a parameter $p_{0}$ will generally vary probabilistically around the expectation value $\langle y(p_{0})\rangle$ . For a simplified description, we will assume the deviation $y_{E}-\langle y(p_{0})\rangle$ to be symmetric around zero with lower probability density for higher deviations and to follow independent identical distributions at all points 222The first condition corresponds to the reasonable assumption of an unbiased estimator. The second condition is necessary to establish a concrete analytical model but could be altered to develop a more general description..

Under these conditions the predictive model will bias its predictions in the direction where the order parameter varies less

[TABLE]

Obviously, this leads to diverging $\delta p$ at a first-order phase transition, where $y$ jumps discontinuously. Taking into account a finite system size or considering second-order phase transitions, the direction of $\delta p$ will depend on the sign of the second derivative of $y$ , which will generally change when going from one phase to another. Depending on the particular transition, this sign can also change a second time, leading to another peak in the divergence. Therefore further peaks in divergence can be expected and each peak is indicator, but not a proof of a phase transition.

These arguments show the potential of our method in idealized limiting cases of high and low precision. To evaluate how this translates to a non-ideal setting and thereby test the performance of our scheme, we apply it to two physical systems, which are introduced in the next section.

III Test cases

III.1 The Ising model

As a basic test case to demonstrate our method, we apply it to the well-understood Ising model Ising (1925). Perhaps its most popular version is the case of ferromagnetically coupled spins on a two-dimensional lattice without bias fields and isotropic coupling. It is described by the Hamiltonian

[TABLE]

where the $\sigma_{jk}$ describe spin configurations that can take on the values $\sigma_{jk}=\pm 1$ and $J>0$ denotes the coupling energy between neighboring spins. For temperatures above the critical value, the magnetization $M=\sum_{j,k}\sigma_{jk}/N^{2}$ vanishes in the thermodynamic limit $N\to\infty$ . At $T_{c}$ the system undergoes a phase transition, with a net magnetization building up, that reaches $|M|=1$ at $T=0$ .

We will also consider the natural generalization described by the Hamiltonian

[TABLE]

allowing for different coupling strength $J_{x}$ and $J_{y}$ in horizontal and vertical direction, as well as negative values for the coupling. In this anti-ferromagnetic case, it is energetically better to anti-align neighboring spins. The threshold for the absolute value of the coupling strength above which the ordered phase emerges, now depends on both $J_{x}$ and $J_{y}$ and was derived analytically Onsager (1944) to be $J_{y}/T=-\log(\tanh(J_{x}/T))/2$ for $J_{x},J_{y}>0$ , where we set the Boltzmann constant $k_{B}=1$ . The other three sectors, where the sign of at least one of the couplings is negative, behave analogously with adjusted order parameter $M=\sum_{j,k}\sigma_{jk}\mathrm{sign}(J_{x})\mathrm{sign}(J_{y})/N^{2}$ , measuring anti-alignment instead of alignment of spins, where appropriate. The different couplings and the corresponding ideal configurations at $T=0$ are illustrated in Fig. 2(a).

Numerically, we generate samples of the Ising model on a $512\times 512$ lattice using the Metropolis algorithm Metropolis et al. (1953): The lattice is initialized in a random configuration and then updated many times by drawing a random spin, which is then flipped with probability $\min\left(1,\mathrm{e}^{-\Delta E/T}\right)$ , where $\Delta E$ is the energy difference resulting from the considered flip. To ensure that the system is sufficiently thermalized we sweep the complete lattice $10^{5}$ times, where each point is updated once per sweep.

III.2 The Kuramoto-Hopf model

To try our method on a new, relatively unexplored system, we choose the Kuramoto-Hopf model Lauter et al. (2015) which describes the dynamics of weakly-coupled limit-cycle oscillators on a square lattice. It could be experimentally implemented e.g. with optomechanical self-oscillators, for which it was first derived Heinrich et al. (2011). In the situation of weak coupling and linear nearest-neighbor interactions, the Hopf equations for limit cycle oscillators with amplitude-dependent frequency give rise to the effective equations

[TABLE]

for the phases $\varphi_{i}$ of the oscillators. Equation (6) describes both nearest-neighbor couplings $\propto C,V_{1}$ and a more complex type of coupling including both nearest-neighbor and next-to-nearest couplings $\propto V_{2}$ . The structure of the coupling is illustrated in Fig. 2(b), where the indices appearing in Eq. (6) are colored according to their roles.

The interplay of the different terms leads to rich phase patterns. For example, at fixed $V_{2}=0$ , increasing the ratio $V_{1}/C$ interpolates from the Kuramoto-Sakaguchi model Sakaguchi and Kuramoto (1986) to the Kuramoto model Kuramoto (1975); Acebrón et al. (2005) at $V_{1}\gg C$ . In the present article, we focus on varying the next-to-nearest neighbor coupling parameter $V_{2}/C$ at constant $V_{1}/C=5$ . Thereby, we interpolate between a phase that has a well-defined continuum limit at $V_{2}=0$ , and a phase where the next-nearest-neighbor terms become important. There, $\pi$ -defect configurations, where the phase of a single oscillator within the lattice is opposite to its direct neighbors become stable Lauter et al. (2015). Examples of the phase patterns in the different phases are shown in Fig. 2(c).

Samples of the Kuramoto-Hopf model are obtained by propagating Eq. (6) until time $t=500/C$ with random initial states on a $128\times 128$ grid using the DifferentialEquations.jl package of julia Bezanson et al. (2017); Rackauckas and Nie (2017) with the CVode Backward Differentiation Formula (BDF). Within the Newton iteration for this implicit method, we use GMRES as the linear solver Hindmarsh et al. (2005).

IV learning scheme applied to the systems

While our method does not require a particular predictive model, in this work we use neural networks. Their basic architecture is of the form $m=\Pi_{n=1}^{L}K_{n}A_{n}$ for an $L$ -layer network, with linear functions $A_{n}$ and nonlinear functions $K_{n}$ of the features. As the couplings of both considered models act only locally, cf. Fig. 2, the resulting samples have a local structure just like ordinary images. Therefore we use convolutional neural networks, where the first few $A_{n}$ perform convolutions, and the later layers have all-to-all connectivity. The $K_{n}$ are rectifiers, i.e. $K_{n}$ applied to a vector $v$ changes the $j$ th entry according to $v^{j}\to\max(0,v^{j}+b^{j}_{n})$ , where $b$ is a bias term learned by the model to fit the data. For this fit, we use backpropagation with the Adam optimizer Kingma and Ba (2014) to train the model in a series of epochs, where the network sees each training image once per epoch. The learning rate of the optimizer is halved every five epochs. Our particular network consists of three convolutional layers followed by three fully-connected layers, with rectifiers activation functions in the hidden layers.

The samples for each data point are split in two sets of equal size: a training set for training the network and a test set, on which it is evaluated and the predictions are calculated. To obtain error bars, we repeated this process on several data sets. The average are collected in Fig. 3: To give an overview of the learning procedure, panel (a) shows the loss function $L$ from Eq. (1) for both Ising models and the Hopf-Kuramoto model as a function of passed epochs.

Panel (b) shows the network’s output after two epochs for the isotropic Ising model from Eq. (4). Samples as a function of $J/T$ are obtained on a linearly spaced grid of 80 points on the interval $[0,2]$ , where each point was sampled 440 times. A clear peak of the divergence (Eq. (2)) emerges close to the analytically expected value for the Ising phase transition at $J/T\approx 0.44$ . A closer inspection of the predictions $J_{\mathrm{pred}}(J)$ suggest that at this epoch, the network’s behavior is well described by the situation explained in Fig. 1(b), i.e. the network essentially recognizes whether the sample is in the ferromagnetic or in the paramagnetic phase, but cannot resolve well within the phases. Panel (c) shows the predictions after the final training epoch, where the network has learned to resolve the parameters even within a phase in the region $J<T_{c}$ . The higher resolution strongly decreases the size of the peak, but the maximum is still in the vicinity of the phase transition, as we would expect from the arguments illustrated by Fig. 1(c). Furthermore, a second peak arises at around $J=T$ , which is approximately where the magnetization (shown in the same plot) starts to saturate. After saturation the states again become indistinguishable to the model.

Panel (d) shows the output of the network after two training epochs for the Ising model from Eq. (5) with independent couplings $(J_{x},J_{y})$ sampled on a $40\times 40$ grid with each parameter ranging from $-1.5$ to $+1.5$ . The divergence peak is in good agreement with the Onsager result given below Eq. (5), which is represented by the yellow lines. The arrows show scaled deviations $\delta\vec{J}$ to indicate their direction. The predictions themselves are again clustered at the center of the different phases, similar to the situation expected from Fig. 1(b).

For the Kuramoto-Hopf model, we sampled a line in the phase diagram at $V_{1}/C=5$ and $V_{2}/C$ logarithmically varying between $V_{2}/C=0.1$ and $V_{2}/C=10$ . We discretized the line by 100 grid points and we computed 70 independent samples per grid point. The $\pi$ -defects mentioned earlier become stable at around $V_{2}/C\approx 1$ Lauter et al. (2015) manifesting a phase transition. Panels (e) and (f) show the output of the network after, respectively, 5 and 60 training epochs. At 5 epochs, there is a clear peak centered at approximately $V_{2}/C\approx 2$ , after 60 epochs it becomes broader and starts at slightly lower values. While the phase transition is captured, the exact value of the transition is less precise than in case of the Ising model. Inspecting the samples around $V_{2}/C\gtrsim 1$ , we find that only a few of them actually show defects due to the finite size of the lattice.

V Discussion and Outlook

For both of the systems tested, the learning scheme showed promising results and found the respective phase transitions without any prior knowledge. During the training procedure, as the resolution of the neural network increases, more and more structure in the phase diagram emerges. It is important to avoid over-fitting to the training data, as this may lead to spurious structures emerging in $\delta p$ related to the features of the particular training data. Obviously over-fitting can be avoided by obtaining more training data, but also by data augmentation, where suitable random transformations are applied each time a sample is revisited. In our case, we made use of the symmetries of the systems and applied a series of flips, rotations by multiples of 90 degrees, as well as random translations along both the $x$ and $y$ axis with periodic boundary conditions. Finally, the training should be stopped early enough, as inferred from comparing the difference between validation loss and training loss.

To achieve a higher resolution of the phase boundaries, it is helpful to have more training data. In practice the amount of data is limited by the availability of computational resources to generate them. Assuming therefore a fixed total number of samples, the number of samples per grid point scales as $1/\Delta^{M}$ with distance $\Delta$ between adjacent grid points of sampled system parameters on a $M$ -dimensional grid. As the divergence is numerically obtained via a difference quotient with $\Delta$ in the denominator, the divergence’s confidence interval scales in total as $1/\Delta^{M/2+1}$ . Furthermore, the divergence scales with the magnitude of $\delta p$ , so that a higher resolution of the predictive model reduces the signal. As the noise may not decrease accordingly and may even be dominated by issues such as the inherent stochasticity of the training procedure, a higher resolution for $p_{\mathrm{pred}}$ does not necessarily lead to a better resolution of the phase transition.

These considerations should be kept in mind when the method introduced here will be applied to different systems. Several possibilities arise for further development of the method in the future. For example, to address the trade-off in resolution of $p_{\mathrm{pred}}$ and the actual phase transition, one could add a term to the loss function which diverges at $p_{\mathrm{pred}}=0$ in order to prevent the predictive model from decreasing $p_{\mathrm{pred}}$ too much. In the limit where the network’s resolution is already very high, it may be advantageous to view the different grid points of the physical parameters as categories to be learned rather than parameters to be continuously estimated. The optimal configuration with respect to these changes may differ between different systems. It is also worth exploring other functions (besides the divergence) to quantify how the deviations $\delta\vec{p}$ point in different directions near phase boundaries. Finally, one could train a neural network to infer the system parameters of samples, but ignore the output layer, and rather use the last hidden layers as a representation of the data. This representation can then be used as an input for clustering algorithms.

As our scheme is easy to use and allows for the assignment of phase labels without any prior knowledge, it can be used as a starting points for other methods that do require some initial knowledge of the phase diagram’s structure, such as Huembeli et al. (2018). Furthermore, once the phase labels are determined, algorithms that visually interpret the decision making of neural networks Zhang and Zhu (2018) can be used to try and retrieve the relevant features distinguishing the different phases.

VI Conclusion

In summary, we have introduced the divergence-based learning scheme of phase transition and provided a theoretical description of the method as a guide to its application, which was successfully demonstrated on both an equilibrium system, the Ising model, and a non-equilibrium system, the Kuramoto-Hopf-model. As our method can be built upon any predictive model and only requires the calculation of the divergence, it is simple to use and implement with existing machine learning libraries. Because that predictive model needs to be trained only once on the training set, the method is economical in computational resources. It is applicable in the generic situation where no prior information of the phases is known and for phase diagrams of arbitrary parameter dimension. Due to these properties, we hope it will become a useful part of physicists’ toolbox to discover novel phase transitions.

VII Acknowledgments

We would like to thank C. Bruder and J. Lehmann for helpful discussions. This work was financially supported by the Swiss National Science Foundation (SNSF) and the NCCR Quantum Science and Technology. Calculations were performed at sciCORE (scicore.unibas.ch) scientific computing core facility at University of Basel.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Schmidhuber (2015) J. Schmidhuber, Neural networks 61 , 85 (2015) .
2Le Cun et al. (2015) Y. Le Cun, Y. Bengio, and G. Hinton, Nature 521 , 436 (2015).
3Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning , Vol. 1 (MIT press Cambridge, 2016).
4Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in neural information processing systems (2012) pp. 1097–1105.
5Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. , Nature 529 , 484 (2016) .
6Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in neural information processing systems (2014) pp. 2672–2680.
7Kasieczka et al. (2017) G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Journal of High Energy Physics 2017 , 6 (2017) .
8Schawinski et al. (2017) K. Schawinski, C. Zhang, H. Zhang, L. Fowler, and G. K. Santhanam, Monthly Notices of the Royal Astronomical Society: Letters 467 , L 110 (2017) .