TL;DR
This paper explores neural network methods for approximating conditional expectations in Bermudan option pricing, demonstrating convergence and numerical efficiency as an alternative to traditional regression techniques.
Contribution
It proves the convergence of the Longstaff and Schwartz algorithm when replacing regression with neural networks and shows their practical efficiency.
Findings
Neural networks can effectively approximate conditional expectations in high-dimensional settings.
The neural network approach converges under the same conditions as classical regression methods.
Numerical experiments confirm the efficiency of neural networks over traditional regression in option pricing.
Abstract
The pricing of Bermudan options amounts to solving a dynamic programming principle, in which the main difficulty, especially in high dimension, comes from the conditional expectation involved in the computation of the continuation value. These conditional expectations are classically computed by regression techniques on a finite dimensional vector space. In this work, we study neural networks approximations of conditional expectations. We prove the convergence of the well-known Longstaff and Schwartz algorithm when the standard least-square regression is replaced by a neural network approximation. We illustrate the numerical efficiency of neural networks as an alternative to standard regression methods for approximating conditional expectations on several numerical examples.
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 11.96 ( 0.07) | 11.97 ( 0.06) | 11.98 ( 0.057) |
| 2 | 128 | 11.96 ( 0.07) | 11.97 ( 0.056) | 11.97 ( 0.061) |
| 2 | 512 | 11.95 ( 0.076) | 11.95 ( 0.08) | 11.96 ( 0.071) |
| 4 | 32 | 11.93 ( 0.083) | 11.94 ( 0.09) | 11.96 ( 0.075) |
| 4 | 128 | 11.89 ( 0.145) | 11.93 ( 0.097) | 11.95 ( 0.081) |
| 4 | 512 | 11.86 ( 0.127) | 11.93 ( 0.096) | 11.94 ( 0.072) |
| 8 | 32 | 11.89 ( 0.12) | 11.93 ( 0.117) | 11.95 ( 0.096) |
| 8 | 128 | 11.88 ( 0.126) | 11.92 ( 0.11) | 11.94 ( 0.102) |
| 8 | 512 | 11.85 ( 0.129) | 11.9 ( 0.163) | 11.92 ( 0.111) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 4.55 ( 0.038) | 4.56 ( 0.041) | 4.56 ( 0.031) |
| 2 | 128 | 4.55 ( 0.032) | 4.56 ( 0.04) | 4.56 ( 0.038) |
| 2 | 512 | 4.54 ( 0.04) | 4.55 ( 0.033) | 4.55 ( 0.041) |
| 4 | 32 | 4.52 ( 0.044) | 4.54 ( 0.04) | 4.55 ( 0.036) |
| 4 | 128 | 4.52 ( 0.044) | 4.54 ( 0.033) | 4.55 ( 0.041) |
| 4 | 512 | 4.5 ( 0.046) | 4.54 ( 0.042) | 4.54 ( 0.045) |
| 8 | 32 | 4.52 ( 0.043) | 4.54 ( 0.049) | 4.55 ( 0.052) |
| 8 | 128 | 4.51 ( 0.046) | 4.53 ( 0.045) | 4.54 ( 0.045) |
| 8 | 512 | 4.47 ( 0.181) | 4.51 ( 0.051) | 4.52 ( 0.149) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 2.91 ( 0.027) | 2.92 ( 0.023) | 2.93 ( 0.019) |
| 2 | 128 | 2.91 ( 0.025) | 2.93 ( 0.021) | 2.94 ( 0.024) |
| 2 | 512 | 2.9 ( 0.025) | 2.93 ( 0.023) | 2.94 ( 0.027) |
| 4 | 32 | 2.9 ( 0.027) | 2.92 ( 0.029) | 2.94 ( 0.021) |
| 4 | 128 | 2.9 ( 0.033) | 2.92 ( 0.023) | 2.93 ( 0.027) |
| 4 | 512 | 2.89 ( 0.028) | 2.91 ( 0.033) | 2.93 ( 0.033) |
| 8 | 32 | 2.9 ( 0.02) | 2.92 ( 0.029) | 2.94 ( 0.024) |
| 8 | 128 | 2.9 ( 0.036) | 2.92 ( 0.026) | 2.94 ( 0.026) |
| 8 | 512 | 2.88 ( 0.042) | 2.91 ( 0.033) | 2.92 ( 0.034) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 128 | 2.52 ( 0.025) | 2.57 ( 0.019) | 2.61 ( 0.021) |
| 2 | 256 | 2.51 ( 0.027) | 2.57 ( 0.018) | 2.61 ( 0.017) |
| 2 | 512 | 2.5 ( 0.011) | 2.56 ( 0.021) | 2.61 ( 0.023) |
| 4 | 128 | 2.51 ( 0.03) | 2.59 ( 0.023) | 2.78 ( 0.045) |
| 4 | 256 | 2.51 ( 0.031) | 2.57 ( 0.018) | 2.75 ( 0.023) |
| 4 | 512 | 2.49 ( 0.02) | 2.55 ( 0.025) | 2.65 ( 0.035) |
| 8 | 128 | 2.51 ( 0.018) | 2.58 ( 0.022) | 2.76 ( 0.051) |
| 8 | 256 | 2.51 ( 0.026) | 2.57 ( 0.021) | 2.75 ( 0.038) |
| 8 | 512 | 2.46 ( 0.135) | 2.56 ( 0.021) | 2.65 ( 0.056) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 128 | 2.5 ( 0.008) | 2.52 ( 0.005) | 2.52 ( 0.005) |
| 2 | 256 | 2.5 ( 0.012) | 2.52 ( 0.005) | 2.52 ( 0.01) |
| 2 | 512 | 2.49 ( 0.015) | 2.51 ( 0.007) | 2.52 ( 0.009) |
| 4 | 128 | 2.5 ( 0.006) | 2.52 ( 0.006) | 2.53 ( 0.003) |
| 4 | 256 | 2.5 ( 0.009) | 2.51 ( 0.007) | 2.52 ( 0.005) |
| 4 | 512 | 2.49 ( 0.008) | 2.51 ( 0.011) | 2.52 ( 0.014) |
| 8 | 128 | 2.5 ( 0.007) | 2.53 ( 0.011) | 2.54 ( 0.008) |
| 8 | 256 | 2.49 ( 0.012) | 2.52 ( 0.011) | 2.53 ( 0.007) |
| 8 | 512 | 2.46 ( 0.154) | 2.49 ( 0.053) | 2.51 ( 0.015) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 3.08 ( 0.023) | 3.09 ( 0.023) | 3.1 ( 0.028) |
| 2 | 128 | 3.08 ( 0.024) | 3.09 ( 0.024) | 3.1 ( 0.027) |
| 2 | 512 | 3.08 ( 0.032) | 3.09 ( 0.023) | 3.09 ( 0.03) |
| 4 | 32 | 3.07 ( 0.032) | 3.09 ( 0.031) | 3.1 ( 0.027) |
| 4 | 128 | 3.07 ( 0.03) | 3.09 ( 0.027) | 3.09 ( 0.027) |
| 4 | 512 | 3.06 ( 0.038) | 3.08 ( 0.031) | 3.09 ( 0.03) |
| 8 | 32 | 3.07 ( 0.032) | 3.09 ( 0.028) | 3.09 ( 0.033) |
| 8 | 128 | 3.06 ( 0.035) | 3.08 ( 0.026) | 3.1 ( 0.027) |
| 8 | 512 | 3.06 ( 0.053) | 3.07 ( 0.053) | 3.08 ( 0.038) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 2.15 ( 0.018) | 2.19 ( 0.019) | 2.21 ( 0.02) |
| 2 | 128 | 2.16 ( 0.016) | 2.21 ( 0.015) | 2.25 ( 0.021) |
| 2 | 512 | 2.15 ( 0.017) | 2.21 ( 0.014) | 2.26 ( 0.017) |
| 4 | 32 | 2.16 ( 0.018) | 2.21 ( 0.015) | 2.26 ( 0.017) |
| 4 | 128 | 2.16 ( 0.021) | 2.24 ( 0.024) | 2.43 ( 0.026) |
| 4 | 512 | 2.15 ( 0.018) | 2.2 ( 0.025) | 2.31 ( 0.026) |
| 8 | 32 | 2.17 ( 0.028) | 2.21 ( 0.02) | 2.28 ( 0.023) |
| 8 | 128 | 2.16 ( 0.026) | 2.24 ( 0.025) | 2.41 ( 0.032) |
| 8 | 512 | 2.14 ( 0.064) | 2.19 ( 0.031) | 2.29 ( 0.044) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 2.16 ( 0.008) | 2.17 ( 0.008) | 2.18 ( 0.009) |
| 2 | 128 | 2.16 ( 0.009) | 2.17 ( 0.008) | 2.17 ( 0.007) |
| 2 | 512 | 2.15 ( 0.01) | 2.17 ( 0.007) | 2.17 ( 0.005) |
| 4 | 32 | 2.17 ( 0.008) | 2.17 ( 0.008) | 2.18 ( 0.007) |
| 4 | 128 | 2.16 ( 0.012) | 2.17 ( 0.008) | 2.18 ( 0.007) |
| 4 | 512 | 2.15 ( 0.014) | 2.16 ( 0.01) | 2.16 ( 0.01) |
| 8 | 32 | 2.16 ( 0.011) | 2.18 ( 0.009) | 2.18 ( 0.006) |
| 8 | 128 | 2.16 ( 0.015) | 2.17 ( 0.007) | 2.18 ( 0.007) |
| 8 | 512 | 2.14 ( 0.022) | 2.15 ( 0.056) | 2.16 ( 0.015) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 25.97 ( 0.117) | 25.95 ( 0.141) | 25.94 ( 0.133) |
| 2 | 128 | 25.95 ( 0.11) | 25.95 ( 0.126) | 26.02 ( 0.113) |
| 2 | 512 | 25.92 ( 0.104) | 25.96 ( 0.116) | 26.01 ( 0.153) |
| 4 | 32 | 25.83 ( 0.132) | 25.97 ( 0.146) | 26.02 ( 0.139) |
| 4 | 128 | 25.76 ( 0.203) | 25.91 ( 0.162) | 25.99 ( 0.162) |
| 4 | 512 | 25.63 ( 0.238) | 25.85 ( 0.181) | 25.94 ( 0.146) |
| 8 | 32 | 25.72 ( 0.185) | 25.91 ( 0.134) | 25.96 ( 0.169) |
| 8 | 128 | 25.61 ( 0.251) | 25.84 ( 0.186) | 25.93 ( 0.143) |
| 8 | 512 | 25.49 ( 0.265) | 25.76 ( 0.223) | 25.83 ( 0.2) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 128 | 68.99 ( 0.179) | 69.26 ( 0.164) | 69.42 ( 0.169) |
| 2 | 256 | 69.07 ( 0.149) | 69.42 ( 0.125) | 69.45 ( 0.138) |
| 2 | 512 | 69.11 ( 0.194) | 69.43 ( 0.18) | 69.51 ( 0.167) |
| 4 | 128 | 68.91 ( 0.365) | 69.29 ( 0.334) | 69.55 ( 0.339) |
| 4 | 256 | 68.72 ( 0.358) | 69.24 ( 0.341) | 69.5 ( 0.369) |
| 4 | 512 | 68.54 ( 0.548) | 69.17 ( 0.356) | 69.34 ( 0.359) |
| 8 | 128 | 68.59 ( 0.613) | 69.32 ( 0.348) | 69.71 ( 0.497) |
| 8 | 256 | 68.57 ( 0.797) | 69.25 ( 0.564) | 69.4 ( 0.484) |
| 8 | 512 | 68.32 ( 1.444) | 69.01 ( 0.738) | 69.49 ( 0.487) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 128 | 68.85 ( 0.074) | 68.96 ( 0.095) | 69.01 ( 0.119) |
| 2 | 256 | 68.87 ( 0.1) | 69.0 ( 0.143) | 69.07 ( 0.146) |
| 2 | 512 | 68.82 ( 0.082) | 69.05 ( 0.128) | 69.19 ( 0.136) |
| 4 | 128 | 68.84 ( 0.221) | 69.28 ( 0.153) | 69.41 ( 0.211) |
| 4 | 256 | 68.75 ( 0.342) | 69.14 ( 0.296) | 69.38 ( 0.342) |
| 4 | 512 | 68.7 ( 0.426) | 69.05 ( 0.317) | 69.35 ( 0.254) |
| 8 | 128 | 68.81 ( 0.277) | 69.28 ( 0.291) | 69.64 ( 0.22) |
| 8 | 256 | 68.57 ( 0.512) | 69.34 ( 0.378) | 69.65 ( 0.414) |
| Layers | Neurons per layer | epochs=1 | epochs=5 | epochs=10 |
|---|---|---|---|---|
| 2 | 32 | 1.69 ( 0.017) | 1.7 ( 0.017) | 1.7 ( 0.016) |
| 2 | 128 | 1.69 ( 0.017) | 1.7 ( 0.019) | 1.7 ( 0.019) |
| 2 | 512 | 1.69 ( 0.019) | 1.69 ( 0.019) | 1.69 ( 0.018) |
| 4 | 32 | 1.69 ( 0.022) | 1.69 ( 0.017) | 1.7 ( 0.018) |
| 4 | 128 | 1.69 ( 0.024) | 1.69 ( 0.02) | 1.7 ( 0.016) |
| 4 | 512 | 1.68 ( 0.025) | 1.69 ( 0.022) | 1.69 ( 0.022) |
| 8 | 32 | 1.69 ( 0.023) | 1.69 ( 0.02) | 1.69 ( 0.019) |
| 8 | 128 | 1.68 ( 0.03) | 1.69 ( 0.022) | 1.69 ( 0.02) |
| 8 | 512 | 1.68 ( 0.03) | 1.68 ( 0.041) | 1.68 ( 0.053) |
Neural network regression for Bermudan option pricing
Bernard Lapeyre Université Paris-Est, Cermics (ENPC), INRIA, F-77455 Marne-la-Vallée, France
email: [email protected]
Jérôme Lelong Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
email: [email protected]
Abstract
The pricing of Bermudan options amounts to solving a dynamic programming principle, in which the main difficulty, especially in high dimension, comes from the conditional expectation involved in the computation of the continuation value. These conditional expectations are classically computed by regression techniques on a finite dimensional vector space. In this work, we study neural networks approximations of conditional expectations. We prove the convergence of the well-known Longstaff and Schwartz algorithm when the standard least-square regression is replaced by a neural network approximation, assuming an efficient algorithm to compute this approximation. We illustrate the numerical efficiency of neural networks as an alternative to standard regression methods for approximating conditional expectations on several numerical examples.
Key words: Bermudan options, optimal stopping, regression methods, deep learning, neural networks.
1 Introduction
Solving the backward recursion involved in the computation of American option prices has been a challenging problem for years, and various approaches have been proposed to approximate its solution. The real difficulty lies in the computation of the conditional expectation at each time step of the recursion. If we were to classify the different approaches, we could say that there are regression-based approaches (see Tilley (1993); Carriere (1996); Tsitsiklis and Roy (2001); Broadie and Glasserman (2004)) and quantization approaches (see Bally and Pagès (2003); Bronstein et al. (2013)). We refer to Bouchard and Warin (2012) and Pagès (2018) for an in-depth survey of the different techniques to price Bermudan options.
Among all these available algorithms to compute American option prices using the dynamic programming principle, the one proposed by Longstaff and Schwartz (2001) has the favour of many practitioners. Their approach is based on iteratively selecting an optimal policy. Here, we propose and analyse a version of this algorithm which uses neural networks to compute an approximation of the conditional expectation and then to obtain an optimal exercising policy.
The use of neural networks for the computation of American option prices is not new, but we are aware of no work specifically devoted to their use for LS-style algorithms (LS for Longstaff and Schwartz (2001)). In Haugh and Kogan (2004), the authors used neural networks in numerical experiments to price American options through the dynamic programming equation on the value function. This led them to a Tsitsiklis and Roy (2001)-type algorithm, which is different from the LS-type algorithm studied in this paper, which involves only the optimal stopping policy. Kohler et al. (2010) used neural networks to price American options, but they also used the dynamic programming equation on the value function. Moreover, they used new samples of the whole path of the underlying process at each time step to prove the convergence. In our approach, we use a neural network inspired modification of the original Longstaff-Schwartz algorithm: we draw a single set of sample paths before starting and we use these very same samples at each time step. This avoids a very costly resimulation at each time step and thereby greatly improves the efficiency of our approach. Deep learning was also used in the context of optimal stopping by Becker et al. (2019a, b) to parametrize the optimal policy.
Now, we describe the framework of our study. We fix some finite time horizon $T > 0$ and a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{0 \le t \le T}, \mathbb{P})$ modeling a financial market with $\mathcal{F}_0$ being the trivial $\sigma$-algebra. We assume that the short interest rate is modeled by an adapted process $(r_t)_{0 \le t \le T}$ and that $\mathbb{P}$ is an associated risk neutral measure. We consider a Bermudan option with exercising dates $0 = T_0 \le T_1 < T_2 < \dots < T_N = T$ and discounted payoff $Z_{T_k}$ if exercised at time $T_k$. For convenience, we add $0$ and $T$ to the exercising dates. This is definitely not a requirement of the method we propose here, but it makes the notation lighter and avoids dealing with the purely European part involved in the Bermudan option. We assume that the discrete time discounted payoff process $(Z_{T_k})_{0 \le k \le N}$ is adapted to the filtration $(\mathcal{F}_{T_k})_{0 \le k \le N}$ and that $\mathbb{E}\left[\max_{0 \le k \le N} |Z_{T_k}|^2\right] < \infty$.
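In a complete market, if $\mathbb{E}$ denotes the expectation under the risk neutral probability, standard arbitrage pricing arguments allow us to define the discounted value $(U_{T_n})_{0 \le n \le N}$ of the Bermudan option at times $(T_n)_{0 \le n \le N}$ by
$$U_{T_n} = \sup_{\tau \in \mathcal{T}_{T_n}} \mathbb{E}[Z_\tau \mid \mathcal{F}_{T_n}], \tag{1}$$
where $\mathcal{T}_{T_n}$ denotes the set of stopping times taking values in $\{T_n, \dots, T_N\}$.
Using the Snell envelope theory, the sequence $(U_{T_n})_{0 \le n \le N}$ can be proved to be given by the following dynamic programming equation
$$U_{T_N} = Z_{T_N}, \qquad U_{T_n} = \max\left(Z_{T_n}, \mathbb{E}[U_{T_{n+1}} \mid \mathcal{F}_{T_n}]\right) \quad \text{for } 0 \le n \le N - 1. \tag{2}$$
This equation can be rewritten in terms of the optimal policy. Let $\tau_n$ be the smallest optimal policy after time $T_n$, that is, the smallest stopping time reaching the supremum in (1); then
$$\tau_N = T_N, \qquad \tau_n = T_n \mathbf{1}_{\{Z_{T_n} \ge \mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}]\}} + \tau_{n+1} \mathbf{1}_{\{Z_{T_n} < \mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}]\}} \quad \text{for } 0 \le n \le N - 1. \tag{3}$$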
All these methods based on the dynamic programming principle, either as value iteration (2) or policy iteration (3), require a Markovian setting to be implemented, so that the conditional expectation knowing the whole past can be replaced by the conditional expectation knowing only the value of a Markov process at the current time. We assume that the discounted payoff process writes $Z_{T_n} = \phi_n(X_{T_n})$, for any $0 \le n \le N$, where $(X_t)_{0 \le t \le T}$ is an adapted Markov process taking values in $\mathbb{R}^r$. Hence, the conditional expectation involved in (3) simplifies into $\mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}] = \mathbb{E}[Z_{\tau_{n+1}} \mid X_{T_n}]$ and can therefore be approximated by a standard least square method.
Note that this setting covers most standard financial models. For local volatility models, the process $X$ is typically defined as $X_t = (r_t, S_t)$, where $S_t$ is the price of an asset and $r_t$ the instantaneous interest rate (only $S_t$ when the interest rate is deterministic). In the case of stochastic volatility models, $X$ also includes the volatility process $\sigma$, $X_t = (r_t, S_t, \sigma_t)$. Some path dependent options can also fit in this framework at the expense of increasing the size of the process $X$. For instance, in the case of an Asian option whose payoff depends on the running integral $A_t = \int_0^t S_u \, du$, one can define $X$ as $X_t = (r_t, S_t, A_t)$, and then the Asian option can be considered as a vanilla option on the two dimensional but non tradable assets $(S, A)$.
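The state augmentation used for Asian options can be sketched in a few lines. The helper below is only an illustration: the function name, the trapezoidal discretisation of the running integral and the toy inputs are assumptions, not taken from the paper.

```python
import numpy as np

def augment_with_running_integral(times, prices):
    """Return states (S_t, A_t), with A_t approximating int_0^t S_u du (trapezoidal rule)."""
    times = np.asarray(times, dtype=float)
    prices = np.asarray(prices, dtype=float)          # shape (n_paths, n_times)
    trapezoids = 0.5 * (prices[..., 1:] + prices[..., :-1]) * np.diff(times)
    integral = np.concatenate(
        [np.zeros_like(prices[..., :1]), np.cumsum(trapezoids, axis=-1)], axis=-1)
    return np.stack([prices, integral], axis=-1)      # shape (n_paths, n_times, 2)

# Two toy paths observed at times 0, 0.5 and 1 (illustrative values only).
states = augment_with_running_integral([0.0, 0.5, 1.0],
                                       [[2.0, 2.0, 2.0], [1.0, 2.0, 3.0]])
print(states[1])  # each row is the pair (S_t, A_t) at one observation date
```

The pair (price, running integral) is Markovian even though the price alone does not determine an Asian payoff, which is why the regression is run on the augmented state.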
Once the Markov process $X$ is identified, the conditional expectations can be written
$$\mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}] = \mathbb{E}[Z_{\tau_{n+1}} \mid X_{T_n}] = \psi_n(X_{T_n}) \tag{4}$$
where $\psi_n$ solves the following minimization problem
$$\inf_{\psi \in L^2(\mathcal{L}(X_{T_n}))} \mathbb{E}\left[\left|Z_{\tau_{n+1}} - \psi(X_{T_n})\right|^2\right]$$
with $L^2(\mathcal{L}(X_{T_n}))$ being the set of all measurable functions $f$ such that $\mathbb{E}[|f(X_{T_n})|^2] < \infty$. The real challenge comes from properly approximating the space $L^2(\mathcal{L}(X_{T_n}))$ by a finite dimensional space: one typically uses polynomials or local bases (see Gobet et al. (2005); Bouchard and Warin (2012)), and in any case it always boils down to a linear regression. In this work, we use neural networks to approximate $\psi_n$ in (4). The main difference between neural networks and the regression approaches commonly used comes from the non linearity of neural networks, which is also their strength. Note that the set of neural networks with a fixed number of layers and neurons is obviously not a vector space and not even convex. Through neural networks, this paper investigates the effects of using non linear approximations of conditional expectations in the Longstaff Schwartz algorithm.
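To make the classical regression approach concrete, here is a minimal sketch of a conditional expectation estimated by least squares on a polynomial basis. The toy model (a Gaussian input and a quadratic regression function), the sample size and the function name are assumptions chosen for illustration.

```python
import numpy as np

def fit_polynomial_regression(x, y, degree):
    """Least-squares fit of y on the monomials 1, x, ..., x**degree."""
    basis = np.vander(x, degree + 1, increasing=True)   # columns x**0, ..., x**degree
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)  # linear regression on the basis
    return coeffs

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
y = x**2 + rng.normal(size=x.size)       # toy model in which E[Y | X] = X**2
coeffs = fit_polynomial_regression(x, y, degree=2)
print(coeffs)                            # close to [0, 0, 1], up to Monte Carlo error
```

Because the basis is fixed, the fit is a linear problem with a unique projection; replacing the basis by a neural network removes this linear structure.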
The paper is organized as follows. In Section 2, we start with some preliminaries on neural networks and recall the universal approximation theorem. Then, in Section 3, we describe our algorithm, whose convergence is studied in Section 4. Finally, we present some numerical results in Section 5.
2 Preliminaries on deep neural networks
Deep neural networks (DNN) aim at approximating (complex non linear) functions defined on finite-dimensional spaces and, in contrast with the usual additive approximation theory built via basis functions such as polynomials, they rely on the composition of layers of simple functions. The relevance of neural networks comes from the universal approximation theorem and the Kolmogorov-Arnold representation theorem (see Arnold (2009); Kolmogorov (1956); Cybenko (1989); Hornik (1991); Pinkus (1999)), and they have proved successful in numerous practical applications.
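We consider the feed forward neural network (also called multilayer perceptron) for the approximation of the continuation value at each time step. From a mathematical point of view, we can model a DNN by a non linear function
$$x \in \mathbb{R}^r \longmapsto \Phi(x; \theta) \in \mathbb{R}$$
where $\Phi$ typically writes as function compositions. Let $L \ge 2$ be an integer, we write
$$\Phi = A_L \circ \sigma_a \circ A_{L-1} \circ \dots \circ \sigma_a \circ A_1 \tag{5}$$
where for $\ell = 1, \dots, L$, $A_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$ are affine functions
$$A_\ell(x) = W_\ell x + \beta_\ell \quad \text{for } x \in \mathbb{R}^{d_{\ell-1}},$$
with $W_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ and $\beta_\ell \in \mathbb{R}^{d_\ell}$. In our setting, we have $d_0 = r$ and $d_L = 1$. The function $\sigma_a$ is often called the activation function and is applied component wise. The number of rows of the matrix $W_\ell$ is usually interpreted as the number of neurons of layer $\ell$. For the sake of simpler notation, we embed all the parameters of the different layers in a unique high dimensional parameter $\theta = (W_\ell, \beta_\ell)_{\ell = 1, \dots, L} \in \mathbb{R}^{N_d}$ with $N_d = \sum_{\ell=1}^{L} d_\ell (1 + d_{\ell-1})$.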
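The layered structure described above (affine maps alternating with a component-wise activation, and a final affine map with scalar output) can be sketched with NumPy. The layer sizes, the tanh activation and the initialisation are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """One (W, beta) pair per affine map; W has shape (d_out, d_in)."""
    return [(rng.normal(size=(d_out, d_in)) / np.sqrt(d_in), np.zeros(d_out))
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x, activation=np.tanh):
    """Evaluate the network on a batch x of shape (n, d_in)."""
    h = x
    for W, beta in params[:-1]:
        h = activation(h @ W.T + beta)   # hidden layer: affine map, then activation
    W, beta = params[-1]
    return h @ W.T + beta                # output layer: affine map only

rng = np.random.default_rng(0)
params = init_params([3, 16, 16, 1], rng)      # 3 inputs, two hidden layers, scalar output
out = forward(params, rng.normal(size=(5, 3)))
n_params = sum(W.size + beta.size for W, beta in params)
print(out.shape, n_params)                     # (5, 1) 353
```

Counting one weight per input plus one bias for every neuron gives 16·4 + 16·17 + 1·17 = 353 parameters for this architecture.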
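Let $L \ge 2$ be fixed in the following; we introduce the set $\mathcal{NN}_\infty$ of all DNN of the above form. Now, we need to restrict the maximum number of neurons per layer. Let $p \in \mathbb{N}$, $p > 1$; we denote by $\mathcal{NN}_p$ the set of neural networks with at most $p$ neurons per hidden layer, $L - 1$ hidden layers and bounded parameters. More precisely, we pick an increasing sequence of positive real numbers $(\gamma_p)_p$ such that $\lim_{p \to \infty} \gamma_p = \infty$. We introduce the set
$$\Theta_p = \left\{\theta \in \mathbb{R} \times \mathbb{R}^p \times (\mathbb{R}^p \times \mathbb{R}^{p \times p})^{L-2} \times \mathbb{R}^p \times \mathbb{R}^{p \times r} : |\theta| \le \gamma_p\right\}.$$
Then, $\mathcal{NN}_p$ is defined by
$$\mathcal{NN}_p = \left\{\Phi(\cdot; \theta) : \theta \in \Theta_p\right\}$$
and we have $\mathcal{NN}_\infty = \bigcup_{p} \mathcal{NN}_p$. An element of $\mathcal{NN}_p$ will be denoted by $\Phi_p(\cdot; \theta)$ with $\theta \in \Theta_p$. Note that the space $\mathcal{NN}_p$ is not a vector space, nor a convex set, and therefore finding the element of $\mathcal{NN}_p$ that best approximates a given function cannot be simply interpreted as an orthogonal projection.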
The use of DNN as function approximations is justified by the fundamental results of Hornik (1991) (see also Pinkus (1999) for related results).
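**Theorem 2.1** (Universal Approximation Theorem).
*Assume that the activation function $\sigma_a$ is non constant and bounded. Let $\mu$ denote a probability measure on $\mathbb{R}^r$; then for any $L \ge 2$, the set $\mathcal{NN}_\infty$ of all DNN of the above form is dense in $L^2(\mathbb{R}^r, \mu)$.*
**Theorem 2.2** (Universal Approximation Theorem).
*Assume that the activation function $\sigma_a$ is a non constant, bounded and continuous function; then, when $L = 2$, $\mathcal{NN}_\infty$ is dense in $C(\mathbb{R}^r)$ for the topology of the uniform convergence on compact sets.*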
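**Remark 2.3.**
We can rephrase Theorem 2.1 in terms of approximating random variables. Let $Y$ be a real valued random variable defined on $(\Omega, \mathcal{A}, \mathbb{P})$ s.t. $\mathbb{E}[Y^2] < \infty$. Let $X$ be another random variable defined on $(\Omega, \mathcal{A}, \mathbb{P})$ taking values in $\mathbb{R}^r$ and $\mathcal{G} \subset \mathcal{A}$ the smallest $\sigma$-algebra such that $X$ is $\mathcal{G}$ measurable. Then, there exists a sequence $(\theta_p)_p$, with $\theta_p \in \Theta_p$, such that $\lim_{p \to \infty} \mathbb{E}\left[|\mathbb{E}[Y \mid \mathcal{G}] - \Phi_p(X; \theta_p)|^2\right] = 0$. Therefore, if for every $p$, $\alpha_p$ solves
$$\inf_{\theta \in \Theta_p} \mathbb{E}\left[|\Phi_p(X; \theta) - Y|^2\right],$$
then the sequence $(\Phi_p(X; \alpha_p))_p$ converges to $\mathbb{E}[Y \mid \mathcal{G}]$ in $L^2(\Omega)$ when $p \to \infty$. Note that as long as the activation function is bounded, $\mathbb{E}[|\Phi_p(X; \theta)|^2] < \infty$ for every $\theta \in \Theta_p$.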
3 The algorithm
3.1 Description of the algorithm
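We recall the dynamic programming principle on the optimal policy
$$\tau_N = T_N, \qquad \tau_n = T_n \mathbf{1}_{\{Z_{T_n} \ge \mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}]\}} + \tau_{n+1} \mathbf{1}_{\{Z_{T_n} < \mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}]\}} \quad \text{for } 1 \le n \le N - 1.$$
Then, the time-$0$ price of the Bermudan option writes
$$U_0 = \max\left(Z_0, \mathbb{E}[Z_{\tau_1}]\right).$$
In order to solve this dynamic programming equation, we need to compute a conditional expectation at each time step. The idea proposed by Longstaff and Schwartz (2001) was to approximate these conditional expectations by a regression problem on a well chosen set of functions. In this work, we use a DNN to perform this approximation:
$$\tau^p_N = T_N, \qquad \tau^p_n = T_n \mathbf{1}_{\{Z_{T_n} \ge \Phi_p(X_{T_n}; \theta^p_n)\}} + \tau^p_{n+1} \mathbf{1}_{\{Z_{T_n} < \Phi_p(X_{T_n}; \theta^p_n)\}} \quad \text{for } 1 \le n \le N - 1,$$
where $\theta^p_n$ solves the following optimization problem
$$\inf_{\theta \in \Theta_p} \mathbb{E}\left[\left|\Phi_p(X_{T_n}; \theta) - Z_{\tau^p_{n+1}}\right|^2\right]. \tag{8}$$
Since the conditional expectation operator is an orthogonal projection, we have
$$\mathbb{E}\left[\left|\Phi_p(X_{T_n}; \theta) - Z_{\tau^p_{n+1}}\right|^2\right] = \mathbb{E}\left[\left|\Phi_p(X_{T_n}; \theta) - \mathbb{E}[Z_{\tau^p_{n+1}} \mid \mathcal{F}_{T_n}]\right|^2\right] + \mathbb{E}\left[\left|Z_{\tau^p_{n+1}} - \mathbb{E}[Z_{\tau^p_{n+1}} \mid \mathcal{F}_{T_n}]\right|^2\right].$$
Therefore, any minimizer in (8) is also a solution to the following minimization problem
$$\inf_{\theta \in \Theta_p} \mathbb{E}\left[\left|\Phi_p(X_{T_n}; \theta) - \mathbb{E}[Z_{\tau^p_{n+1}} \mid \mathcal{F}_{T_n}]\right|^2\right]. \tag{9}$$
The standard approach is to sample a bunch of paths of the model $X^{(m)}_{T_0}, X^{(m)}_{T_1}, \dots, X^{(m)}_{T_N}$ along with the corresponding payoff paths $Z^{(m)}_{T_0}, Z^{(m)}_{T_1}, \dots, Z^{(m)}_{T_N}$, for $m = 1, \dots, M$. To compute the $\tau_n$'s on each path, one needs to compute the conditional expectations $\mathbb{E}[Z_{\tau_{n+1}} \mid \mathcal{F}_{T_n}]$ for $n = 1, \dots, N - 1$. Then, we introduce the final approximation of the backward iteration policy, in which the truncated expansion is computed using a Monte Carlo approximation
$$\hat\tau^{p,(m)}_N = T_N, \qquad \hat\tau^{p,(m)}_n = T_n \mathbf{1}_{\{Z^{(m)}_{T_n} \ge \Phi_p(X^{(m)}_{T_n}; \hat\theta^{p,M}_n)\}} + \hat\tau^{p,(m)}_{n+1} \mathbf{1}_{\{Z^{(m)}_{T_n} < \Phi_p(X^{(m)}_{T_n}; \hat\theta^{p,M}_n)\}} \quad \text{for } 1 \le n \le N - 1,$$
where $\hat\theta^{p,M}_n$ solves the sample average approximation of (8)
$$\inf_{\theta \in \Theta_p} \frac{1}{M} \sum_{m=1}^{M} \left|\Phi_p(X^{(m)}_{T_n}; \theta) - Z^{(m)}_{\hat\tau^{p,(m)}_{n+1}}\right|^2. \tag{10}$$
Then, we finally approximate the time-$0$ price of the option by
$$U^{p,M}_0 = \max\left(Z_0, \frac{1}{M} \sum_{m=1}^{M} Z^{(m)}_{\hat\tau^{p,(m)}_1}\right). \tag{11}$$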
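The backward policy iteration of this section can be sketched as follows for a one-dimensional Bermudan put. This is a minimal sketch under stated assumptions: the Black Scholes parameters are arbitrary, and a cubic polynomial fit (`polynomial_fit`, a hypothetical helper) stands in for the neural network so that the example only needs NumPy; any regressor with the same `fit(x, y) -> predictor` shape could be plugged in instead.

```python
import numpy as np

def polynomial_fit(x, y, degree=3):
    """Stand-in regressor: returns a function approximating E[y | x]."""
    coeffs = np.polynomial.polynomial.polyfit(x, y, degree)
    return lambda x_new: np.polynomial.polynomial.polyval(x_new, coeffs)

def bermudan_put_price(s0, strike, r, sigma, maturity, n_dates, n_paths,
                       fit=polynomial_fit, seed=0):
    rng = np.random.default_rng(seed)
    dt = maturity / n_dates
    # All paths are simulated once and reused at every exercise date.
    log_incr = ((r - 0.5 * sigma**2) * dt
                + sigma * np.sqrt(dt) * rng.normal(size=(n_paths, n_dates)))
    s = s0 * np.exp(np.cumsum(log_incr, axis=1))       # s[:, k] is the price at date k + 1
    discount = np.exp(-r * dt * np.arange(1, n_dates + 1))
    payoff = discount * np.maximum(strike - s, 0.0)    # discounted payoffs
    stopped = payoff[:, -1].copy()                     # payoff at the last date
    for k in range(n_dates - 2, -1, -1):               # backward over earlier dates
        continuation = fit(s[:, k], stopped)(s[:, k])  # regression of future stopped payoff
        # Exercise when the payoff beats the estimated continuation value
        # (the positivity guard avoids "exercising" a worthless option).
        exercise = (payoff[:, k] > 0.0) & (payoff[:, k] >= continuation)
        stopped = np.where(exercise, payoff[:, k], stopped)
    return max(max(strike - s0, 0.0), float(stopped.mean()))

price = bermudan_put_price(s0=100.0, strike=100.0, r=0.05, sigma=0.2,
                           maturity=1.0, n_dates=10, n_paths=50_000)
print(price)
```

Note that the regression is only used to decide when to stop; the price is the average of the realised stopped payoffs, which is the distinguishing feature of the policy-based recursion compared with value iteration.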
**Remark 3.1.**
Note that to implement the previous algorithm, we need to compute a minimizer for the optimization problem (10). Obviously, this is not an easy task as this is a high-dimensional, non-convex and non-smooth problem.
It is usually solved in practice using toolboxes such as Scikit-Learn or TensorFlow, by means of a stochastic gradient descent method, for which a full convergence proof under realistic assumptions is, to our knowledge, still unknown. See Bottou et al. (2018) or E et al. (2020) for recent in-depth reviews of these subjects and Ghadimi and Lan (2013), Lei et al. (2019), Fehrman et al. (2020) for results for a non convex function.
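To show what such a stochastic gradient method does on a least-squares problem, here is a small NumPy sketch with hand-written gradients. It is a deliberately simplified stand-in: plain mini-batch SGD rather than the optimisers shipped with the toolboxes mentioned above, one hidden layer, a tanh activation, and a toy target; every hyperparameter is an assumption.

```python
import numpy as np

def train_one_hidden_layer(x, y, n_hidden=16, epochs=20, batch_size=64, lr=0.05, seed=0):
    """Mini-batch SGD on the squared error for y ~ tanh(x W1 + b1) . w2 + b2."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W1 = rng.normal(size=(d, n_hidden)) / np.sqrt(d)
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(size=n_hidden) / np.sqrt(n_hidden)
    b2 = 0.0

    def predict(inputs):
        return np.tanh(inputs @ W1 + b1) @ w2 + b2

    losses = []
    for _ in range(epochs):
        order = rng.permutation(n)                       # one pass over shuffled data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            h = np.tanh(xb @ W1 + b1)
            err = h @ w2 + b2 - yb                       # residuals on the batch
            grad_h = np.outer(err, w2) * (1.0 - h**2)    # backpropagation through tanh
            W1 -= lr * xb.T @ grad_h / len(idx)
            b1 -= lr * grad_h.mean(axis=0)
            w2 -= lr * h.T @ err / len(idx)
            b2 -= lr * err.mean()
        losses.append(float(np.mean((predict(x) - y) ** 2)))  # training error per epoch
    return predict, losses

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=(2000, 1))
y = x[:, 0] ** 2                                         # toy regression target
predict, losses = train_one_hidden_layer(x, y)
print(losses[0], losses[-1])
```

Nothing here guarantees convergence to a global minimizer; the sketch only illustrates the iteration whose theoretical analysis the remark refers to.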
4 Convergence of the algorithm
We start this section on the study of the convergence by introducing some bespoke notation following Clément et al. (2002).
4.1 Notation
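First, it is important to note that the paths $\hat\tau^{p,(m)}_1, \dots, \hat\tau^{p,(m)}_N$ for $m = 1, \dots, M$ are identically distributed but not independent, since the computations of $\hat\theta^{p,M}_n$ at each time step mix all the paths. We define the vector $\vartheta^p = (\theta^p_1, \dots, \theta^p_{N-1})$ of the coefficients of the successive expansions and its Monte Carlo counterpart $\hat\vartheta^{p,M} = (\hat\theta^{p,M}_1, \dots, \hat\theta^{p,M}_{N-1})$.
Now, we recall the notation used by Clément et al. (2002) to study the convergence of the original Longstaff Schwartz approach.
Given a deterministic parameter $t^p = (t^p_1, \dots, t^p_{N-1})$ in $\Theta_p^{N-1}$ and deterministic vectors $z = (z_1, \dots, z_N)$ in $\mathbb{R}^N$ and $x = (x_1, \dots, x_N)$ in $(\mathbb{R}^r)^N$, we define the vector field $F = (F_1, \dots, F_N)$ by
$$F_N(t^p, z, x) = z_N, \qquad F_n(t^p, z, x) = z_n \mathbf{1}_{\{z_n \ge \Phi_p(x_n; t^p_n)\}} + F_{n+1}(t^p, z, x) \mathbf{1}_{\{z_n < \Phi_p(x_n; t^p_n)\}} \quad \text{for } 1 \le n \le N - 1.$$
Note that $F_n(t^p, z, x)$ does not depend on the first $n - 1$ components of $t^p$, i.e., $F_n$ depends only on $t^p_n, \dots, t^p_{N-1}$. Moreover,
$$F_n(\vartheta^p, Z, X) = Z_{\tau^p_n}, \qquad F_n(\hat\vartheta^{p,M}, Z^{(m)}, X^{(m)}) = Z^{(m)}_{\hat\tau^{p,(m)}_n}.$$
Moreover, we clearly have that for all $t^p \in \Theta_p^{N-1}$
$$|F_n(t^p, z, x)| \le \max_{k \ge n} |z_k|. \tag{12}$$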
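The vector field can be read as a backward recursion along a single path. The sketch below assumes the usual construction of Clément et al. (2002): the last component is the terminal payoff, and each earlier component keeps the current payoff when it exceeds the network output and otherwise copies the next component. The network outputs along the path are passed in directly as `thresholds`; all names and numbers are illustrative.

```python
def stopped_payoffs(thresholds, z):
    """Backward recursion: last value is z[-1]; earlier values are z[n] if
    z[n] >= thresholds[n], and otherwise the value computed for n + 1."""
    values = [0.0] * len(z)
    values[-1] = z[-1]
    for n in range(len(z) - 2, -1, -1):
        values[n] = z[n] if z[n] >= thresholds[n] else values[n + 1]
    return values

z = [1.0, 3.0, 2.0, 5.0]   # payoffs along one path (toy values)
c = [2.0, 2.5, 4.0]        # continuation estimates at the first three dates
print(stopped_payoffs(c, z))  # [3.0, 3.0, 5.0, 5.0]
```

Each component is one of the later payoffs on the path, which is why it is bounded by the largest absolute payoff from that date onwards.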
4.2 Deep neural network approximations of conditional expectations
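**Proposition 4.1.**
*Assume that $\mathbb{E}\left[\max_{0 \le n \le N} |Z_{T_n}|^2\right] < \infty$. Then, $\lim_{p \to \infty} \mathbb{E}[Z_{\tau^p_n} \mid \mathcal{F}_{T_n}] = \mathbb{E}[Z_{\tau_n} \mid \mathcal{F}_{T_n}]$ in $L^2(\Omega)$ for all $1 \le n \le N$.*
**Remark 4.2.**
Note that in the proof of Proposition 4.1, there is no need for the sets $\Theta_p$ to be compact for every $p$. We could have chosen unbounded parameter sets. However, the boundedness assumption will be required in the following section, so, to work with the same approximations over the whole paper, we have decided to impose compactness on $\Theta_p$ for every $p$.
*Proof.*
We proceed by induction. The result is true for $n = N$ as $\tau_N = \tau^p_N = T_N$. Assume it holds for $n + 1$ (with $n \le N - 1$); we will prove it is true for $n$. For this, using both recursion equations, we have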
[TABLE]
Now, defining as
[TABLE]
we obtain
[TABLE]
By the induction assumption, the term goes to zero in as goes to infinity. So, we just have to prove that converges to zero in when . For this, note that
[TABLE]
So we obtain
[TABLE]
Moreover, as the conditional expectation is an orthogonal projection, we clearly have that
[TABLE]
Then, the induction assumption for yields that the second term on the r.h.s of (14) goes to zero in when .
To deal with the first term on the r.h.s of (14), we introduce for any , defined as a minimiser to
[TABLE]
Note that is the best approximation on of the true continuation value at time . As solves (9), we clearly have that
[TABLE]
Using the induction assumption for , the second term on the r.h.s of (Proof) goes to zero in and from the universal approximation theorem (see Theorem 2.2 and Remark (2.3)), we deduce that
[TABLE]
Then, we conclude that .
The next proposition shows that if we have an estimate of the speed of convergence of the network approximation for a suitable class of functions, we are able to derive the speed of convergence for the Bermudan option price.
**Proposition 4.3.**
Assume that for every , there exists a sequence of positive real numbers such that
[TABLE]
Then,
[TABLE]
*Proof.*
We use the same notation as in the proof of Proposition 4.1. We proceed by backward induction.
First note that and then, using (13), . Moreover, from (14), we get that
[TABLE]
Using again that , we deduce that
[TABLE]
Therefore,
[TABLE]
Assume the result holds true for (with , we will prove it is true for .
[TABLE]
where we have used (15). Then, using the induction assumption, we get
[TABLE]
[TABLE]
where the last inequality comes from (Proof).
From the induction assumption, the second term is bounded by . From (18), the first term is bounded by . Then, we conclude that when
[TABLE]
4.3 Convergence of the Monte Carlo approximation
In the following, we assume that is fixed and we study the convergence with respect to the number of samples . First, we recall some important results on the convergence of the solution of a sequence of optimization problems whose cost functions converge.
4.3.1 Convergence of optimization problems
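Consider a sequence of real valued functions $(f_n)_n$ defined on a compact set $K \subset \mathbb{R}^d$. Define
$$v_n = \inf_{x \in K} f_n(x)$$
and let $(x_n)_n$ be a sequence of minimizers
$$f_n(x_n) = \inf_{x \in K} f_n(x).$$
From (Rubinstein and Shapiro, 1993, Chap. 2), we have the following result.
**Lemma 4.4.**
*Assume that the sequence $(f_n)_n$ converges uniformly on $K$ to a continuous function $f$. Let $v^\star = \inf_{x \in K} f(x)$ and $S^\star = \{x \in K : f(x) = v^\star\}$. Then $v_n \to v^\star$ and $d(x_n, S^\star) \to 0$ a.s.*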
In the following, we will also make heavy use of the following result, which is a restatement of the law of large numbers in Banach spaces, see (Ledoux and Talagrand, 1991, Corollary 7.10, page 189) or (Rubinstein and Shapiro, 1993, Lemma A1).
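**Lemma 4.5.**
Let $(\xi_i)_{i \ge 1}$ be a sequence of i.i.d. $\mathbb{R}^m$-valued random vectors and $h : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ be a measurable function. Assume that
- a.s., $\theta \in \mathbb{R}^d \mapsto h(\theta, \xi_1)$ is continuous,
- for all $C > 0$, $\mathbb{E}\left[\sup_{|\theta| \le C} |h(\theta, \xi_1)|\right] < \infty$.

Then, a.s. $\theta \mapsto \frac{1}{n} \sum_{i=1}^{n} h(\theta, \xi_i)$ converges locally uniformly to the continuous function $\theta \mapsto \mathbb{E}[h(\theta, \xi_1)]$, i.e.
$$\lim_{n \to \infty} \sup_{|\theta| \le C} \left|\frac{1}{n} \sum_{i=1}^{n} h(\theta, \xi_i) - \mathbb{E}[h(\theta, \xi_1)]\right| = 0 \quad \text{a.s. for all } C > 0.$$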
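The two lemmas can be illustrated numerically on a toy problem: a sample average cost converges uniformly on a compact set to its expectation, and its minimizer converges to the minimizer of the limit. The quadratic cost, the Gaussian samples and the grid are assumptions chosen so that the limit is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(-3.0, 3.0, 301)        # grid on the compact set [-3, 3]
limit = 1.0 + (theta_grid - 0.5) ** 2           # E[(xi - theta)^2] when xi ~ N(0.5, 1)

results = []
for n in (100, 5_000):
    xi = rng.normal(loc=0.5, scale=1.0, size=n)
    # Sample average cost evaluated at every grid point.
    f_n = np.mean((xi[None, :] - theta_grid[:, None]) ** 2, axis=1)
    sup_gap = float(np.max(np.abs(f_n - limit)))    # uniform distance on the grid
    minimizer = float(theta_grid[np.argmin(f_n)])   # minimizer of the sample average
    results.append((n, sup_gap, minimizer))
    print(n, round(sup_gap, 3), minimizer)
```

In this convex toy case the limiting minimizer is unique (0.5); for neural networks the set of minimizers is typically not a singleton, which is why the lemma is stated in terms of the distance to a set.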
4.3.2 Strong law of large numbers
To prove a strong law of large numbers, we will need the following assumptions.
- (-1)
For every , , there exist and s.t.
[TABLE]
Moreover, for all , a.s. the random functions are continuous. Note that as is a compact set, the continuity automatically yields the uniform continuity.
- (-2)
For defined in (-1), for all .
- (-3)
For all , and all , .
We introduce the notation
[TABLE]
Note that is a non void compact set.
- (-4)
For every , and every , for all ,
[TABLE]
**Remark 4.6.**
Assumption (-1) is clearly satisfied for the classical activation functions ReLU $x \mapsto \max(x, 0)$, sigmoid $x \mapsto 1/(1 + e^{-x})$ and $x \mapsto \tanh(x)$. When the law of $X_{T_n}$ has a density with respect to the Lebesgue measure, the continuity assumption stated in (-1) is even satisfied by the binary step activation function $x \mapsto \mathbf{1}_{\{x \ge 0\}}$.
**Remark 4.7.**
Considering the natural symmetries existing in a neural network, it is clear that the set will hardly ever be reduced to a singleton. So, none of the parameters or is unique. Here, we only require the function described by the neural network approximation to be unique but not its representation, which is much weaker and more realistic in practice. We refer to Albertini et al. (1993); Albertini and Sontag (1994) for characterization of symmetries of neural networks and to Williamson and Helmke (1995) for results on existence and uniqueness of an optimal neural network approximation (but not its parameters).
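One of the symmetries mentioned in this remark is easy to check numerically: relabelling the hidden neurons changes the parameter vector but not the function. The network size, the tanh activation and the permutation below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # hidden layer: 4 neurons, 2 inputs
w2, b2 = rng.normal(size=4), rng.normal()              # scalar output layer

def network(x, W1, b1, w2, b2):
    return np.tanh(x @ W1.T + b1) @ w2 + b2

perm = np.array([2, 0, 3, 1])                          # relabel the hidden neurons
x = rng.normal(size=(10, 2))
same_function = np.allclose(network(x, W1, b1, w2, b2),
                            network(x, W1[perm], b1[perm], w2[perm], b2))
different_parameters = not np.allclose(W1, W1[perm])
print(same_function, different_parameters)             # True True
```

This is why uniqueness can only reasonably be required of the function represented by the network, not of its parameters.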
To start, we prove the convergence of the neural network approximation of the conditional expectation at each time step.
**Proposition 4.8.**
*Assume that Assumptions (-1)-(-4) hold. Let be a minimiser of optimization problem (19) and be a minimiser of the sample average optimization problem (10), then, for every , converges to a.s. as . *
**Lemma 4.9.**
For every ,
[TABLE]
*Proof of Proposition 4.8.*
We proceed by induction.
Step 1. For , solves
[TABLE]
We aim at applying Lemma 4.5 to the sequence of i.i.d. random functions . From Assumptions (-1) and (-2), we deduce that
[TABLE]
Then, Lemma 4.5 implies that a.s. the function
[TABLE]
converges uniformly to . Hence, we deduce from Lemma 4.4 that a.s. when . We restrict to a subset with probability one of the original probability space on which this convergence holds and the random functions are uniformly continuous, see (-1). There exists a sequence taking values in such that , when . The uniform continuity of the random functions yields that
[TABLE]
Then, we conclude from Assumption (-4), that
Step 2. Choose and assume that the convergence result holds for , we aim at proving this is true for . We recall that solves
[TABLE]
We introduce the two random functions for
[TABLE]
The function clearly writes as the sum of i.i.d. random variables. Moreover, by combining (12) and Assumptions (-1) and (-2), we obtain
[TABLE]
Then, the sequence of random functions a.s. converges uniformly to the continuous function defined for by
[TABLE]
It remains to prove that a.s. when .
[TABLE]
where we have used (12) and Assumptions (-1) and (-2). Then from Lemma 4.9, we can write
[TABLE]
where is a generic constant only depending on , and .
Let . Using the induction assumption and the strong law of large numbers, we have
[TABLE]
From (-3), we deduce that a.s. and we conclude that a.s. converges to zero uniformly. As we have already proved that a.s. converges uniformly to the continuous function , we deduce that a.s. converges uniformly to . From Lemma 4.4, we conclude that a.s. when . We restrict to a subset with probability one of the original probability space on which this convergence holds and the random functions are uniformly continuous, see (-1). There exists a sequence taking values in such that when . The uniform continuity of the random functions yields that
[TABLE]
Then, we conclude from Assumption (-4), that when .
Now that the convergence of the expansion is established, we can study the convergence of to when .
**Theorem 4.10.**
Assume that Assumptions (-1)-(-4) hold. Then, for and every ,
[TABLE]
*Proof.*
Note that and by the strong law of large numbers
[TABLE]
Hence, we have to prove that
[TABLE]
For any , and , . Using Lemma 4.9 and that , we have
[TABLE]
Using Proposition 4.8, for all , when . Then for any ,
[TABLE]
where the last inequality follows from the strong law of large numbers as . We conclude that by letting go to [math] and by using (-3).
The case proves the strong law of large numbers for the algorithm. Note that solving the minimisation problem (10) mixes all stopped paths , it is unlikely that the estimators for are unbiased. We recall that and . Then,
[TABLE]
where we have used that all the random variables have the same distribution.
5 Numerical experiments
In this section, we compare the results given by the standard Longstaff Schwartz approach with polynomial regression to the algorithm described in Section 3. The only difference between the two methods lies in the way the conditional expectation is approximated at each time step. The two algorithms are implemented in Python using the PolynomialFeatures toolbox of scikit-learn (Pedregosa et al. (2011)) for the polynomial regression and the TensorFlow toolbox (Abadi et al. (2015)) to compute the neural network approximation. We have chosen options for which there is a substantial gap between the European and Bermudan prices, which means that there is indeed an early exercise strategy and that the accuracy of the conditional expectation approximations plays a major role.
Details on the algorithm used in the experiments
In all the experiments, we have run our algorithm repeatedly to compute the average price along with the half-width of the confidence interval for the price estimator, reported in the tables between parentheses. Although the confidence interval is informative to know how much we can trust a price, it completely hides the bias related to the approximation of the conditional expectations. Remember that the estimator given by (11) is not an unbiased estimator, and one should therefore be very careful when comparing the results. Keep in mind that a higher price does not always mean a better price.
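A summary of repeated runs of this kind can be computed as below. The text here does not specify the confidence level or whether the interval refers to a single run of the estimator or to the average over runs, so the sketch returns both conventions; the 1.96 quantile (a 95% level under a normal approximation) and the sample prices are assumptions.

```python
import math
import statistics

def summarize_runs(prices, z_quantile=1.96):
    """Average price plus two half-widths built from independent runs."""
    n = len(prices)
    mean = statistics.fmean(prices)
    std = statistics.stdev(prices)                  # sample standard deviation across runs
    half_width_single_run = z_quantile * std        # spread of one run of the estimator
    half_width_of_mean = half_width_single_run / math.sqrt(n)  # uncertainty on the average
    return mean, half_width_single_run, half_width_of_mean

runs = [11.9, 12.0, 12.1, 12.0]                     # made-up prices, not taken from the tables
mean, hw_run, hw_mean = summarize_runs(runs)
print(round(mean, 3), round(hw_run, 3), round(hw_mean, 3))
```

Either half-width only measures the dispersion across runs; it says nothing about the bias coming from the regression step.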
For the activation function in (5), we have used the leaky ReLU function defined by
[TABLE]
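For reference, a leaky ReLU can be written in a couple of lines. The slope on the negative side, called `alpha` here, is a free parameter; the value 0.3 is a placeholder assumption for this sketch and not necessarily the one used in the definition above.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: identity for x >= 0, slope `alpha` for x < 0.

    alpha=0.3 is a placeholder assumption for this sketch.
    """
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x, alpha * x)

print(leaky_relu([-2.0, 0.0, 1.5]))
```

Unlike the plain ReLU, the function keeps a non-zero gradient for negative inputs, so no neuron becomes permanently inactive during training.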
We relied on the ADAM algorithm to fit the neural network at each time step and the columns labelled epochs refer to the number of times we go through the entire data set to train the network. Note that using epochs corresponds to the standard approach used in online stochastic approximation, in which each data point is used only once. We use the same neural network through all the time steps and in particular at a time step , we take the optimal parameter at time , , as the starting point of the training algorithm. Because of this warm start, there is actually no use in setting epochs for . We observed in our numerical experiments that passing over all the data several times does not reduce the training error at times , whereas it does help when fitting the first neural network at time . This allows for huge computational time savings.
For learning the continuation value at each exercise date, we only use the in-the-money paths, as already suggested in the original Longstaff Schwartz algorithm (Longstaff and Schwartz, 2001). This means that the definition of the optimization problem (8) has to be changed into
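The warm-start idea can be illustrated on a toy problem. The sketch below is not the implementation used in the experiments: it replaces TensorFlow by a hand-written one-hidden-layer network in NumPy, replaces ADAM by plain mini-batch gradient descent, and uses two hypothetical target functions as stand-ins for the continuation values at two consecutive dates. All names (`TinyNet`, `ALPHA`, `mse`) and all numerical values are assumptions made for the illustration.

```python
import numpy as np

ALPHA = 0.3  # leaky ReLU slope, a placeholder assumption

class TinyNet:
    """One hidden layer with leaky ReLU activation and a scalar output."""

    def __init__(self, dim_in, width, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(dim_in), (dim_in, width))
        self.b1 = np.zeros(width)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(width), (width, 1))
        self.b2 = np.zeros(1)

    def _hidden(self, x):
        z = x @ self.W1 + self.b1
        return z, np.where(z >= 0.0, z, ALPHA * z)

    def predict(self, x):
        _, h = self._hidden(x)
        return (h @ self.W2 + self.b2).ravel()

    def fit(self, x, y, epochs, lr=0.05, batch=64, seed=0):
        """Mini-batch gradient descent on the mean squared error.

        Training starts from the current weights, so calling fit again on
        new targets is a warm start.
        """
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            order = rng.permutation(len(x))
            for start in range(0, len(x), batch):
                idx = order[start:start + batch]
                xb, yb = x[idx], y[idx]
                z, h = self._hidden(xb)
                err = (h @ self.W2 + self.b2).ravel() - yb
                g_out = (2.0 / len(idx)) * err[:, None]   # d(loss)/d(prediction)
                g_z = (g_out @ self.W2.T) * np.where(z >= 0.0, 1.0, ALPHA)
                self.W2 -= lr * (h.T @ g_out)
                self.b2 -= lr * g_out.sum(axis=0)
                self.W1 -= lr * (xb.T @ g_z)
                self.b1 -= lr * g_z.sum(axis=0)
        return self

def mse(net, x, y):
    return float(np.mean((net.predict(x) - y) ** 2))

# Toy targets standing in for continuation values at two consecutive dates.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, (4000, 1))
y_last = np.maximum(0.10 - x[:, 0], 0.0)   # date fitted first
y_prev = np.maximum(0.05 - x[:, 0], 0.0)   # next date in the backward induction

warm = TinyNet(1, 16, seed=2).fit(x, y_last, epochs=20)  # several passes for the first fit
warm.fit(x, y_prev, epochs=1)                            # one pass, warm start
cold = TinyNet(1, 16, seed=2).fit(x, y_prev, epochs=1)   # one pass from scratch
warm_mse, cold_mse = mse(warm, x, y_prev), mse(cold, x, y_prev)
print(f"warm start: {warm_mse:.5f}, cold start: {cold_mse:.5f}")
```

Because the two targets are close, the network fitted at the first date is already a good initial guess for the next one, which is the mechanism the paragraph above relies on.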
[TABLE]
The empirical counterpart (10) needs to be adapted in a similar way. Note that this does not change the theoretical analysis of the algorithm but it is numerically more efficient. We apply the same restriction to the original Longstaff Schwartz algorithm against which we compare in the next sections.
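A minimal sketch of the benchmark algorithm, with the regression restricted to in-the-money paths, may help fix ideas. It prices a one-dimensional Bermudan put under a Black Scholes model with constant coefficients, uses `numpy.polyfit` instead of the scikit-learn pipeline mentioned above, and allows exercise at time 0; the function name, the parameter values and these modelling choices are assumptions made for the illustration.

```python
import numpy as np

def longstaff_schwartz_put(s0, strike, r, sigma, maturity, n_dates, n_paths,
                           degree=3, seed=0):
    """Bermudan put price by Longstaff-Schwartz with polynomial regression.

    The regression at each date uses in-the-money paths only.
    """
    rng = np.random.default_rng(seed)
    dt = maturity / n_dates
    # Geometric Brownian motion sampled on the exercise dates.
    log_incr = ((r - 0.5 * sigma**2) * dt
                + sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_dates)))
    paths = s0 * np.exp(np.cumsum(log_incr, axis=1))   # column k is date k + 1
    disc = np.exp(-r * dt)

    # Cash flow of each path under the current stopping rule.
    cash = np.maximum(strike - paths[:, -1], 0.0)
    for k in range(n_dates - 2, -1, -1):
        cash *= disc                                    # discount back one date
        s = paths[:, k]
        exercise = np.maximum(strike - s, 0.0)
        itm = exercise > 0.0
        if itm.sum() > degree:
            # Regress discounted future cash flows on the (scaled) asset price.
            coeffs = np.polyfit(s[itm] / strike, cash[itm], degree)
            continuation = np.polyval(coeffs, s[itm] / strike)
            stop = np.flatnonzero(itm)[exercise[itm] > continuation]
            cash[stop] = exercise[stop]
    price = disc * cash.mean()
    return max(price, max(strike - s0, 0.0))            # exercise at time 0

# Hypothetical parameters, not those of the experiments.
print(longstaff_schwartz_put(100.0, 100.0, 0.05, 0.2, 1.0, 10, 20000))
```

Replacing the two `polyfit`/`polyval` lines by the training and evaluation of a neural network is the only change needed to move from this benchmark to the approach studied here.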
5.1 Examples in the Black Scholes model
The dimensional Black Scholes model reads, for
[TABLE]
where is a Brownian motion with values in , is the vector of volatilities, assumed to be deterministic and positive at all times and is the -th row of the matrix defined as a square root of the correlation matrix , given by
[TABLE]
where to ensure that is positive definite.
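The simulation of such a model on the grid of exercise dates can be sketched as below. The sketch assumes a constant interest rate, constant volatilities, no dividends and a correlation matrix with ones on the diagonal and a single value `rho` elsewhere; these simplifications, the function name and the parameter values are assumptions made for the illustration. The Cholesky factor plays the role of the square root of the correlation matrix.

```python
import numpy as np

def simulate_black_scholes(s0, r, sigma, rho, maturity, n_dates, n_paths, seed=0):
    """Simulate correlated Black-Scholes paths on an equally spaced grid.

    Returns an array of shape (n_paths, n_dates + 1, d), date 0 included.
    """
    s0 = np.asarray(s0, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = s0.size
    corr = np.full((d, d), float(rho))
    np.fill_diagonal(corr, 1.0)
    chol = np.linalg.cholesky(corr)      # fails if corr is not positive definite
    rng = np.random.default_rng(seed)
    dt = maturity / n_dates
    # Independent Gaussians turned into correlated ones: row g becomes chol @ g.
    z = rng.standard_normal((n_paths, n_dates, d)) @ chol.T
    log_incr = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.concatenate(
        [np.zeros((n_paths, 1, d)), np.cumsum(log_incr, axis=1)], axis=1)
    return s0 * np.exp(log_paths)

# Hypothetical parameters: three assets, ten dates.
paths = simulate_black_scholes([100.0] * 3, 0.05, [0.2] * 3, 0.3, 1.0, 10, 20000)
print(paths.shape)
```

`numpy.linalg.cholesky` raises a `LinAlgError` when the matrix is not positive definite, which gives a direct check of the condition on the correlation parameter.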
5.1.1 Benchmarking the method on the one-dimensional put option
Before investigating more elaborate numerical examples, we want to test our method on the one dimensional put option. As standard as this example might be, getting a trustworthy reference price is not an easy task. For this example, we compare our approach to the benchmark price computed by a convolution method in Lord et al. (2008) and later used as a reference price in Fang and Oosterlee (2009). Their reference price is where all the digits are accurate.
We can see from Table 1 that using a really small neural network with only one input layer with intermediate neurons and one output layer (meaning that the activation function is applied only once) already yields very good results with a relative accuracy greater than . Increasing the number of epochs helps correct the bias created by the truncated approximation of the conditional expectations. The larger the neural network (see in particular the cases ), the more epochs we need to ensure that the fitting procedure has converged well enough to make the most of the network's capacity to approximate the conditional expectations accurately. Note that increasing the size of the network also increases the overall variance of the algorithm, as happens with a polynomial regression when the size of the regression basis increases (see Glasserman and Yu (2004) for details).
5.1.2 A geometric basket option in the Black Scholes model
Benchmarking a new method on high dimensional products is hardly feasible as almost no high dimensional Bermudan options can be priced accurately in a reasonable time. An exception to this is the geometric put option with payoff . Easy calculations show that the price of this dimensional option equals that of the dimensional option with the following parameters
[TABLE]
In every numerical experiment on the geometric basket option, we report the price of the equivalent one dimensional Bermudan option obtained by the CRR tree method (Cox et al., 1979) with discretization time steps.
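A reference price of this kind can be computed with a short binomial tree. The sketch below prices a one-dimensional Bermudan put on a Cox-Ross-Rubinstein tree, assuming no dividends and exercise dates located at tree steps that are multiples of `exercise_every`; the function name, this way of encoding the exercise dates and the parameter values are assumptions made for the illustration.

```python
import math

def crr_bermudan_put(s0, strike, r, sigma, maturity, n_steps, exercise_every=1):
    """Price a Bermudan put on a Cox-Ross-Rubinstein binomial tree.

    Early exercise is allowed at tree steps that are multiples of
    `exercise_every` (time 0 included); no dividends are assumed.
    """
    dt = maturity / n_steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Payoff at maturity; node j has experienced j up moves.
    values = [max(strike - s0 * u**j * d**(n_steps - j), 0.0)
              for j in range(n_steps + 1)]
    for step in range(n_steps - 1, -1, -1):
        # Discounted expectation of the next-step values.
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(step + 1)]
        if step % exercise_every == 0:
            # Compare continuation with immediate exercise.
            values = [max(v, strike - s0 * u**j * d**(step - j))
                      for j, v in enumerate(values)]
    return values[0]

# Hypothetical parameters: 10 exercise dates embedded in a 1000-step tree.
print(crr_bermudan_put(100.0, 100.0, 0.05, 0.2, 1.0, 1000, exercise_every=100))
```

Setting `exercise_every=1` gives the tree approximation of the American price, which bounds the Bermudan price from above.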
We can see from the numerical results of Table 2 that even a small neural network is able to capture the continuation values very well. Increasing the size of the network does not help get a better price but increases the variance unless we ensure a very accurate fit of the network by going through the data several times (see the column epochs=10 for instance), which in turn leads to a much larger computational cost. In comparison, the prices obtained with the standard Longstaff Schwartz algorithm with polynomial regressions of order respectively , and are , and . On this small dimensional example, a low degree polynomial as well as a small neural network give a very accurate price.
The numerical results for the dimensional geometric put option (see Table 3) show the same behavior as the low dimensional problem. Using a small neural network provides very accurate results within of the true price. Passing several times over the data to train the network helps reduce the bias of the price estimator a little, but at the expense of a much higher computational effort. In comparison, the prices obtained with the standard Longstaff Schwartz algorithm with polynomial regression of order respectively and are and . Note that a regression of order is unreachable in dimension . Unlike all other examples, in which the standard Longstaff Schwartz algorithm with polynomial regression tends to exhibit a systematic negative bias, increasing the polynomial degree in this example yields a price above the true one. Note that the true price is always within the confidence intervals reported in Table 3. Our method does not seem to suffer from this positive bias phenomenon.
Finally, we tested our approach on a dimensional geometric basket option with two different numbers of Monte Carlo samples (Table 4) and (Table 5). In both cases, the results obtained with really small neural networks () are already very accurate. However, one should note that increasing the number of epochs with leads to upward-biased prices. This is the result of overfitting during the neural network calibration. Remember that the number of parameters of the network is . For and , it already gives more than 26 million parameters. The learning capabilities of the network are such that it also learns the noise inside the data. Increasing the size of the data set () fixes this issue as one can see in Table 5.
5.1.3 A put basket option
We consider a put basket option with payoff
[TABLE]
We test our algorithm in dimension and report the results in Table 6. The standard Longstaff Schwartz algorithm yields (resp. ) for an order (resp. ) polynomial regression. The prices reported in Table 6 are very close to the one obtained with an order polynomial regression. We can see that using a very large neural network with several hidden layers and several hundred neurons per layer does not really help. The results obtained for a small network with a few dozen neurons are already very good. The difference between the results for epoch and epochs is about half the width of the confidence interval, which makes it not meaningful. Hence, there is no point in spending more computational effort to go through the whole data set more than once.
Now, we turn to a high-dimensional problem and consider a call option on a basket with assets to test the scalability of our approach. The results are reported in Table 7 for Monte Carlo samples and in Table 8 for Monte Carlo samples. As a comparison, Goudenège et al. (2019) reported prices between and for the same option. Our prices lie in this interval. We note that increasing the number of epochs for a relatively small number of Monte Carlo samples gives larger prices. This is all the more striking when the neural network is large, which clearly points to overfitting. Indeed, increasing the number of samples to fixes the issue as the prices in Table 8 are between and for all the neural network configurations and the number of epochs up to . This clearly shows that to avoid an upward bias, we need to increase the number of samples when the size of the problem increases, which was already noted with the standard Longstaff Schwartz algorithm using polynomial regression. With this example, we come to the same conclusions as for the dimensional geometric basket option studied in Section 5.1.2.
5.1.4 A call on the maximum of several assets in the Black Scholes model
We consider a call option on the maximum of assets in the Black Scholes model with payoff
[TABLE]
The different sets of parameters are chosen as in Becker et al. (2019a); Goudenège et al. (2019) to easily compare the prices obtained with the different methods.
Pricing a call on the maximum of a basket of assets is usually far more difficult than pricing a standard basket option because of the strong nonlinearity of the maximum function. For the example of Table 9, the standard Longstaff Schwartz algorithm yields , , for a polynomial regression of order , and respectively. The prices obtained with the polynomial regression vary a lot with the degree of the regression. For this example, Becker et al. (2019a) reported a confidence interval of . The prices reported in Table 9 are very close to this confidence interval. A small neural network () enables us to get values within of the true price, which is a great achievement considering the complexity of the product and the small size of the approximation. As in the other examples, using several passes through the data to train the neural network does not really bring any improvement for small neural networks. For larger networks, it helps a little but in the end larger networks are less accurate than smaller ones. To get the best of larger neural networks, we would need more data to train the networks, i.e. more Monte Carlo samples, as we already observed for the high dimensional geometric put option.
We tested the scalability of our approach on a dimensional max-call option. The results are reported in Table 10 for and Table 11 for . Becker et al. (2019a) report as the confidence interval for the option price. We obtain prices very close to these values. As a comparison, the standard Longstaff Schwartz algorithm yields and for a polynomial regression of order and respectively. The order regression is out of reach. Our deep learning approach scales far better than the standard polynomial regression. Increasing the number of epochs improves the result, which means that the neural network has to be much more finely tuned than in the other examples studied so far.
5.2 A put option in the Heston model
We consider the Heston model defined by
[TABLE]
For the simulation of the model, we use a modified Euler scheme with time steps per year, in which we have replaced by to deal with possibly negative values of the discretized volatility process. For the option of Table 12, the standard Longstaff Schwartz algorithm yields (resp. ) for an order (resp. ) polynomial regression. As in the other examples, the use of a neural network as the regressor provides very accurate results even with a quite small network (no hidden layer and very few neurons, see the case and ).
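An Euler-type scheme that copes with negative values of the discretized variance can be sketched as follows. The sketch uses the standard Heston dynamics and floors the variance at zero wherever it enters the drift or a square root (the variant usually called full truncation), applied to the logarithm of the asset price; whether this matches the exact modification used above is an assumption, as are the function name, the parameter names (`kappa`, `theta`, `xi`, `rho`) and the numerical values.

```python
import numpy as np

def simulate_heston(s0, v0, r, kappa, theta, xi, rho, maturity, n_steps,
                    n_paths, seed=0):
    """Euler scheme for the Heston model with the variance floored at zero.

    max(v, 0) replaces v in the drift and under the square roots
    (full truncation, an assumption for this sketch).
    Returns the terminal asset prices and variances.
    """
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    log_s = np.full(n_paths, np.log(s0))
    v = np.full(n_paths, float(v0))
    for _ in range(n_steps):
        z_v = rng.standard_normal(n_paths)
        # Asset noise correlated with the variance noise.
        z_s = rho * z_v + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
        v_pos = np.maximum(v, 0.0)
        log_s += (r - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z_s
        v += kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z_v
    return np.exp(log_s), v

# Hypothetical parameters, not those of Table 12.
s_T, v_T = simulate_heston(100.0, 0.04, 0.05, 2.0, 0.04, 0.3, -0.7, 1.0, 100, 20000)
print(s_T.mean() * np.exp(-0.05))
```

Working on the logarithm keeps the simulated asset price positive and makes the discounted price an exact martingale of the discrete scheme, which gives a simple sanity check on the simulation.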
6 Conclusion
The difficulties in pricing Bermudan options come from approximating the continuation value at each exercise date. While polynomial regression is widely used for this step, we have investigated the use of deep learning. We have proved the theoretical convergence of our algorithm with respect to both the neural network and Monte Carlo approximations. Our numerical experiments show that the prices computed using our approach are very similar to those obtained from the standard Longstaff Schwartz algorithm. Unsurprisingly, using neural networks does not help much for low dimensional problems but does scale far better on high dimensional problems as it does not suffer from the curse of dimensionality as much as polynomial regression does. Polynomial regression requires a relatively high order to provide accurate prices, which is not feasible in high dimensional problems. The approximation capabilities of neural networks seem far better and relatively small networks already provided very accurate results. Indeed, a few hundred neurons with no hidden layers were sufficient to obtain very accurate prices. Training a neural network usually requires several passes through the whole data set. Yet, in our examples this brought little benefit, mostly because the functional representation of the continuation function should not vary much over time. So, once the neural network has been well trained, one pass over the data () is enough to fit the network at a new date. This saves a lot of computational time. Neural networks have proved to be a very versatile and efficient tool to compute Bermudan option prices, especially when the problem is highly nonlinear.
References
- Abadi et al. [2015] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensor Flow: Large-sc
- Albertini and Sontag [1994] F. Albertini and E. D. Sontag. Uniqueness of weights for recurrent nets. Mathematical Research, 79:599–599, 1994.
- Albertini et al. [1993] F. Albertini, E. D. Sontag, and V. Maillot. Uniqueness of weights for neural networks. Artificial Neural Networks for Speech and Vision, pages 115–125, 1993.
- Arnold [2009] V. I. Arnold. On functions of three variables. Collected Works: Representations of Functions, Celestial Mechanics and KAM Theory, 1957–1965, pages 5–8, 2009.
- Bally and Pages [2003] V. Bally and G. Pages. A quantization algorithm for solving multidimensional discrete-time optimal stopping problems. Bernoulli, 9(6):1003–1049, 2003.
- Becker et al. [2019a] S. Becker, P. Cheridito, and A. Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20(74):1–25, 2019a.
- Becker et al. [2019b] S. Becker, P. Cheridito, A. Jentzen, and T. Welti. Solving high-dimensional optimal stopping problems using deep learning, 2019b.
- Bottou et al. [2018] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
