Learning Threshold-Type Investment Strategies with Stochastic Gradient Method
Zsolt Nika, Miklós Rásonyi

Table 1: Mean Squared Error under different scalings of the step sizes.

| Scaling | AR(1), Dataset-1 | AR(1), Dataset-2 | DGSV, Dataset-1 | DGSV, Dataset-2 |
|---|---|---|---|---|
| No scaling | 7.8 | 1.8 | 6.7 | 19.2 |
| st. dev. | 11.4 | 17.9 | 53 | 120.0 |
| st. dev. | 1.7 | 1.6 | 20 | 8.8 |
Zsolt Nika
Faculty of Information Technology and Bionics
Pázmány Péter Catholic University
Budapest, Hungary
Miklós Rásonyi
Alfréd Rényi Institute of Mathematics
Hungarian Academy of Sciences
Budapest, Hungary
(June 2019)
Abstract
In online portfolio optimization the investor makes decisions based on new, continuously incoming information on financial assets (typically their prices). In our study we consider a learning algorithm, namely the Kiefer–Wolfowitz version of the Stochastic Gradient method, that converges to the log-optimal solution in the threshold-type, buy-and-sell strategy class.
The systematic study of this method is novel in the field of portfolio optimization; we aim to establish the theory and practice of Stochastic Gradient algorithm used on parametrized trading strategies.
We demonstrate on a wide variety of stock price dynamics (e.g. with stochastic volatility and long-memory) that there is an optimal threshold type strategy which can be learned. Subsequently, we numerically show the convergence of the algorithm. Furthermore, we deal with the typically problematic question of how to choose the hyperparameters (the parameters of the algorithm and not the dynamics of the prices) without knowing anything about the price other than a small sample.
Keywords
Stochastic gradient; Log-optimal investment; Online portfolio selection
1 Introduction
Technical analysts base investment decisions on past data such as prices, technical indicators or trading volumes. The decisions are determined by a function of past data called a trading rule or strategy function. In algorithmic trading, these decisions are executed automatically by computers [1]. One of the most typical subcategories of algorithmic trading is high-frequency trading, where favorable decisions must be made in seconds or even milliseconds [2].
Given the nature of these algorithms, those that require huge computational capacities are not efficient since they are slow. For example, non-parametric methods or complex machine learning algorithms work well on big data sets but demand immense computational effort (see the survey of non-parametric methods in [3] and the summary of machine learning methods in [4]). On the other hand, parametric models that are based on the dynamics of prices or indices may give fairly good results if precise and accurate parameter estimates are available. To get satisfactory estimates, one again needs a big data set, and decisions are typically sensitive to the estimation error.
To resolve the above-mentioned problems we use the Kiefer–Wolfowitz method, which (i) lets us make decisions immediately, starting at the initial step; (ii) processes new information as it arrives, without waiting for a big enough data set, since the strategy function improves at every step; and (iii) does not require estimating the parameters of the dynamics. With this method, we aim to optimize log-utility investments (maximizing the expected value of the logarithm of the wealth). The method is also capable of tracking changes of the market, which we ignore here in order to investigate the method itself rather than market changes.
Stochastic Approximation [5] (or the Robbins–Monro method) is an iterative method for finding the parameter value at which the expectation of a function of a stochastic process equals a given constant. Basically, it is a stochastic version of the Newton–Raphson method, where consecutive observations of the function are corrupted by randomness/noise. If the derivative of the function exists, then the method can be used for optimization. When the derivative does not exist or is unknown, Kiefer and Wolfowitz proposed [6] a finite difference approximation based on consecutive observations. This method is a version of the Stochastic Gradient method.
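The finite-difference idea can be illustrated with a minimal sketch. It is not the procedure used in this paper: the test objective `noisy_parabola`, the schedules `a / n` and `c / n ** gamma`, and all parameter values are assumptions chosen for illustration only.

```python
import random


def kiefer_wolfowitz(noisy_objective, theta0, n_steps, a=1.0, c=1.0, gamma=1.0 / 3.0):
    """Climb towards the maximizer of E[noisy_objective(theta)]."""
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = a / n            # step size (assumed schedule)
        c_n = c / n ** gamma   # finite-difference half-width (assumed schedule)
        # Two noisy observations stand in for the unknown derivative.
        slope = (noisy_objective(theta + c_n) - noisy_objective(theta - c_n)) / (2.0 * c_n)
        theta += a_n * slope   # ascent step, since we maximize
    return theta


rng = random.Random(0)


def noisy_parabola(theta):
    # Hypothetical objective with its maximum at theta = 2, observed with noise.
    return -(theta - 2.0) ** 2 + rng.gauss(0.0, 0.1)


theta_hat = kiefer_wolfowitz(noisy_parabola, theta0=0.0, n_steps=5000)
print(theta_hat)
```

Only function evaluations are needed, never the derivative itself, which is what makes the scheme usable when the objective is observed through noisy data.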
Nonetheless, neither stochastic approximation nor the Kiefer–Wolfowitz method has been used for directly optimizing a parametrized strategy as we do here. We hope that this introduction to the use of the method in investment theory will be developed further. Other works took different approaches, such as learning the parametrized stopping time for American/Asian options [7] or the optimal stopping time of liquidation [8]. Another typical application of this algorithm is the estimation of quantiles for CVaR [9], [10], [11]. The optimal splitting of orders is studied in [12].
In Section 2 we introduce threshold strategies, which can be parametrized in a way that allows optimization by the Kiefer–Wolfowitz algorithm. In Section 3 we show how the algorithm works and that the optimum exists. In Section 4 we present numerical results on the performance of the algorithm and discuss how to choose its hyperparameters in a suitable way. Throughout this article we make some simplifications that are usual in investment theory: the investment contains only one risky and one riskless asset. Of course, these assumptions can be relaxed. Given the nature of the learning method, we focus only on discrete-time models.
2 Threshold strategies in log-optimal investments
In this section, we introduce the financial background in which we apply the learning method in the following sections. We start with some preliminaries on investment, then present our threshold-type strategy and discuss how it connects to the theory of log-optimal investments.
2.1 Portfolio
Portfolio investment, mathematically speaking, is an applied field of control theory where the control process is the investor's decision about which asset to allocate the current wealth to, and the independent processes are typically the prices of the financial assets. The logarithmic utility function is frequently used as an objective function for several reasons.
Let us denote the riskless asset by $B_t$ and the risky asset by $S_t$, where $t \in \mathbb{N}$ is the discrete-time parameter. In this study we do not want to focus on the effect of the interest rate; therefore, the riskless process is chosen to be constant (this assumes a zero interest rate, so we do not need to discount the prices). The value of the portfolio is the wealth of the investor, $W_t$, and its time evolution is typically written as

$$W_t = W_{t-1}\left((1-\pi_t)\,\frac{B_t}{B_{t-1}} + \pi_t\,\frac{S_t}{S_{t-1}}\right),$$

where $\pi_t$ is an $\mathcal{F}_{t-1}$-measurable function, called the strategy, i.e. the fraction of the current wealth allocated to the risky asset. Clearly, $\pi_t$ can only be a function of information up to $t-1$, since the investor is not able to look into the future. In financial mathematics, the log increment of the price has several well-established properties, which are called the stylized facts of stock prices [13]. It is convenient to build the dynamics on the log-return and not on the stock price. The log-return is

$$r_t = \log\frac{S_t}{S_{t-1}}.$$

Since the riskless asset's price is constant (therefore the ratio $B_t/B_{t-1}$ is one), we can simplify the wealth as

$$W_t = W_{t-1}\left(1-\pi_t+\pi_t e^{r_t}\right).$$

The investor's objective is to maximize the utility function, the expected logarithm of the wealth at a horizon $T$,

$$\mathbb{E}\left[\log W_T\right].$$

It has been shown in [14] that it can be maximized if the strategy is chosen so as to maximize the conditional expectation of the growth

$$\mathbb{E}\left[\log\frac{W_t}{W_{t-1}}\,\middle|\,\mathcal{F}_{t-1}\right],$$

which under our financial conditions equals

$$\mathbb{E}\left[\log\left(1-\pi_t+\pi_t e^{r_t}\right)\,\middle|\,\mathcal{F}_{t-1}\right]. \tag{2.3}$$

The conditional expectation is a random variable, measurable with respect to $\mathcal{F}_{t-1}$. Conditioning on $\mathcal{F}_{t-1}$ uses far more information than is accessible to an investor or to anyone. In an algorithm we need to specify what information we use (for example past prices or stock market indices); therefore we can only optimize a conditional mean where the condition is a random variable. We denote it as

$$\mathbb{E}\left[\log\left(1-\pi_t+\pi_t e^{r_t}\right)\,\middle|\,X_t\right],$$

where $X_t$ is an $\mathcal{F}_{t-1}$-measurable (multivariate) random variable.
In the following sections we show how to parametrize the strategy process so that the log-optimal strategy can be learned, and then how to choose the conditioning variable.
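The wealth recursion with a constant riskless asset can be sketched in a few lines. The function names `wealth_path` and `mean_log_growth`, and the sample numbers, are hypothetical; the update rule assumes that a fraction of the wealth earns the gross return of the risky asset (the exponential of its log-return) while the rest stays constant.

```python
import math


def wealth_path(log_returns, fractions, initial_wealth=1.0):
    """Wealth after each period for given risky-asset fractions."""
    wealth = [initial_wealth]
    for r, pi in zip(log_returns, fractions):
        # Fraction pi earns exp(r); the remaining 1 - pi sits in the riskless asset.
        wealth.append(wealth[-1] * (1.0 - pi + pi * math.exp(r)))
    return wealth


def mean_log_growth(log_returns, fractions):
    """Average one-period log growth, the quantity a log-optimal investor maximizes."""
    growths = [math.log(1.0 - pi + pi * math.exp(r)) for r, pi in zip(log_returns, fractions)]
    return sum(growths) / len(growths)


log_returns = [0.01, -0.02, 0.03]   # hypothetical log-returns
fractions = [1.0, 0.0, 1.0]         # fully invested, out of the market, fully invested
path = wealth_path(log_returns, fractions)
growth = mean_log_growth(log_returns, fractions)
print(path[-1], growth)
```

In this example the investor skips the losing period, so the final wealth compounds only the first and third returns.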
2.2 Dynamics
The present method can be used on several types of stock price dynamics. It is important to use dynamics where (i) the optimal strategy exists and results in a portfolio which achieves its optimality and (ii) the price dynamics are realistic and plausible. For this reason we rely on the time series class introduced in [15], called Conditionally Gaussian, and on one of its examples, the Discrete Gaussian Stochastic Volatility (DGSV) model:
[TABLE]
This stock price model possesses several desirable properties: its statistical moments and auto-correlation function are realistic, and it includes long memory and the leverage effect as well. The existence of the log-optimal solution is established in [15].
We also use simpler models, such as AR(1) or MA($\infty$) processes, to better understand the behavior of the algorithm:
[TABLE]
[TABLE]
The coefficients and choosing and ensure that 2.7 has long memory.
2.3 Threshold strategy
The log-optimal strategy in (2.3) can only be calculated if the parameters of the stock price dynamics are known. The exact form of the strategy is unknown; one needs to use numerical integration to get the optimal decision at every time step.
In Section 3 of [15] an approximation of the log-optimal strategy was proposed. The authors showed that it performs well on realistic data, though they did not give a mathematical estimate of the error. This approximate strategy reduces the space of possible decisions to two states, $\{0, 1\}$. With realistic log-return data this restriction does not result in a considerable loss, and it can be used with learning algorithms while the log-optimal solution cannot.
The idea can be used for any parametric dynamics if the conditional expectation can be calculated. The approximate strategy proposed in [15] is
[TABLE]
which is a consequence of the requirement in (2.3) with a first-order Taylor expansion. That is, the investor should hold only the risky asset if the conditional expectation of its return is higher than 0. This strategy belongs to the class of threshold strategies.
We remark that we are working in a zero interest rate environment. Without this assumption the trading rule changes to: buy whenever the conditional expectation is higher than the interest rate.
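The Taylor-expansion step can be spelled out as follows. This is a sketch in notation assumed here for illustration: $\pi_t$ is the fraction held in the risky asset, $r_t$ its log-return, $\mathcal{F}_{t-1}$ the information available when deciding, and the fraction is assumed to be restricted to $[0,1]$ (no short selling or leverage).

$$\log\bigl(1-\pi_t+\pi_t e^{r_t}\bigr) = \log\bigl(1+\pi_t(e^{r_t}-1)\bigr) \approx \pi_t\,(e^{r_t}-1) \approx \pi_t\, r_t \quad \text{for small } r_t,$$

$$\mathbb{E}\bigl[\log\bigl(1-\pi_t+\pi_t e^{r_t}\bigr)\,\big|\,\mathcal{F}_{t-1}\bigr] \approx \pi_t\,\mathbb{E}\bigl[r_t\,\big|\,\mathcal{F}_{t-1}\bigr].$$

The right-hand side is linear in $\pi_t$, so over $[0,1]$ it is maximized by $\pi_t = 1$ when the conditional expectation is positive and by $\pi_t = 0$ otherwise. Once the strategy is restricted to $\{0,1\}$ the first relation is no longer an approximation: $\log(1-\pi_t+\pi_t e^{r_t}) = \pi_t r_t$ holds exactly for $\pi_t \in \{0,1\}$.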
Because of the structure of the strategy, we call it here a threshold strategy. We do not need the superscript since we are only investigating this type of strategy with the Stochastic Gradient method.
In most parametric models the conditional expectation can be calculated, so we end up with a function of past data that we call here the threshold function. An equivalent form of (2.8) using the threshold function is
[TABLE]
where the function is 1 if and 0 otherwise. The conditional expectation of the growth (2.3) that we want to optimize here with the indicator function is
[TABLE]
This function is still a random variable because it is a function of past data.
In the following subsections we work through some ways of handling "past data", but it is of course the investor's responsibility to decide which past values to use. Proposition 2.1 gives guidance on what to take into consideration when choosing values from past data.
With the Stochastic Gradient method we are able to optimize an expected value with respect to some parameters. Therefore, in the following we optimize the expected value of the conditional growth. If we parametrize the conditional growth by a univariate or multivariate real parameter, then the optimization task is to find the maximum of the growth
[TABLE]
where is a parametrized version of (2.3).
2.4 Markovian strategy
Let us assume the investor uses only one value that is available before investing at time and call this variable . It can be past stock returns or an index or something more complex, for example the weighted average of the past returns. A natural choice can be the previous value of the return, that is and we stick to this simple case here.
The conditional growth in (2.3) with Markovian strategy:
[TABLE]
We need to parametrize the threshold function in the strategy to be able to use it with the Stochastic Gradient method. A convenient choice is the linear function; in this paper we do not relax this restriction, but we mention that can be a function of though.
[TABLE]
The optimizable growth in (2.11) is
[TABLE]
In Section 3 we will optimize this function with the Kiefer–Wolfowitz method.
The proposition below gives the optimal threshold of the Markovian strategy.
Proposition 2.1.
Let us assume that there is only one root of the differentiable function and that if . Moreover, let us assume that the return process is stationary. Then the root of is the unique optimal threshold:
[TABLE]
Proof.
For the sake of simplicity assume that has a pdf. The conditional expectation of the growth is
[TABLE]
Since is a function of , call it and denote the pdf of as , the expected value is
[TABLE]
The integral has optimum where
[TABLE]
Since the pdf is non-negative, the optimal threshold is where the conditional expectation vanishes, which concludes our statement.
∎
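The argument of the proof can be written out as a short calculation. The notation is an assumption made here for illustration: $g(x) = \mathbb{E}[r_t \mid X_t = x]$ is the conditional expectation of the return, $f$ is the pdf of $X_t$, $\theta$ is the threshold and $G(\theta)$ is the expected growth of the $\{0,1\}$-valued threshold strategy.

$$G(\theta) = \mathbb{E}\bigl[\mathbf{1}\{X_t > \theta\}\, r_t\bigr] = \mathbb{E}\bigl[\mathbf{1}\{X_t > \theta\}\, g(X_t)\bigr] = \int_{\theta}^{\infty} g(x)\, f(x)\, dx,$$

$$G'(\theta) = -\,g(\theta)\, f(\theta).$$

Wherever $f(\theta) > 0$, the derivative vanishes only at a root $\theta^*$ of $g$. If $g(x) > 0$ for $x > \theta^*$ and $g(x) < 0$ for $x < \theta^*$, then $G'(\theta) \ge 0$ to the left of $\theta^*$ and $G'(\theta) \le 0$ to the right of it, so $\theta^*$ is a maximum rather than a minimum.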
Remark 1.
The main message of the proposition is that only information which is not mean-independent [16] of the price process can be used in the optimization algorithm. Mean independence is a well-known concept in econometrics; it is a stronger property than uncorrelatedness but weaker than stochastic independence.
Remark 2.
A consequence of Proposition 2.1 is that the linear approximate strategy is log-optimal if the strategy can only be 0 or 1. This is only true in the univariate case.
For a simple example, let us model the log-return as an autoregressive process and let us use the previous log-return value as "past data".
Example 2.1 (AR(1)).
Let the log-return be defined as in (2.6). The conditional expectation is
[TABLE]
Its root, that is the optimal threshold is
[TABLE]
When , then the assumption in Theorem 2.1 about if is false, but the optimality is true if we change the inequality sign in (2.13) to .
(We remind the reader, that the expected log-return is different, .)
As we can see from the example, to determine the threshold we either need to estimate the model parameters from a long enough sample, or we learn the threshold directly by using the Stochastic Gradient method. In more realistic dynamics there are more than two parameters that need to be estimated. Furthermore, the threshold is very sensitive to the estimation error.
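The example can be checked numerically with a small Monte Carlo sketch. The AR(1) parametrization below, with intercept `mu`, coefficient `alpha` and noise scale `sigma`, is an assumption made here (the form of (2.6) may differ), and the parameter values are hypothetical. Under this form the conditional expectation of the return given the previous return `x` is `mu + alpha * x`, with root `-mu / alpha`.

```python
import random


def simulate_ar1(n, mu, alpha, sigma, seed=0):
    """Assumed AR(1) form: r_t = mu + alpha * r_{t-1} + sigma * eps_t."""
    rng = random.Random(seed)
    r = mu / (1.0 - alpha)  # start at the stationary mean
    path = []
    for _ in range(n):
        r = mu + alpha * r + sigma * rng.gauss(0.0, 1.0)
        path.append(r)
    return path


def empirical_growth(returns, theta):
    """Sample average of 1{r_{t-1} > theta} * r_t."""
    total = sum(r_now for r_prev, r_now in zip(returns, returns[1:]) if r_prev > theta)
    return total / (len(returns) - 1)


mu, alpha, sigma = 0.1, 0.5, 1.0
returns = simulate_ar1(100_000, mu, alpha, sigma)
grid = [-1.0 + 0.1 * k for k in range(21)]          # candidate thresholds
best_theta = max(grid, key=lambda th: empirical_growth(returns, th))
theta_star = -mu / alpha                            # root of mu + alpha * x
print(best_theta, theta_star)
```

The growth curve is flat near its maximum, so the grid maximizer only locates the root approximately. The stationary mean `mu / (1 - alpha)` is a different quantity from the root.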
Example 2.2 (DGSV).
Let the log-return be a DGSV process according to (2.5). Its conditional expectation is
[TABLE]
The conditional expectation is unknown but we will see later in the numerical results that there is a unique solution.
2.5 Non-Markovian strategy - multivariate case
If the investor would rather use more information, for example to handle long memory or information about volatility, this is also possible. We show here two possible choices: one strategy uses multiple past returns, the other uses volatility information in addition. The strategies are
[TABLE]
where are the parameters we wish to optimize and is an estimate of the logarithm of the volatility based on the information of (that is, ). The design of the second strategy with the log-volatility may seem peculiar, but the linear approximation of the log-optimal strategy in [15] has been shown to be a linear function of and .
An important aspect of the strategy with volatility is that it lets us capture the leverage effect. As we noted in Remark 1, only those processes should be used in the threshold function which are not mean-independent of the log-return. The leverage effect is defined in several ways; in any case it is a connection between the stock price change and past volatility. Noises in the price that have no leverage effect, for example the noise term in (2.5), offer no advantage in the investment.
The leverage effect has a prominent role, since it is the only way we can utilize volatility, whereas long memory typically appears in volatility. As has been shown in [13], the long memory is hidden in the volatility and not in the drift part of the process.
In the multivariate case there is no closed form of the optimal values. Of course, the first-order optimality conditions must be satisfied. For example, in the two-dimensional version of (2.16a) the optimal parameters must satisfy the
[TABLE]
equations, where and is the joint pdf of . The equations are more complicated in the DGSV case: if we wish to include the log-volatility, we need to replace the variable and reinterpret the pdf and the conditional mean. These are unknown functions in general, and we could only estimate the pdf and the conditional expectation from data, which is contrary to our goals.
This does not mean that the Kiefer–Wolfowitz algorithm cannot converge to the optimal parameters, only that we cannot calculate their optimal values in advance. If the dynamics are known, then the Monte Carlo method can be used to estimate the optimal value. This is what we use in the numerical simulations.
Here we would like to show the basics of how to use the Kiefer–Wolfowitz algorithm for investment purposes. Other processes could also be used.
3 Kiefer–Wolfowitz algorithm
With the Kiefer–Wolfowitz optimization procedure we are searching for the maximum of (2.11).
Univariate case:
the task is to find the optimum threshold
[TABLE]
the random processes and are both univariate. Let us denote the growth at time by . The Stochastic Gradient algorithm uses the finite differences of the growth:
[TABLE]
where the step-size and the step-size of the finite difference are real-valued sequences. The fraction is the approximation of the gradient.
Since the growth is an indicator function of , its finite difference can be simplified to a range. For greater clarity we denote the range as . Then the algorithm can be written as
[TABLE]
This formalism will help us later to better understand the usage of the method.
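The range form of the update can be sketched as follows. This is an illustration under stated assumptions, not the exact recursion (3.3): the one-step growth is taken to be the return multiplied by the indicator that the signal `x` exceeds the threshold `theta`, the schedules `a / t` and `c / t ** gamma` are assumed, and the AR(1) data at the end are hypothetical.

```python
import random


def kw_threshold_step(theta, x, r, a_t, c_t):
    """One Kiefer-Wolfowitz update for the growth 1{x > theta} * r.

    The two-sided finite difference of the indicator is nonzero only when
    x lies in (theta - c_t, theta + c_t], where it equals -r / (2 * c_t).
    """
    in_range = theta - c_t < x <= theta + c_t
    gradient = -r / (2.0 * c_t) if in_range else 0.0
    return theta + a_t * gradient  # ascent step


def learn_threshold(signals, returns, theta0, a=1.0, c=1.0, gamma=1.0 / 3.0):
    theta = theta0
    for t, (x, r) in enumerate(zip(signals, returns), start=1):
        theta = kw_threshold_step(theta, x, r, a / t, c / t ** gamma)
    return theta


# Hypothetical AR(1) log-returns; the signal is the previous return.
rng = random.Random(1)
path = [0.0]
for _ in range(20_000):
    path.append(0.1 + 0.5 * path[-1] + rng.gauss(0.0, 1.0))
theta_hat = learn_threshold(path[:-1], path[1:], theta0=0.0)
print(theta_hat)
```

The threshold moves only when the observation falls inside the current range, which is why the width of the range relative to the scale of the data matters so much for this method.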
It is impossible to prove in general, but via some examples in Section 4 we show numerically that this recursive update converges to the optimum that we derived in the previous section:
[TABLE]
The convergence is in $L^2$, i.e. we can show the convergence of the Mean Squared Error (MSE). When convergence is achieved, its rate typically follows a power law.
In general, there is no straightforward way to choose the hyperparameters. In Section 4 we present some ideas on how to choose them.
In real-life investment the financial environment is not static: the dynamics of prices can change and new factors can appear or disappear, so the optimal strategy changes as well. For this reason, in practice investors use constant and very small step sizes, which are able to track the changes of the optimal values. In this paper we do not focus on changes of the market.
Multivariate case:
the algorithm works in the same way: each dimension of the parameter is updated separately, with no cross-effect. For example, in the case of known log-volatility (2.16b) the growth is
[TABLE]
4 Numerical Results
The critical part of every algorithm is the choice of the hyperparameters. In their paper, J. Kiefer and J. Wolfowitz [6] also address the issue of parameter choice, though they were able to give exact and sufficient conditions in a simpler context. Writing $a_t$ for the step size and $c_t$ for the step size of the finite difference, these conditions are typical requirements and our model satisfies them as well:
1. $c_t \to 0$ as $t \to \infty$.
2. $\sum_{t} a_t = \infty$, that is, the algorithm can reach any state.
3. $\sum_{t} a_t c_t < \infty$.
4. $\sum_{t} a_t^2 c_t^{-2} < \infty$.
A usual first guess choice is and .
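As a worked check, consider a polynomial schedule, which is an assumption made here for illustration: $a_t = a/t$ and $c_t = c\,t^{-\gamma}$ with $a, c > 0$. The four classical Kiefer–Wolfowitz conditions of [6] then read

$$c_t = c\,t^{-\gamma} \to 0 \iff \gamma > 0, \qquad \sum_{t} a_t = a \sum_{t} \frac{1}{t} = \infty,$$

$$\sum_{t} a_t c_t = a c \sum_{t} t^{-(1+\gamma)} < \infty \iff \gamma > 0, \qquad \sum_{t} \frac{a_t^2}{c_t^2} = \frac{a^2}{c^2} \sum_{t} t^{-(2-2\gamma)} < \infty \iff \gamma < \tfrac{1}{2}.$$

Under this schedule all four conditions hold exactly when $0 < \gamma < \tfrac{1}{2}$: the finite-difference width must shrink, but more slowly than $t^{-1/2}$, so that the noise in the gradient estimate remains summable.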
Analyzing the growth function in the univariate case helps us to construct the step sizes in a suitable way. Figure 1 and (3.3) make it clear that must stay in the same range as , since would result in constant . In the numerical simulations we only show results about the case. On the two examples we can make the following remarks:
- A low threshold means that the strategy is always 1, that is, the wealth equals the price of the stock.
- A high threshold means that the strategy is always 0, that is, the wealth equals the price of the bond.
- In the simple case when the log-return is an autoregressive process, and also when it has the more complex, realistic DGSV dynamics, there is a unique optimum that can be calculated.
- If the threshold takes a value outside the typical values of the observed process, where the derivative of the growth is zero, then it is hopeless for the algorithm to return and it stays there.
To overcome the problem in the last remark we make some modifications to the algorithm. First, the initial value must be estimated on a small sample of the observed process. In every realization we used 10 data points to initialize the threshold. This very small sample is already enough for the algorithm to start from a relatively good point. Second, we cannot let the algorithm take arbitrarily large steps. A general solution for this is to use a projection onto a subset. In our case we truncate to the known range of the observed process:
[TABLE]
where .
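The two modifications can be sketched together. Several details are assumptions made here because they are not specified above: the initial value is taken as the mean of the warm-up sample, the "known range" is taken as the minimum and maximum observed so far, and the function names, schedules and AR(1) test data are hypothetical.

```python
import random


def clip(value, low, high):
    """Project value onto the interval [low, high]."""
    return max(low, min(high, value))


def learn_with_truncation(signals, returns, n_init=10, a=1.0, c=1.0, gamma=1.0 / 3.0):
    warmup = signals[:n_init]
    theta = sum(warmup) / n_init           # assumed initial estimate: warm-up mean
    low, high = min(warmup), max(warmup)   # assumed known range: observed so far
    for t, (x, r) in enumerate(zip(signals[n_init:], returns[n_init:]), start=1):
        low, high = min(low, x), max(high, x)
        a_t, c_t = a / t, c / t ** gamma
        if theta - c_t < x <= theta + c_t:
            theta -= a_t * r / (2.0 * c_t)  # raw Kiefer-Wolfowitz step
        theta = clip(theta, low, high)      # truncation keeps theta where data occur
    return theta, low, high


rng = random.Random(2)
path = [0.0]
for _ in range(5_000):
    path.append(0.1 + 0.5 * path[-1] + rng.gauss(0.0, 1.0))
# A deliberately large step constant, so that the truncation is actually exercised.
theta_hat, low, high = learn_with_truncation(path[:-1], path[1:], a=50.0)
print(theta_hat, (low, high))
```

Without the projection, a single oversized early step could push the threshold to a region where no observation ever falls inside the range, and the recursion would stop updating.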
Using the simple parametrization and can work in a simple setting. Figure 2 shows how the simple choice of the hyperparameters works. The simulations were executed with realizations and for time steps. The Mean Squared Error (MSE) is an approximation of the error. The log-log plot of the error shows that the MSE decays as a power law in both cases.
The requirement that must stay in the range for a significant part of the time fails if we scale the process. This problem can be handled if we scale the steps of the algorithm accordingly. Since the problem is in the step function , the steps have to reflect the scale of the process ( and are on the same scale). If we rescale the variable , then we need to compensate the term as well. Therefore the steps are the following:
[TABLE]
where the scaling factor equals the standard deviation of the observed process. It could be an estimate of the standard deviation, but for simplicity we used the whole dataset to estimate it.
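One plausible reading of the scaling, given here only as a hedged sketch, multiplies both the step size and the finite-difference width by a multiple of the sample standard deviation of the signal. The function names, the `multiplier` argument, the schedules and the test data are assumptions made for illustration, not the exact form of the scaled steps.

```python
import math
import random


def sample_std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def learn_with_scaling(signals, returns, theta0, multiplier=1.0,
                       a=1.0, c=1.0, gamma=1.0 / 3.0):
    scale = multiplier * sample_std(signals)  # whole-sample estimate, for simplicity
    theta = theta0
    for t, (x, r) in enumerate(zip(signals, returns), start=1):
        a_t = scale * a / t            # scaled step size
        c_t = scale * c / t ** gamma   # scaled finite-difference width
        if theta - c_t < x <= theta + c_t:
            theta -= a_t * r / (2.0 * c_t)
    return theta


rng = random.Random(3)
path = [0.0]
for _ in range(5_000):
    path.append(0.1 + 0.5 * path[-1] + rng.gauss(0.0, 1.0))
signals, returns = path[:-1], path[1:]

theta_a = learn_with_scaling(signals, returns, theta0=0.0, multiplier=5.0)
# The same data expressed in units four times smaller.
theta_b = learn_with_scaling([4.0 * x for x in signals], [4.0 * r for r in returns],
                             theta0=0.0, multiplier=5.0)
print(theta_a, theta_b / 4.0)
```

With the steps tied to the standard deviation, rescaling the data by a constant rescales the learned threshold by the same constant, so the behavior of the recursion no longer depends on the units of the process.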
The performance of using the scaling factor on both and is shown in Table 1 and in the figure. The table shows the Mean Squared Error at , while the figure shows the function with different scaling. The parameter settings of the table and the figure are defined below in (4.2) and (4.3). In Dataset-2, when is smaller, the Mean Squared Error is higher even though the process's variation is higher (in the AR(1) case the variation is ). This is because the lower the , the less information we have, so it is more difficult to learn. Figures 3 and 4 show that without scaling the algorithm at first waits until achieves a suitable size, while scaling speeds this up and the algorithm uses appropriate ’s. The numerical results also show that the best way to scale the process is to use five times the standard deviation of the process.
Dataset 1:
[TABLE]
Dataset 2:
[TABLE]
(In the AR(1) only make sense.)
Funding
The first author gratefully acknowledges the support of Új Nemzeti Kiválóság Program 2018/2019 of the Ministry of Human Capacities (project number: ÚNKP-18-3-IV-PPKE-21). The second author acknowledges support from the "Lendület" grant LP 2015-16 of the Hungarian Academy of Sciences (Lendület grant LM 2015-16) and was supported by the NKFIH (National Research, Development and Innovation Office, Hungary) grant KH 126505.
[1] Kendall Kim. Electronic and algorithmic trading technology: the complete guide. Academic Press, 2010.
[2] Irene Aldridge. High-frequency trading: a practical guide to algorithmic strategies and trading systems, volume 604. John Wiley & Sons, 2013.
[3] Bin Li and Steven C. H. Hoi. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3):35, 2014.
[4] Marcos Lopez De Prado. Advances in financial machine learning. John Wiley & Sons, 2018.
[5] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[6] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[7] Zhenhua Zhang, G. Yin, and Zhian Liang. A stochastic approximation algorithm for American lookback put options. Stochastic Analysis and Applications, 29(2):332–351, 2011.
[8] G. Yin, Qing Zhang, F. Liu, R. H. Liu, and Y. Cheng. Stock liquidation via stochastic approximation using NASDAQ daily and intra-day data. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 16(1):217–236, 2006.
