Learning Threshold-Type Investment Strategies with Stochastic Gradient Method

Zsolt Nika; Miklós Rásonyi

arXiv:1907.02457·q-fin.PM·July 5, 2019


TL;DR

This paper introduces a stochastic gradient-based learning algorithm for threshold-type investment strategies in online portfolio optimization, demonstrating its convergence and effectiveness across various stock price models.

Contribution

It is the first systematic study applying the Kiefer--Wolfowitz stochastic gradient method to learn optimal threshold strategies in portfolio optimization.

Findings

1. In numerical experiments, the algorithm converges to the log-optimal threshold strategy.

2. Optimal threshold strategies exist across diverse stock price models.

3. Hyperparameter tuning can be effectively performed with limited data.


Tables

Table 1: Performance of the algorithm with different scalings of the steps $a_{t}$ and $c_{t}$ . Entries are Mean Squared Errors in the units given in each column header.

Scaling	AR(1), Dataset-1 ( $\times 10^{-6}$ )	AR(1), Dataset-2 ( $\times 10^{-5}$ )	DGSV, Dataset-1 ( $\times 10^{-6}$ )	DGSV, Dataset-2 ( $\times 10^{-5}$ )
No scaling	7.8	1.8	6.7	19.2
$K=$ st. dev. $(H_{t})$	11.4	17.9	53	120.0
$K=5\times$ st. dev. $(H_{t})$	1.7	1.6	20	8.8

Equations

$$\frac{W_{t}}{W_{t-1}}=(1-\pi_{t})\frac{B_{t}}{B_{t-1}}+\pi_{t}\frac{S_{t}}{S_{t-1}},$$

$$H_{t}:=\log\left(\frac{S_{t}}{S_{t-1}}\right).$$

$$\frac{W_{t}}{W_{t-1}}=1-\pi_{t}+\pi_{t}e^{H_{t}}.$$

$$\lim_{t\to\infty}\frac{1}{t}\mathbb{E}[\log(W_{t})].$$

$$\tilde{g}:=\mathbb{E}[\log(W_{t}/W_{t-1})\mid\mathcal{F}_{t-1}]\to\text{maximize},$$

$$\tilde{g}:=\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid\mathcal{F}_{t-1}]\to\text{maximize}.$$

$$\tilde{g}(X)=\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid X],$$

$$H_{t}=\mu+\alpha H_{t-1}+\sigma e^{Y_{t}}\left(\rho\varepsilon_{t}+\sqrt{1-\rho^{2}}\,\eta_{t}\right),$$

$$Y_{t}=\sum_{j=0}^{\infty}\beta_{j}\varepsilon_{t-j},\qquad\beta_{j},\mu,\sigma\in\mathbb{R};\ \alpha,\rho\in[-1,1].$$

$$H_{t}=\mu+\alpha H_{t-1}+\sigma\varepsilon_{t}\quad:\text{AR(1)},$$

$$H_{t}=\mu+\sum_{j=0}^{\infty}\beta_{j}\varepsilon_{t-j}\quad:\text{MA}(\infty).$$

$$\pi_{t}^{lin}=\begin{cases}1,&\text{if }\mathbb{E}[H_{t}\mid\mathcal{F}_{t-1}]>0,\\0,&\text{otherwise},\end{cases}$$

$$\pi_{t}=\mathbbm{1}_{\{f(\text{past data})>0\}},$$

$$\tilde{g}=\mathbb{E}\left[\log\left(1-\mathbbm{1}_{\{f(\text{past data})>0\}}+\mathbbm{1}_{\{f(\text{past data})>0\}}e^{H_{t}}\right)\mid\text{past data}\right].$$

$$g(\theta):=\mathbb{E}[\tilde{g}(X_{t-1},\theta)],$$

$$\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid X_{t-1}].$$

$$\pi_{t}:=\mathbbm{1}_{\{X_{t-1}>\theta\}}.$$

$$g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}].$$

$$\theta^{*}=\{x\mid\phi(x)=0\}.$$

$$g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}]=\mathbb{E}\left[\mathbb{E}[H_{t}\mid X_{t-1}]\mathbbm{1}_{\{X_{t-1}>\theta\}}\right].$$

$$\int_{\theta}^{\infty}v(y)f_{X}(y)\,dy.$$

$$-v(y)f_{X}(y)=0.$$

$$\mathbb{E}[H_{t}\mid H_{t-1}=x]=\mu+\alpha x.$$

$$\theta^{*}=-\frac{\mu}{\alpha}.$$

$$\mathbb{E}[H_{t}\mid H_{t-1}=x]=\mu+\alpha x+\sigma\rho\,\mathbb{E}[e^{Y_{t}}\mid H_{t-1}].$$

$$\pi_{t}=\mathbbm{1}_{\{H_{t-1}+\theta^{2}H_{t-2}+\theta^{3}H_{t-3}+\dots>\theta^{1}\}}\quad\text{or}$$

$$\pi_{t}=\mathbbm{1}_{\{H_{t-1}+\theta^{2}e^{\nu_{t-1}}>\theta^{1}\}},$$

$$\partial g/\partial\theta^{1}=\int_{-\infty}^{\infty}v(\theta^{1}-\theta^{2}x,x)f(\theta^{1}-\theta^{2}x,x)\,dx=0,$$

$$\partial g/\partial\theta^{2}=\int_{-\infty}^{\infty}-x\,v(\theta^{1}-\theta^{2}x,x)f(\theta^{1}-\theta^{2}x,x)\,dx=0,$$

$$\text{maximize}_{\theta}\;g(\theta):=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}],$$

$$\theta_{t+1}=\theta_{t}+a_{t}\frac{G(\theta_{t}-c_{t};H_{t},X_{t-1})-G(\theta_{t}+c_{t};H_{t},X_{t-1})}{c_{t}},$$

$$\theta_{t+1}=\theta_{t}+a_{t}\frac{H_{t}\mathbbm{1}_{\{X_{t-1}\in[\theta_{t}\pm c_{t}]\}}}{c_{t}}.$$


Taxonomy

Topics: Complex Systems and Time Series Analysis · Stochastic processes and financial applications · Advanced Bandit Algorithms Research

Full text

Learning Threshold-Type Investment Strategies with Stochastic Gradient Method

Zsolt Nika

Faculty of Information Technology and Bionics

Pázmány Péter Catholic University

Budapest, Hungary

Miklós Rásonyi

Alfréd Rényi Institute of Mathematics

Hungarian Academy of Sciences

Budapest, Hungary

(June 2019)

Abstract

In online portfolio optimization the investor makes decisions based on new, continuously incoming information on financial assets (typically their prices). In our study we consider a learning algorithm, namely the Kiefer–Wolfowitz version of the Stochastic Gradient method, that converges to the log-optimal solution in the threshold-type, buy-and-sell strategy class.

The systematic study of this method is novel in the field of portfolio optimization; we aim to establish the theory and practice of the Stochastic Gradient algorithm applied to parametrized trading strategies.

We demonstrate on a wide variety of stock price dynamics (e.g. with stochastic volatility and long-memory) that there is an optimal threshold type strategy which can be learned. Subsequently, we numerically show the convergence of the algorithm. Furthermore, we deal with the typically problematic question of how to choose the hyperparameters (the parameters of the algorithm and not the dynamics of the prices) without knowing anything about the price other than a small sample.

Keywords

Stochastic gradient; Log-optimal investment; Online portfolio selection

1 Introduction

In investment, technical analysts base their decisions on past data such as prices, technical indicators or trading volumes. The decisions are determined by some function of past data, called a trading rule or strategy function. In algorithmic trading, these decisions are executed automatically by computers [1]. One of the most typical subcategories of algorithmic trading is high-frequency trading, where favorable decisions must be made in seconds or even milliseconds [2].

Given these time constraints, algorithms that require huge computational capacity are inefficient because they are slow. For example, non-parametric methods or complex machine learning algorithms work well on big data sets but demand immense computational effort (see the survey of non-parametric methods in [3] and the summary of machine learning methods in [4]). On the other hand, parametric models based on the dynamics of prices or indices may give fairly good results if precise and accurate parameter estimates are available. Obtaining satisfactory estimates again requires a big data set, and decisions are typically sensitive to the estimation error.

To resolve the above-mentioned problems we use the Kiefer–Wolfowitz method, which (i) lets us make decisions immediately, starting at the initial step; (ii) processes new information as it arrives, without waiting for a big enough data set, since the strategy function is updated in every step; and (iii) does not require estimating the parameters of the dynamics. With this method, we aim to optimize log-utility investments (maximizing the expected value of the logarithm of the wealth). The method is also capable of tracking changes of the market, which we ignore here in order to study the method itself rather than market changes.

Stochastic Approximation [5] (or the Robbins–Monro method) is an iterative method to find the root of $f(\theta):=\mathbb{E}[X_{t},\theta]=m$ , where $X_{t}$ is a stochastic process, $\theta$ is a parameter and $m$ is a constant. Basically, it is a stochastic version of the Newton–Raphson method in which the consecutive observations of the function are corrupted by noise. If the derivative of the function exists, the method can be used for optimization by searching for a root of the derivative. When the derivative does not exist or is unknown, Kiefer and Wolfowitz [6] proposed a finite difference approximation based on consecutive observations. This method is a version of the Stochastic Gradient method.
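As an illustration of the root-finding recursion described above, the following minimal Python sketch (not code from the paper) iterates $\theta_{n+1}=\theta_{n}-a_{n}(\text{observation}-m)$ . The names `robbins_monro` and `noisy_f`, the step size $a_{n}=1/n$ and the linear test function are assumptions made only for this example.

```python
import random


def robbins_monro(noisy_f, m, theta0, n_steps):
    """Iterate theta_{n+1} = theta_n - a_n * (noisy_f(theta_n) - m) with a_n = 1/n.

    Assumes the mean of noisy_f is increasing in theta, so that stepping
    against the observed excess over m moves theta towards the root.
    """
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n  # assumed step size: the sum diverges, the sum of squares converges
        theta -= a_n * (noisy_f(theta) - m)
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical target: E[noisy_f(theta)] = theta, observed with Gaussian noise.
    noisy_f = lambda theta: theta + rng.gauss(0.0, 1.0)
    print(robbins_monro(noisy_f, m=3.0, theta0=0.0, n_steps=10_000))
```

With a noise-free observation function the recursion can be followed by hand, which makes it easy to check the direction of each step before adding noise.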

Nonetheless, neither stochastic approximation nor the Kiefer–Wolfowitz method has been used to directly optimize a parametrized strategy as we do here. We hope that this introduction to the use of the method in investment theory will be developed further. Other works took different approaches, such as learning the parametrized stopping time for American/Asian options [7] or the optimal stopping time of liquidation [8]. Another typical application of this algorithm is the estimation of quantiles for CVaRs [9], [10] or [11]. There is also a study of the optimal splitting of orders in [12].

In Section 2 we introduce the threshold strategies, which can be parametrized so that they can be optimized by the Kiefer–Wolfowitz algorithm. In Section 3 we show how the algorithm works and that the optimum exists. In Section 4 we present numerical results on the performance of the algorithm and address the problem of choosing its hyperparameters in a suitable way. Throughout this article we make some simplifications that are usual in investment theory: the investment contains only one risky and one riskless asset. These assumptions can, of course, be relaxed. Given the nature of the learning method, we focus only on discrete-time models.

2 Threshold strategies in log-optimal investments

In this section, we introduce the financial background in which we want to apply the learning method in the next sections. At first, we start with some preliminary information about investment, after which we present our threshold type strategy and we discuss how it connects to the theory of log-optimal investments.

2.1 Portfolio

Portfolio investment, mathematically speaking, is an applied field of control theory where the control process is the investor’s decision regarding in which asset to allocate her/his current wealth, and the independent processes are typically the prices of the financial assets. Logarithmic utility function is used frequently as an objective function for several reasons.

Let us denote the riskless process by $B_{t}$ and the risky asset by $S_{t}$ , where $t\in\mathbb{N}$ is the discrete-time parameter. In this study, we do not want to focus on the effect of the interest rate; therefore, the riskless process is chosen to be constant (this assumes a zero interest rate, so we do not need to discount the prices). The value of the portfolio $W_{t}$ is the wealth of the investor and its time-evolution is typically written as

$$\frac{W_{t}}{W_{t-1}}=(1-\pi_{t})\frac{B_{t}}{B_{t-1}}+\pi_{t}\frac{S_{t}}{S_{t-1}},$$

where $\pi_{t}\in[0,1]$ is an $\mathcal{F}_{t-1}$ -measurable function, called the strategy, i.e. the fraction of the current wealth invested in the risky asset. Clearly, $\pi_{t}$ can only be a function of information up to $t-1$ since the investor is not able to look into the future. In financial mathematics, the log increment of the price has several well-established statistical properties, which are called the stylized facts of stock prices [13]. It is therefore convenient to build the dynamics on the log-return and not on the stock price. The log-return is defined as

$$H_{t}:=\log\left(\frac{S_{t}}{S_{t-1}}\right).$$

Since the riskless asset’s price is constant (so the ratio $B_{t}/B_{t-1}$ equals one), we can simplify the wealth evolution to

$$\frac{W_{t}}{W_{t-1}}=1-\pi_{t}+\pi_{t}e^{H_{t}}.$$
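The simplified wealth recursion $W_{t}=W_{t-1}(1-\pi_{t}+\pi_{t}e^{H_{t}})$ can be traced step by step. The sketch below is only an illustration; the function name `wealth_path` and the initial wealth `w0` are assumptions and do not come from the paper.

```python
import math


def wealth_path(log_returns, positions, w0=1.0):
    """Apply W_t = W_{t-1} * (1 - pi_t + pi_t * exp(H_t)) for each time step."""
    if len(log_returns) != len(positions):
        raise ValueError("need one position per log-return")
    wealth = [w0]
    for h_t, pi_t in zip(log_returns, positions):
        if not 0.0 <= pi_t <= 1.0:
            raise ValueError("pi_t must lie in [0, 1]")
        # pi_t is decided before H_t is observed (it is F_{t-1}-measurable).
        wealth.append(wealth[-1] * (1.0 - pi_t + pi_t * math.exp(h_t)))
    return wealth


# Fully invested: the wealth follows the stock price, W_T = exp(sum of log-returns).
print(wealth_path([0.1, -0.2], [1.0, 1.0])[-1])
# Not invested: the wealth stays constant under the zero interest rate assumption.
print(wealth_path([0.1, -0.2], [0.0, 0.0])[-1])
```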

The investor’s objective is to maximize the utility function

$$\lim_{t\to\infty}\frac{1}{t}\mathbb{E}[\log(W_{t})].$$

It has been shown in [14] that it can be maximized if the strategy $\pi_{t}$ is chosen so as to maximize the conditional expectation of the growth

$$\tilde{g}:=\mathbb{E}[\log(W_{t}/W_{t-1})\mid\mathcal{F}_{t-1}]\to\text{maximize},$$

or, under our financial conditions, equivalently

$$\tilde{g}:=\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid\mathcal{F}_{t-1}]\to\text{maximize}.\tag{2.3}$$

The conditional expectation $\tilde{g}$ is a random variable, measurable with respect to $\mathcal{F}_{t-1}$ . The condition on $\mathcal{F}_{t-1}$ contains far more information than is accessible to any investor. In an algorithm we need to specify what information we use (for example past prices or stock market indices); therefore we can only optimize a conditional mean where the condition is a random variable. We denote it as

$$\tilde{g}(X)=\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid X],$$

where $X$ is an $\mathcal{F}_{t-1}$ -measurable (multivariate) random variable.

In the following sections we show how to parametrize the strategy process $\pi_{t}$ to be able to learn the log-optimal strategy and then how to choose the variable $X$ .

2.2 Dynamics

The present method can be used with several types of stock price dynamics. It is important to use dynamics where (i) the optimal strategy exists and results in a portfolio which achieves its optimality and (ii) the price dynamics are realistic and plausible. For this reason we rely on the time series class introduced in [15], called Conditionally Gaussian, and on one of its examples, the Discrete Gaussian Stochastic Volatility (DGSV) model:

$$H_{t}=\mu+\alpha H_{t-1}+\sigma e^{Y_{t}}\left(\rho\varepsilon_{t}+\sqrt{1-\rho^{2}}\,\eta_{t}\right),\qquad Y_{t}=\sum_{j=0}^{\infty}\beta_{j}\varepsilon_{t-j},\tag{2.5}$$

where $\beta_{j},\mu,\sigma\in\mathbb{R}$ and $\alpha,\rho\in[-1,1]$ .

This stock price model possesses several desirable properties: its statistical moments and auto-correlation function are realistic, and it includes long memory and the leverage effect as well. The existence of the log-optimal solution is established in [15].

We also use simpler models, such as AR(1) or MA( $\infty$ ) processes, to better understand the behavior of the algorithm:

$$H_{t}=\mu+\alpha H_{t-1}+\sigma\varepsilon_{t}\quad:\text{AR(1)},\tag{2.6}$$

$$H_{t}=\mu+\sum_{j=0}^{\infty}\beta_{j}\varepsilon_{t-j}\quad:\text{MA}(\infty).\tag{2.7}$$

Choosing the coefficients $\beta_{j}:=b_{0}(1+j)^{-b}$ with $b_{0}>0$ and $0.5<b<1$ ensures that (2.7) has long memory.
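To make the dynamics concrete, the sketch below simulates the AR(1) model and a truncated version of the DGSV model $H_{t}=\mu+\alpha H_{t-1}+\sigma e^{Y_{t}}(\rho\varepsilon_{t}+\sqrt{1-\rho^{2}}\,\eta_{t})$ with $\beta_{j}=b_{0}(1+j)^{-b}$ . Several choices here are assumptions for illustration only: the noises $\varepsilon_{t},\eta_{t}$ are taken as i.i.d. standard normal, the infinite sum in $Y_{t}$ is cut after `n_lags` terms, the process starts from $H_{0}=0$ , and the function names are hypothetical.

```python
import math
import random


def simulate_ar1(n, mu, alpha, sigma, seed=0):
    """Simulate H_t = mu + alpha * H_{t-1} + sigma * eps_t, starting from H_0 = 0."""
    rng = random.Random(seed)
    h_prev, path = 0.0, []
    for _ in range(n):
        h_prev = mu + alpha * h_prev + sigma * rng.gauss(0.0, 1.0)
        path.append(h_prev)
    return path


def simulate_dgsv(n, mu, alpha, sigma, rho, b0, b, n_lags=200, seed=0):
    """Simulate the DGSV recursion with Y_t truncated to its first n_lags terms."""
    rng = random.Random(seed)
    beta = [b0 * (1.0 + j) ** (-b) for j in range(n_lags)]
    # eps_hist[j] holds eps_{t-j}; it is pre-filled so that Y_t is defined at t = 1.
    eps_hist = [rng.gauss(0.0, 1.0) for _ in range(n_lags)]
    h_prev, path = 0.0, []
    for _ in range(n):
        eps_hist.insert(0, rng.gauss(0.0, 1.0))  # new eps_t enters at lag 0
        eps_hist.pop()                           # oldest shock leaves the window
        eta_t = rng.gauss(0.0, 1.0)
        y_t = sum(b_j * e for b_j, e in zip(beta, eps_hist))
        shock = rho * eps_hist[0] + math.sqrt(1.0 - rho * rho) * eta_t
        h_prev = mu + alpha * h_prev + sigma * math.exp(y_t) * shock
        path.append(h_prev)
    return path
```

Setting `sigma=0` removes all randomness, so both simulators reduce to the deterministic recursion $H_{t}=\mu+\alpha H_{t-1}$ , which is a convenient sanity check.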

2.3 Threshold strategy

The log-optimal strategy in (2.3) can only be calculated if the parameters of the stock price dynamics are known. The exact form of the strategy is unknown; one needs to use numerical integration to obtain the optimal decision at every time step $t$ .

In Section 3 of [15] an approximate log-optimal strategy was proposed. The authors showed that it performs well on realistic data, though they did not give a mathematical estimate of the error. This approximate strategy reduces the space of possible decisions from $\pi_{t}\in[0,1]$ to two states, $\pi_{t}\in\{0,1\}$ . With realistic log-return data this restriction does not result in a considerable loss, and the restricted strategy can be used with learning algorithms while the log-optimal solution cannot.

The idea can be used for any parametric dynamics if the conditional expectation can be calculated. The approximate strategy proposed in [15] is

$$\pi_{t}^{lin}=\begin{cases}1,&\text{if }\mathbb{E}[H_{t}\mid\mathcal{F}_{t-1}]>0,\\0,&\text{otherwise},\end{cases}\tag{2.8}$$

which is a consequence of the requirement in (2.3) with a first-order Taylor expansion. That is, the investor should hold only the risky asset if the conditional expected value of its log-return is higher than 0. This strategy belongs to the class of threshold strategies.

We remark that we are working in a zero interest rate environment. Without this assumption the trading rule changes to: buy whenever the conditional expectation is higher than the interest rate.

Because of the structure of the strategy, we call it a threshold strategy here. We drop the superscript $lin$ since we only investigate this type of strategy with the Stochastic Gradient method.

In most parametric models the conditional expectation can be calculated; therefore we end up with a function of past data that we call the threshold function: $f(\text{past data}):=\mathbb{E}[H_{t}|\text{past data}]$ . An equivalent form of (2.8) using the threshold function is

$$\pi_{t}=\mathbbm{1}_{\{f(\text{past data})>0\}},$$

where the function $\mathbbm{1}_{\{x>0\}}$ is 1 if $x>0$ and 0 otherwise. The conditional expectation of the growth (2.3) that we want to optimize, written with the indicator function, is

$$\tilde{g}=\mathbb{E}\left[\log\left(1-\mathbbm{1}_{\{f(\text{past data})>0\}}+\mathbbm{1}_{\{f(\text{past data})>0\}}e^{H_{t}}\right)\mid\text{past data}\right].$$

This function is still a random variable because it is a function of past data.

In the following subsections we work through some ways to handle "past data", but of course it is the investor’s task to decide which past values to use. Proposition 2.1 gives guidance on what to take into consideration when choosing values from past data.

With the Stochastic Gradient method we are able to optimize an expected value with respect to some parameters. Therefore in the following we optimize the expected value of $\tilde{g}$ . If we parametrize the conditional growth by $\theta$ , which is a real number or a real vector, then the optimization task is to find the maximum of the growth

$$g(\theta):=\mathbb{E}[\tilde{g}(X_{t-1},\theta)],\tag{2.11}$$

where $\tilde{g}(X_{t-1},\theta)$ is a parametrized version of (2.3).

2.4 Markovian strategy

Let us assume the investor uses only one value that is available before investing at time $t$ and call this variable $X_{t-1}$ . It can be past stock returns or an index or something more complex, for example the weighted average of the past returns. A natural choice can be the previous value of the return, that is $H_{t-1}$ and we stick to this simple case here.

The conditional growth in (2.3) with a Markovian strategy is

$$\mathbb{E}[\log(1-\pi_{t}+\pi_{t}e^{H_{t}})\mid X_{t-1}].$$

We need to parametrize the threshold function in the strategy to be able to use it with the Stochastic Gradient method. A convenient choice is the linear function; in this paper we do not relax this restriction, but we mention that $X_{t-1}$ can itself be a function of $H_{t-1}$ . The resulting strategy is

$$\pi_{t}:=\mathbbm{1}_{\{X_{t-1}>\theta\}}.\tag{2.13}$$

The growth (2.11) to be optimized is

$$g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}].$$

In Section 3 we will optimize this function with the Kiefer–Wolfowitz method.
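The step from the logarithmic growth to the product form $g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}]$ is not spelled out in the text. A short worked check, relying only on the fact that the threshold strategy takes values in $\{0,1\}$ and on the tower property of conditional expectation, could read:

```latex
\begin{aligned}
\pi_t = 0 &: \quad \log\bigl(1 - 0 + 0\cdot e^{H_t}\bigr) = \log 1 = 0 = \pi_t H_t,\\
\pi_t = 1 &: \quad \log\bigl(1 - 1 + e^{H_t}\bigr) = H_t = \pi_t H_t,\\[4pt]
\Longrightarrow\quad g(\theta)
  &= \mathbb{E}\bigl[\mathbb{E}[\pi_t H_t \mid X_{t-1}]\bigr]
   = \mathbb{E}\bigl[H_t\,\mathbbm{1}_{\{X_{t-1}>\theta\}}\bigr].
\end{aligned}
```

In both cases the logarithm equals $\pi_{t}H_{t}$ exactly, so no approximation is involved once the strategy is restricted to two states.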

The proposition below gives the optimal threshold ( $\theta^{*}$ ) of the Markovian strategy.

Proposition 2.1.

Let us assume that the differentiable function $\phi(x):=\mathbb{E}[H_{t}|X_{t-1}=x]$ has only one root and that $\phi(x)>0$ if $x>0$ . Moreover, let us assume that the return process is stationary. Then the root of $\phi(x)$ is the unique optimal threshold:

$$\theta^{*}=\{x\mid\phi(x)=0\}.$$

Proof.

For the sake of simplicity assume that $X_{t-1}$ has a pdf. The expected growth is

$$g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}]=\mathbb{E}\left[\mathbb{E}[H_{t}\mid X_{t-1}]\mathbbm{1}_{\{X_{t-1}>\theta\}}\right].$$

Since $\mathbb{E}[H_{t}|X_{t-1}]$ is a function of $X_{t-1}$ , call it $v(X_{t-1})$ , and denote the pdf of $X_{t-1}$ by $f_{X}(x)$ . The expected value is then

$$\int_{\theta}^{\infty}v(y)f_{X}(y)\,dy.$$

Differentiating with respect to $\theta$ , the integral has an optimum at the lower limit $y=\theta$ where

$$-v(y)f_{X}(y)=0.$$

Since $f_{X}(y)$ is non-negative, the optimal threshold is where $v(y)=0$ , which concludes our statement.

∎

Remark 1.

The main message of the proposition is that only information which is not mean-independent [16] of the price process can be used in the optimization algorithm. Mean independence is a well-known concept in econometrics; it is a stronger property than uncorrelatedness but weaker than stochastic independence.

Remark 2.

A conclusion of Proposition 2.1 is that the linear approximative strategy is log-optimal if the strategy can only be 0 or 1. This is only true in the univariate case.

For a simple example, let us model the log-return as an autoregressive process and let us use the previous log-return value as "past data".

Example 2.1 (AR(1)).

Let $H_{t}$ be defined as in (2.6). The conditional expectation is

$$\mathbb{E}[H_{t}\mid H_{t-1}=x]=\mu+\alpha x.$$

Its root, that is, the optimal threshold, is

$$\theta^{*}=-\frac{\mu}{\alpha}.$$

When $\alpha<0$ , the assumption in Proposition 2.1 that $\phi(x)>0$ if $x>0$ is false, but optimality still holds if we change the inequality sign in (2.13) to $\pi_{t}:=\mathbbm{1}_{\{X_{t-1}<\theta\}}$ .

(We remind the reader that the expected log-return is different: $\mu/(1-\alpha)$ .)

As we can see from the example, to determine the threshold we either need to estimate $\mu$ and $\alpha$ from a long enough sample or we learn the value of $\theta^{*}$ by using the Stochastic Gradient method. In more realistic dynamics there are more than two parameters that need to be estimated. Furthermore, the threshold is very sensitive to the estimation error of $\alpha$ .
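The AR(1) threshold $\theta^{*}=-\mu/\alpha$ can be checked against a closed form of $g(\theta)=\mathbb{E}[H_{t}\mathbbm{1}_{\{H_{t-1}>\theta\}}]$ . The sketch below assumes, beyond what the text states, that the noise is standard Gaussian and that $|\alpha|<1$ , so that $H_{t-1}$ is stationary normal with mean $m=\mu/(1-\alpha)$ and standard deviation $s=\sigma/\sqrt{1-\alpha^{2}}$ ; then $g(\theta)=m\,P(H_{t-1}>\theta)+\alpha s\,\varphi((\theta-m)/s)$ with $\varphi$ the standard normal density. The function name and the parameter values are hypothetical.

```python
import math


def ar1_growth(theta, mu, alpha, sigma):
    """g(theta) = E[H_t * 1{H_{t-1} > theta}] for a stationary Gaussian AR(1)."""
    m = mu / (1.0 - alpha)                      # stationary mean of H_{t-1}
    s = sigma / math.sqrt(1.0 - alpha * alpha)  # stationary standard deviation
    z = (theta - m) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(H_{t-1} > theta)
    # E[(mu + alpha * X) 1{X > theta}] with X ~ N(m, s^2), using mu + alpha * m = m.
    return m * tail + alpha * s * pdf


mu, alpha, sigma = 0.1, 0.5, 1.0  # hypothetical parameters with alpha > 0
grid = [-3.0 + 0.01 * k for k in range(601)]
best = max(grid, key=lambda th: ar1_growth(th, mu, alpha, sigma))
print(best, -mu / alpha)  # grid maximizer next to the theoretical threshold
```

The same function also reproduces the limiting behavior of the growth: for very low thresholds it approaches the expected log-return $\mu/(1-\alpha)$ , and for very high thresholds it approaches zero.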

Example 2.2 (DGSV).

Let the log-return $H_{t}$ be a DGSV process according to (2.5). Its conditional expectation is

$$\mathbb{E}[H_{t}\mid H_{t-1}=x]=\mu+\alpha x+\sigma\rho\,\mathbb{E}[e^{Y_{t}}\mid H_{t-1}].$$

This conditional expectation has no known closed form, but we will see later in the numerical results that there is a unique solution.

2.5 Non-Markovian strategy - multivariate case

If the investor would rather use more information, for example to handle long memory or to include information about volatility, this is also possible. We show two possible choices: one strategy uses multiple past returns, the other uses volatility information in addition. The strategies are

$$\pi_{t}=\mathbbm{1}_{\{H_{t-1}+\theta^{2}H_{t-2}+\theta^{3}H_{t-3}+\dots>\theta^{1}\}}\quad\text{or}\tag{2.16a}$$

$$\pi_{t}=\mathbbm{1}_{\{H_{t-1}+\theta^{2}e^{\nu_{t-1}}>\theta^{1}\}},\tag{2.16b}$$

where $\theta^{1},\theta^{2},\dots$ are the parameters we wish to optimize and $\nu_{t-1}$ is an estimate of the logarithm of the volatility based on the information in $\mathcal{F}_{t-1}$ (that is, $\nu_{t-1}:=\mathbb{E}[Y_{t}|\mathcal{F}_{t-1}]$ ). The design of the second strategy with the log-volatility may seem peculiar, but the linear approximation of the log-optimal strategy in [15] has been shown to be a linear function of $H_{t-1}$ and $\nu_{t-1}$ .
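The two multivariate rules can be written as small decision functions. This is only a sketch: the function names, the list layout (`past_returns[0]` is $H_{t-1}$ , `theta[0]` is the threshold $\theta^{1}$ ) and the numeric inputs are assumptions for illustration, and the log-volatility estimate $\nu_{t-1}$ is taken as given.

```python
import math


def position_multi_lag(past_returns, theta):
    """Invest (1) iff H_{t-1} + theta^2 * H_{t-2} + theta^3 * H_{t-3} + ... > theta^1.

    past_returns = [H_{t-1}, H_{t-2}, ...]; theta = [theta^1, theta^2, ...].
    """
    signal = past_returns[0] + sum(
        weight * h for weight, h in zip(theta[1:], past_returns[1:])
    )
    return 1 if signal > theta[0] else 0


def position_with_volatility(h_prev, nu_prev, theta1, theta2):
    """Invest (1) iff H_{t-1} + theta^2 * exp(nu_{t-1}) > theta^1."""
    return 1 if h_prev + theta2 * math.exp(nu_prev) > theta1 else 0


print(position_multi_lag([0.02, -0.01, 0.03], [0.0, 0.5, 0.5]))
print(position_with_volatility(0.0, 0.0, 0.5, 1.0))
```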

An important aspect of the strategy with volatility is that it is able to capture the leverage effect. As we noted in Remark 1, only those processes should be used in the threshold function which are not mean-independent of the log-return. The leverage effect is defined in several ways; in any case it is a connection between the stock price change and past volatility (i.e. in our case between $H_{t}$ and $\nu_{t-1}$ ). Noises in the price that have no leverage effect, for example the noise term $\eta_{t}$ in (2.5), bring no advantage to the investment.

The leverage effect has a prominent role, since it is the only way to utilize volatility, and long memory typically appears in volatility. As has been shown in [13], the long memory is hidden in the volatility and not in the drift part of the process.

In the multivariate case there is no closed form of the optimal $\theta^{i}$ values. Of course, $\partial g/\partial\theta^{i}=0$ must be satisfied. For example, in the two-dimensional version of (2.16a) the optimal $\theta$ ’s must satisfy the equations

$$\partial g/\partial\theta^{1}=\int_{-\infty}^{\infty}v(\theta^{1}-\theta^{2}x,x)f(\theta^{1}-\theta^{2}x,x)\,dx=0,$$

$$\partial g/\partial\theta^{2}=\int_{-\infty}^{\infty}-x\,v(\theta^{1}-\theta^{2}x,x)f(\theta^{1}-\theta^{2}x,x)\,dx=0,$$

where $v(x,y):=\mathbb{E}[H_{t}|H_{t-1}=x,H_{t-2}=y]$ and $f(x,y)$ is the joint pdf of $(H_{t-1},H_{t-2})$ . The equations are more complicated in the DGSV case: if we wish to include the log-volatility $\nu_{t-1}$ , then we need to replace the variable $x\rightarrow\exp(x)$ and reinterpret the pdf and the conditional mean (by using $e^{\nu_{t-1}}$ instead of $H_{t-2}$ ). These are unknown functions in general, and we could only estimate the pdf and the conditional expectation from data, which is contrary to our goals.

It does not mean that the Kiefer–Wolfowitz algorithm cannot converge to the optimal $\theta$ ’s, only that we cannot calculate their optimal values in advance. If the dynamics are known, then the Monte Carlo method can be used to estimate the optimal value. This is what we use in the numerical simulations.

Here we would like to show the basics of how to use the Kiefer–Wolfowitz algorithm for investment purposes. Other processes could also be used.

3 Kiefer–Wolfowitz algorithm

With the Kiefer–Wolfowitz optimization procedure we are searching for the maximum of (2.11).

Univariate case:

The task is to find the optimal threshold $\theta^{*}\in\mathbb{R}$ :

$$\text{maximize}_{\theta}\;g(\theta):=\mathbb{E}[H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}],$$

where the random processes $H_{t}$ and $X_{t-1}$ are both univariate. Let us denote the growth at time $t$ by $G(\theta;H_{t},X_{t-1}):=H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}$ . The Stochastic Gradient algorithm uses the finite differences of the growth:

$$\theta_{t+1}=\theta_{t}+a_{t}\frac{G(\theta_{t}-c_{t};H_{t},X_{t-1})-G(\theta_{t}+c_{t};H_{t},X_{t-1})}{c_{t}},$$

where the step size $a_{t}$ and the step size of the finite difference $c_{t}$ are real-valued sequences. The fraction is the approximation of the gradient.

Since the growth $G(\theta;\dots)$ depends on $\theta$ only through an indicator function, its finite difference reduces to the indicator of a range. For greater clarity we denote the range $[x-c,x+c]$ as $[x\pm c]$ . Then the algorithm can be written as

$$\theta_{t+1}=\theta_{t}+a_{t}\frac{H_{t}\mathbbm{1}_{\{X_{t-1}\in[\theta_{t}\pm c_{t}]\}}}{c_{t}}.\tag{3.3}$$

This formalism will help us later to better understand how the method is used.
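A single update can be written in both forms, the finite difference of $G(\theta;H_{t},X_{t-1})=H_{t}\mathbbm{1}_{\{X_{t-1}>\theta\}}$ and the window form, to see that they coincide. The sketch reproduces the signs exactly as written in the text and does not examine the direction of the resulting drift; the function names are hypothetical, and the half-open window is an implementation choice that differs from the closed interval $[\theta_{t}\pm c_{t}]$ only at a single boundary point.

```python
def growth(theta, h_t, x_prev):
    """One-step growth G(theta; H_t, X_{t-1}) = H_t * 1{X_{t-1} > theta}."""
    return h_t if x_prev > theta else 0.0


def kw_step_finite_difference(theta, h_t, x_prev, a_t, c_t):
    """Update via the finite difference G(theta - c_t) - G(theta + c_t)."""
    diff = growth(theta - c_t, h_t, x_prev) - growth(theta + c_t, h_t, x_prev)
    return theta + a_t * diff / c_t


def kw_step_window(theta, h_t, x_prev, a_t, c_t):
    """Update via the window form: theta moves only if X_{t-1} lies near theta."""
    # Half-open window (theta - c_t, theta + c_t], chosen so that both forms
    # agree exactly, including at the boundary points.
    in_window = theta - c_t < x_prev <= theta + c_t
    return theta + (a_t * h_t / c_t if in_window else 0.0)


for x_prev in (0.1, 1.0, -0.25, 0.25):
    print(
        x_prev,
        kw_step_finite_difference(0.0, 0.5, x_prev, 0.1, 0.25),
        kw_step_window(0.0, 0.5, x_prev, 0.1, 0.25),
    )
```

Observations outside the window leave $\theta_{t}$ unchanged, which is why the scale of $c_{t}$ relative to the data matters for the speed of learning.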

This is impossible to prove in general, but through some examples in Section 4 we show numerically that this recursive update converges to the optimum identified in the previous section:

[TABLE]

The convergence is in $L^{2}$ , i.e. we can show the convergence of the Mean Squared Error (MSE). When the algorithm converges, the decay of the error typically follows a power law.

In general, there is no straightforward way to choose the hyperparameters. In Section 4 we show some ideas on which basis we can choose the hyperparameters.

In real-life investment the financial environment is not static: the dynamics of prices can change and new factors can appear or disappear, so the optimal strategy changes as well. For this reason, in practice investors use constant and very small step sizes $a_{t}$ and $c_{t}$ , which are able to track the changes of the optimal values. In this paper we do not focus on changes of the market.

Multivariate case:

The algorithm works in the same way: each dimension of the parameter is updated separately, with no cross-effect. For example, in the case of known log-volatility (2.16b) the growth is ${G(\theta^{1},\theta^{2};H_{t},H_{t-1},\nu_{t-1})}$

[TABLE]

4 Numerical Results

The critical part of every algorithm is the choice of the hyperparameters. In their paper, J. Kiefer and J. Wolfowitz [6] also address the issue of parameter choice, though they were able to give exact sufficient conditions in a simpler context. These conditions are typical requirements and our model satisfies them as well:

1. $c_{t}\rightarrow 0$ .

2. $\sum_{t=1}^{\infty}a_{t}=\infty$ , that is, the algorithm can reach any state.

3. $\sum_{t=1}^{\infty}a_{t}c_{t}<\infty$ .

4. $\sum_{t=1}^{\infty}a_{t}^{2}c_{t}^{-2}<\infty$ .

A usual first-guess choice is $a_{t}=t^{-1}$ and $c_{t}=t^{-1/3}$ .
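For power-law step sizes the four conditions reduce to inequalities on the exponents. The sketch below assumes the parametrization $a_{t}=t^{-p}$ , $c_{t}=t^{-q}$ (an assumption made for this example, covering the first-guess choice $p=1$ , $q=1/3$ ) and applies the p-series test, $\sum_{t}t^{-s}<\infty$ exactly when $s>1$ . The function name is hypothetical.

```python
def kw_conditions_hold(p, q):
    """Check the four step-size conditions for a_t = t**(-p) and c_t = t**(-q)."""
    c_vanishes = q > 0                     # 1. c_t -> 0
    a_sum_diverges = p <= 1                # 2. sum of a_t is infinite
    ac_sum_converges = p + q > 1           # 3. sum of a_t * c_t is finite
    ratio_sum_converges = 2 * (p - q) > 1  # 4. sum of a_t**2 / c_t**2 is finite
    return (
        c_vanishes
        and a_sum_diverges
        and ac_sum_converges
        and ratio_sum_converges
    )


print(kw_conditions_hold(1.0, 1.0 / 3.0))  # first-guess choice from the text
print(kw_conditions_hold(1.0, 0.6))        # c_t shrinks too fast for condition 4
```

For $p=1$ , $q=1/3$ both convergent sums behave like $\sum_{t}t^{-4/3}$ , so the first-guess choice satisfies all four requirements.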

Analyzing the growth function $g(\theta)$ in the univariate case helps us to construct the step sizes in a suitable way. Figure 1 and (3.3) make it clear that $\theta_{t}$ must stay in the same range as $X_{t-1}$ , since $X_{t-1}\not\in[\theta_{t}\pm c_{t}]$ for all $t\in\mathbb{N}$ would result in a constant $\theta_{t}$ . In the numerical simulations we only show results for the case $X_{t-1}:=H_{t-1}$ . From the two examples we can make the following remarks:

• $g(\theta\rightarrow-\infty)=\mathbb{E}[H_{t}]$ : a low $\theta$ means that $\pi_{t}=1$ , that is, the wealth follows the price of the stock.

• $g(\theta\rightarrow\infty)=0$ : a high $\theta$ means that $\pi_{t}=0$ , that is, the wealth follows the price of the bond.

• In the simple case when $H_{t}$ is an autoregressive process, and also when it follows the more complex, realistic DGSV dynamics, there is a unique $\theta^{*}$ that can be calculated.

• If $\theta_{t}$ leaves the typical range of $H_{t}$ and reaches a region where the derivative of $g(\theta)$ is zero, then the algorithm has no chance to return and it stays there.

To overcome the problem in the last remark we make some modifications to the algorithm. First, the initial value $\theta_{0}$ must be estimated from a small sample of $H_{t}$ . In every realization we used 10 data points to initialize $\theta_{0}:=\sum_{t=1}^{10}H_{t}/10$ . This very small sample is already enough for the algorithm to start from a relatively good point. Second, we cannot let the algorithm take arbitrarily large steps. A general solution for this is to project $\theta_{t}$ onto a subset. In our case we truncate to the known range of $H_{t}$ :

[TABLE]

where $\tilde{\theta}:=\theta_{t-1}+a_{t}H_{t}\mathbbm{1}_{\{H_{t-1}\in[\theta_{t-1}\pm c_{t-1}]\}}/c_{t-1}$ .
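The two modifications, initializing from a short sample and truncating the tentative update $\tilde{\theta}$ , can be sketched as follows. The exact truncation bounds are not recoverable from the text here, so the interval `[lower, upper]` is a hypothetical stand-in for "the known range of $H_{t}$ "; the function names are also assumptions.

```python
def initial_threshold(sample):
    """theta_0 as the average of a short initial sample of log-returns."""
    return sum(sample) / len(sample)


def truncated_step(theta_prev, h_t, h_prev, a_t, c_prev, lower, upper):
    """Tentative update theta_tilde, then clipping to the assumed range [lower, upper]."""
    in_window = theta_prev - c_prev <= h_prev <= theta_prev + c_prev
    theta_tilde = theta_prev + (a_t * h_t / c_prev if in_window else 0.0)
    # Projection onto the interval keeps theta inside the region where data occur.
    return min(max(theta_tilde, lower), upper)


theta0 = initial_threshold([0.01, -0.02, 0.03, 0.00, 0.01, -0.01, 0.02, 0.00, -0.03, 0.01])
print(theta0)
print(truncated_step(0.0, 2.0, 0.1, 1.0, 0.5, -1.0, 1.0))  # large step is clipped to upper
```

Clipping prevents a single large early step from moving the threshold into a flat region of the growth function, from which the recursion could not return.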

Using the simple parametrization $a_{t}=t^{-1}$ and $c_{t}=t^{-1/3}$ can work in a simple setting. Figure 2 shows how this simple choice of the hyperparameters works. The simulations were executed with $N=25$ realizations and for $T=50\,000$ time steps. The Mean Squared Error (MSE) is an approximation of the $L^{2}$ error. The log-log plot of the error shows that the MSE decays as a power law in both cases.

The requirement that $H_{t-1}$ must stay in the range $[\theta_{t}\pm c_{t}]$ for a significant part of the time fails if we rescale the process. This problem can be handled by scaling the steps of the algorithm accordingly. Since the problem lies in the indicator $H_{t-1}\in[\theta_{t}\pm c_{t}]$ , the steps $c_{t}$ have to reflect the scale of the process ( $H_{t-1}$ and $\theta_{t}$ are on the same scale). If we rescale the $c_{t}$ variable, then we need to compensate the $a_{t}/c_{t}$ term as well. Therefore the steps are the following:

[TABLE]

where $K$ equals the standard deviation of $H_{t}$ . It could be an estimate of the standard deviation, but for simplicity we used the whole dataset to estimate it.

The performance of using the scaling factor $K$ on both $a_{t}$ and $c_{t}$ is shown in Table 1 and in the accompanying figure. The table shows the Mean Squared Error at $t=T=100\,000$ , while the figure shows the function $t\rightarrow MSE_{t}$ with different scalings. The parameter settings of the table and the figure are defined below in (4.2) and (4.3). In Dataset-2, where $\alpha$ is smaller, the Mean Squared Error is higher even though the variance of the process is higher (in the AR(1) case the variance is $\sigma^{2}/(1-\alpha^{2})$ ). This is because the lower the $\alpha$ , the less information we have and the more difficult it is to learn. Figures 3 and 4 show that without scaling the algorithm at first waits until $c_{t}$ reaches a suitable size, while scaling speeds this up and the algorithm uses appropriate $c_{t}$ ’s from the start. The numerical results also show that the best way to scale the process is to use five times the standard deviation of the process.

Dataset 1:

[TABLE]

Dataset 2:

[TABLE]

(For the AR(1) model only $\mu,\alpha,\sigma$ are relevant.)

Funding

The first author gratefully acknowledges the support of Új Nemzeti Kiválóság Program 2018/2019 of the Ministry of Human Capacities (project number: ÚNKP-18-3-IV-PPKE-21). The second author acknowledges support from the "Lendület" grant LP 2015-16 of the Hungarian Academy of Sciences (Lendület grant LM 2015-16) and from the NKFIH (National Research, Development and Innovation Office, Hungary) grant KH 126505.

Bibliography

[1] Kendall Kim. Electronic and algorithmic trading technology: the complete guide. Academic Press, 2010.
[2] Irene Aldridge. High-frequency trading: a practical guide to algorithmic strategies and trading systems, volume 604. John Wiley & Sons, 2013.
[3] Bin Li and Steven CH Hoi. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3):35, 2014.
[4] Marcos Lopez De Prado. Advances in financial machine learning. John Wiley & Sons, 2018.
[5] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[6] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[7] Zhenhua Zhang, G Yin, and Zhian Liang. A stochastic approximation algorithm for American lookback put options. Stochastic Analysis and Applications, 29(2):332–351, 2011.
[8] G Yin, Qing Zhang, F Liu, RH Liu, and Y Cheng. Stock liquidation via stochastic approximation using NASDAQ daily and intra-day data. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 16(1):217–236, 2006.
