A Deep Reinforcement Learning Trader without Offline Training

Boian Lazov

arXiv:2303.00356·q-fin.CP·September 30, 2025


TL;DR

This paper presents a fully online trading algorithm using Double Deep Q-learning with Fast Learning Networks, capable of adapting to market conditions without offline training, and demonstrates its effectiveness on cryptocurrency data.

Contribution

The paper introduces a novel online trading method that does not require offline training, using a specific reinforcement learning setup with a profit conservation mechanism.

Findings

01 The algorithm outperforms random trading strategies.

02 It adapts well to different market trends.

03 It performs effectively on real cryptocurrency data.


Tables

Table 1: Sample means, medians and standard deviations of $twth$ when taking random or non-random actions and of $sav$ for the full dataset. The probability of finishing a run with $twth \leq 100$ is also included.

	Mean [USDT]	Median [USDT]	St. Dev. [USDT]	$P(twth \leq 100)$
Non-random $twth$	263.928	241.903	113.334	0.031
Random $twth$	189.703	153.028	121.777	0.226
$sav$	68.192	64.020	32.131	$-$

Table 2: Descriptive statistics of $twth$ when taking random or non-random actions and of $sav$ for the “bearish” dataset.

	Mean [USDT]	Median [USDT]	St. Dev. [USDT]	$P(twth \leq 100)$
Non-random $twth$	78.999	77.538	16.689	0.884
Random $twth$	76.139	74.905	15.169	0.926
$sav$	3.060	1.982	3.324	$-$

Table 3: Descriptive statistics for the “bullish” dataset.

	Mean [USDT]	Median [USDT]	St. Dev. [USDT]	$P(twth \leq 100)$
Non-random $twth$	152.267	150.601	25.372	0.007
Random $twth$	145.245	142.256	28.561	0.027
$sav$	10.169	9.758	4.755	$-$

Table 4: Descriptive statistics for the “mixed” dataset.

	Mean [USDT]	Median [USDT]	St. Dev. [USDT]	$P(twth \leq 100)$
Non-random $twth$	104.338	100.545	26.771	0.491
Random $twth$	93.202	88.569	26.994	0.664
$sav$	11.182	10.610	6.039	$-$

Equations

$$nmd_{i}^{(1)}, \quad nmd_{k}^{(2)}, \quad nmd_{l}^{(3)}, \quad nmd^{(4)}$$

$$feat = \left(\ldots, \frac{vol_{5} - cav}{cav}, rsi, nmd_{1}^{(1)}, \ldots, nmd_{4}^{(1)}, nmd_{1}^{(2)}, \ldots, nmd_{3}^{(2)}, \ldots, nmd^{(4)}, mlim\right)^{T}.$$

$$wth = mon + pr_{5}\, cns.$$

$$rew = (wth_{t+1} - wth_{t}) - \left(\frac{wth_{t+1} - wth_{t}}{2}\right)^{2}.$$

$$rew = (wth_{t+1} - wth_{t}) - \left(\frac{wth_{t+1} - wth_{t}}{2}\right)^{2} - 0.1.$$

$$mon_{t+1} > mlim,$$

$$\begin{aligned} &mon_{t+1} < mlim; \\ &wth_{t+1} < mlimn; \\ &Q(s_{t}, a_{t}) > 0; \\ &rsi_{t+1} > 70. \end{aligned}$$

$$\begin{aligned} &mon_{t+1} < mlim; \\ &wth_{t+1} \geq mlimn; \\ &Q(s_{t}, a_{t}) < 0; \\ &rsi_{t+1} < 30. \end{aligned}$$

$$\varepsilon = \frac{1}{\ln(5 i_{\varepsilon} + 2)}.$$

$$\alpha = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})\left(1 + \cos\left(\frac{i_{\alpha}}{T_{\alpha}}\pi\right)\right),$$

$$X = \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}, \quad Y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{pmatrix},$$

$$g(Z) = \begin{pmatrix} g(z_{1}) \\ g(z_{2}) \\ \vdots \\ g(z_{r}) \end{pmatrix}.$$

$$W^{\mathrm{oi}} = \begin{pmatrix} w_{11}^{\mathrm{oi}} & \cdots & w_{1n}^{\mathrm{oi}} \\ \vdots & \ddots & \vdots \\ w_{m1}^{\mathrm{oi}} & \cdots & w_{mn}^{\mathrm{oi}} \end{pmatrix}, \quad W^{\mathrm{hi}} = \begin{pmatrix} w_{11}^{\mathrm{hi}} & \cdots & w_{1n}^{\mathrm{hi}} \\ \vdots & \ddots & \vdots \\ w_{r1}^{\mathrm{hi}} & \cdots & w_{rn}^{\mathrm{hi}} \end{pmatrix}, \quad W^{\mathrm{oh}} = \begin{pmatrix} w_{11}^{\mathrm{oh}} & \cdots & w_{1r}^{\mathrm{oh}} \\ \vdots & \ddots & \vdots \\ w_{m1}^{\mathrm{oh}} & \cdots & w_{mr}^{\mathrm{oh}} \end{pmatrix}.$$

$$y_{k} = \sum_{s=1}^{n} w_{ks}^{\mathrm{oi}} x_{s} + \sum_{l=1}^{r} w_{kl}^{\mathrm{oh}}\, g\left(\sum_{t=1}^{n} w_{lt}^{\mathrm{hi}} x_{t}\right).$$

$$Y = W^{\mathrm{oi}} X + W^{\mathrm{oh}} g(W^{\mathrm{hi}} X).$$

$$(w_{k1}^{\mathrm{oi}}, w_{k2}^{\mathrm{oi}}, \ldots, w_{kn}^{\mathrm{oi}}, w_{k1}^{\mathrm{oh}}, w_{k2}^{\mathrm{oh}}, \ldots, w_{kr}^{\mathrm{oh}})^{T}$$

$$X = feat, \quad Y = \begin{pmatrix} Q(s, a_{1}) \\ Q(s, a_{2}) \\ \vdots \\ Q(s, a_{19}) \end{pmatrix}.$$

$$g(z) = \frac{1}{1 + e^{-z}},$$

$$twth = wth + sav + res,$$


Full text

A Deep Reinforcement Learning Trader without Offline Training

Boian Lazov, Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, 1164 Sofia, Bulgaria

Abstract

In this paper we pursue the question of a fully online trading algorithm (i.e. one that does not need offline training on previously gathered data). For this task we use Double Deep $Q$-learning in the episodic setting with Fast Learning Networks approximating the expected reward $Q$. Additionally, we define the possible terminal states of an episode in such a way as to introduce a mechanism to conserve some of the money in the trading pool when market conditions are seen as unfavourable. Some of this money is taken as profit and some is reused at a later time according to certain criteria. After describing the algorithm, we test it using the 1-minute-tick data for Cardano’s price on Binance. We see that the agent performs better than trading with randomly chosen actions on each timestep, and it does so when tested on the whole dataset as well as on different subsets, capturing different market trends.

1 Introduction

In recent years algorithmic trading on financial markets has been increasingly replacing humans [1]. One can find numerous estimates for the market share of automated traders, with some sources giving over $73\%$ for US equity trading [1], while others cite as high as $92\%$ for forex trading.$^{1}$ Regardless of the actual figures, intelligent automation is increasingly used in our world and promises to be applicable in some very complex domains, where analytic solutions are either not known or very hard to obtain.

$^{1}$ See for example https://www.quantifiedstrategies.com/what-percentage-of-trading-is-algorithmic/. Such sources, however, do not seem that reliable, since data is generally not openly available. Nevertheless, there are many (paid) reports that give a general idea of the scope of automated trading: https://www.grandviewresearch.com/industry-analysis/algorithmic-trading-market-report, https://www.mordorintelligence.com/industry-reports/algorithmic-trading-market, https://www.alliedmarketresearch.com/algorithmic-trading-market-A08567.

There are many possible approaches to developing a trading algorithm, but recently one direction of research has been receiving much attention, namely machine learning based approaches. In particular, Reinforcement Learning (RL) has been a really promising way to solve some very difficult problems in other areas (like learning to play various games like Go [2] and StarCraft II [3] at the expert level) and is now being adapted to make decisions and execute trades in the trading setting. This area of research is very active and fairly new (see for example [4]).

As promising as it is, RL suffers from one big problem, namely the generalisation one (as does all of machine learning in fact). More specifically, once the agent (or neural network) is trained on a set of data and good performance is achieved, it is generally hard to translate this training to a new dataset and keep the performance. Furthermore, the training set usually needs to be very large and the agent needs to replay it many times. This is obviously not a good situation for a trading algorithm, since the market is considered a stochastic system and as such it changes rapidly and continuously. If we hope to be able to predict its movement, it should mostly be short term. For this the trader should be able to adapt quickly to current information.

There are many proposed ways to try to deal with said problem of generalisation (both in supervised learning and RL), but one that seems both promising and simple is the idea to learn only the output weights of a neural network. It is implemented in particular in two algorithms, Extreme Learning Machine (ELM) [5, 6] and Fast Learning Network (FLN) [7], and one can borrow the structure of the FLN to use as an approximator to the $Q$-function. As we will see later, this is successful: we obtain an RL agent that performs better than random even when the market’s overall trend is downward.

This paper is organised as follows: in section 2 we describe the algorithm in detail, namely all of the components of the RL (section 2.1) as well as the neural network (section 2.2); in section 3 we discuss briefly how to implement the algorithm and then test it on historical market data against an algorithm that takes random actions on each timestep; finally, we end with some concluding remarks (section 4).

2 The algorithm

Our goal is to build a simple trading algorithm, which uses a pool of money to trade for some asset, by observing the state of the market. More precisely our agent needs to collect some information about the market, then open a position (ideally, executing a trade). Then it needs to wait for more information to calculate whether the trade was a good one and the process repeats.

We will build our trader in the framework of RL. More precisely we will use a double $Q$-learning algorithm with approximation [8]. It is generally thought that combining $Q$-learning with approximation should be avoided, because of instabilities, but there are examples of successfully using such algorithms [9, 10]. To approximate the $Q$-functions we will use the structure of an FLN [7]. On top of that we will also propose a savings mechanism to deal with “bearish” markets. The idea is to take money out of the trading pool, so that part of it is never used again and can be taken as profit, while another part is returned to the trading pool when the market conditions seem favourable. We will now go through all the specific components one by one.

2.1 Double Q-learning

2.1.1 State

We will use a standard double $Q$-learning algorithm. It will be episodic with a continuous state space. As is usual, we will add an index $t$ to variables to denote the current time step and $t+1$ for the next time step. First we will see how to construct the state of the environment. As we will be using approximation of the $Q$-function by a neural network, the state will be defined by a feature vector.

The algorithm is initialised with 3 pools of money, denoted by $mon$, $sav$ and $res$. $mon$ refers to the current pool with which to trade. $sav$ denotes money that is saved and never used again. This gives a convenient way to use the profit without disturbing the operation of the trader, and also safeguards somewhat against big losses. $res$ refers to a pool of money that is stored for later use when some conditions are met. Finally, there are also the assets that the agent will buy, denoted by $cns$. $mon$ and $cns$ will be included in the calculation of the state.

Next, we need a variable, which will be used to determine when to move money to $sav$ and $res$ . We denote this by $mlim$ . We will describe this in more detail later, but briefly $sav$ and $res$ will be increased (and $mon$ decreased), when the value of $mon$ becomes greater than $mlim$ .

To calculate the state, our trader also needs information for the market. It consists of one price, recorded at the beginning of the episode, denoted by $ipr$ , and $5$ consecutive prices, denoted by $pr_{1}$ , $pr_{2}$ , $pr_{3}$ , $pr_{4}$ and $pr_{5}$ , recorded at some intervals.

The next thing that is needed to calculate the state is the trading volume. More precisely, the trader records the volumes from the intervals immediately preceding the ones with recorded prices. These $5$ volumes are then used to calculate a few averages, that will be included in the feature vector. These are a simple moving average of the previous $100$ values of the volume, denoted by $av$ as well as the average volume of the current state’s data points, denoted by $cav$ . The last recorded volume (associated with $pr_{5}$ ) is also used in the feature vector and it is denoted by $vol_{5}$ .

Finally, we also include in the feature vector the Relative Strength Index (RSI), calculated with the recorded prices (15 prices, as is standard), as well as the relative movements of the price and relative movements of the movements and so on, i.e.

$$nmd_{i}^{(1)}, \quad nmd_{k}^{(2)}, \quad nmd_{l}^{(3)}, \quad nmd^{(4)}.$$

With all of the above the feature vector has the following form:

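$$feat = \left(\ldots, \frac{vol_{5} - cav}{cav}, rsi, nmd_{1}^{(1)}, \ldots, nmd_{4}^{(1)}, nmd_{1}^{(2)}, \ldots, nmd_{3}^{(2)}, \ldots, nmd^{(4)}, mlim\right)^{T}. \tag{2.5}$$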
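As a rough illustration of two of the feature-vector ingredients, the sketch below computes an RSI from the recorded prices and the relative volume $(vol_{5} - cav)/cav$. The paper does not say which RSI variant is used, so the simple-average form here is an assumption, as are the function names `rsi` and `relative_volume`.

```python
def rsi(prices):
    """Simple-average RSI over len(prices) - 1 price changes (assumed variant).

    Needs at least two prices; the text uses 15 prices, i.e. 14 changes.
    """
    changes = [b - a for a, b in zip(prices, prices[1:])]
    gain = sum(c for c in changes if c > 0) / len(changes)
    loss = sum(-c for c in changes if c < 0) / len(changes)
    if loss == 0:
        # No down moves; this sketch also returns 100 for a flat series.
        return 100.0
    return 100.0 - 100.0 / (1.0 + gain / loss)


def relative_volume(volumes):
    """(vol_5 - cav) / cav, with cav the mean of the state's recorded volumes."""
    cav = sum(volumes) / len(volumes)
    return (volumes[-1] - cav) / cav
```

With 15 prices that alternate up and down by the same amount the average gain equals the average loss, so the RSI sits at the midpoint of its range.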

2.1.2 Actions and rewards

Next, we want to define the set of actions that the agent can take. They are simple – buy, sell or hold. We denote the set of actions as $\{a_{1},a_{2},...,a_{19}\}$ . Actions $a_{1}$ to $a_{9}$ denote buying. Action $a_{1}$ means the agent buys coins for $10$ money, action $a_{2}$ – for $20$ and so on. Actions $a_{10}$ to $a_{18}$ denote selling for a fixed amount of money – again at increments of $10$ . Finally, action $a_{19}$ denotes holding. Actions, of course, can fail due to insufficient funds and this will be reflected in the reward.
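A minimal sketch of how the 19 actions could be decoded and applied is given below. The helper names `decode_action` and `apply_action` are hypothetical, and the sketch assumes no trading fee and a single execution price; it reads "selling for a fixed amount of money" as selling coins worth that amount.

```python
def decode_action(k):
    """Map the action index k (1..19) to a (kind, amount) pair."""
    if not 1 <= k <= 19:
        raise ValueError("action index must be in 1..19")
    if k <= 9:
        return ("buy", 10 * k)         # a_1..a_9: buy for 10, 20, ..., 90
    if k <= 18:
        return ("sell", 10 * (k - 9))  # a_10..a_18: sell for 10, 20, ..., 90
    return ("hold", 0)                 # a_19


def apply_action(k, mon, cns, price):
    """Return (mon, cns, ok); ok is False when the action fails for lack of funds."""
    kind, amount = decode_action(k)
    if kind == "buy":
        if amount > mon:
            return mon, cns, False
        return mon - amount, cns + amount / price, True
    if kind == "sell":
        if amount > cns * price:
            return mon, cns, False
        return mon + amount, cns - amount / price, True
    return mon, cns, True
```

The `ok` flag is what would trigger the modified reward for a failed action.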

This leads us to the next ingredient of the RL algorithm – the reward signal. In order to calculate the reward we first need to keep a record of the total wealth $wth$ of the agent, associated with a given state,

$$wth = mon + pr_{5}\, cns. \tag{2.6}$$

Thus after trading and recording the next $5$ prices, the wealth changes. Using this, we choose the reward to be polynomial in the change of the wealth, i.e.

$$rew = (wth_{t+1} - wth_{t}) - \left(\frac{wth_{t+1} - wth_{t}}{2}\right)^{2}.$$

The above formula helps to discourage too risky actions (i.e. ones with changes in the wealth that are too big). Additionally, if the attempted action fails, the reward is instead

$$rew = (wth_{t+1} - wth_{t}) - \left(\frac{wth_{t+1} - wth_{t}}{2}\right)^{2} - 0.1.$$
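The wealth and the reward signal can be written down directly. The sketch below uses $wth = mon + pr_{5}\,cns$ and $rew = \Delta - (\Delta/2)^{2}$ with $\Delta = wth_{t+1} - wth_{t}$, minus $0.1$ for a failed action; the function names are illustrative only.

```python
def wealth(mon, cns, pr5):
    """Total wealth: money pool plus coins valued at the last recorded price."""
    return mon + pr5 * cns


def reward(wth_now, wth_next, failed=False):
    """Quadratic-penalised wealth change, with an extra -0.1 for a failed action."""
    delta = wth_next - wth_now
    rew = delta - (delta / 2) ** 2
    return rew - 0.1 if failed else rew
```

Since $rew = \Delta - \Delta^{2}/4$ is a downward parabola, the reward peaks at $\Delta = 2$ with value $1$ and turns negative for $\Delta > 4$, which is how large swings in wealth are discouraged even when they are gains.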

2.1.3 Terminal state

As we mentioned earlier, we are using a savings mechanism. It ties in with the terminal state. We will consider three different terminal states. The first terminal state is the one in which

$$mon_{t+1} > mlim,$$

where $mlim$ is a parameter that changes when encountering a terminal state. This means that the current money pool of the agent is greater than some threshold. Before beginning the new episode, the extra money $mdf=mon_{t+1}-mlim$ is distributed between three pools: a fraction $0.34$ of it goes to a savings pool $sav$, which is never again used to trade; a fraction $0.33$ goes to a reserves pool $res$, which might be used again later; and the remaining $0.33$ is left in the money pool (so that after this still $mon_{t+1}>mlim$). Then $mlim$ is increased to the current value of $mon$ plus $mdf$. The reward for going into this state is modified: it is the usual reward plus the amount of money that was added to $sav$, i.e. $0.34\,mdf$.
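The redistribution in the first terminal state can be sketched as follows. The function name is hypothetical, and the update of $mlim$ follows a literal reading of the text (the value of $mon$ after the split, plus $mdf$), which is an assumption rather than a confirmed detail of the author's code.

```python
def close_profitable_episode(mon, sav, res, mlim):
    """Split the surplus mdf = mon - mlim between sav, res and the money pool.

    Returns (mon, sav, res, mlim, bonus), where bonus is the amount added to
    sav and is meant to be added to the usual reward.
    """
    mdf = mon - mlim
    if mdf <= 0:
        raise ValueError("first terminal state requires mon > mlim")
    to_sav = 0.34 * mdf
    to_res = 0.33 * mdf
    mon = mon - to_sav - to_res  # 0.33 * mdf stays above the old mlim
    new_mlim = mon + mdf         # assumed reading of "current mon plus mdf"
    return mon, sav + to_sav, res + to_res, new_mlim, to_sav
```

For example, starting from $mon = 130$ and $mlim = 100$ the surplus is $30$: about $10.2$ moves to $sav$, $9.9$ to $res$, and $109.9$ stays in the trading pool, so the total across the three pools is unchanged.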

The second terminal state is determined by four conditions:

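$$\begin{aligned} &mon_{t+1} < mlim; \\ &wth_{t+1} < mlimn; \\ &Q(s_{t}, a_{t}) > 0; \\ &rsi_{t+1} > 70. \end{aligned}$$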

Here $mlimn$ is a hyperparameter, which sets the lowest possible value of $mlim$. In general the meaning of RSI is open to interpretation [11, 12], but if all of the above conditions are met, we take this as an indication that the market conditions are favourable (while the agent is low on money), so half of the money in $res$ is returned to the money pool to be used for trading again. After this $mlim$ is again changed, this time to $\max\{mlimn,mon_{t+1}+\frac{res}{2}\}$, and the new episode begins.

The final terminal state is determined by the following:

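$$\begin{aligned} &mon_{t+1} < mlim; \\ &wth_{t+1} \geq mlimn; \\ &Q(s_{t}, a_{t}) < 0; \\ &rsi_{t+1} < 30. \end{aligned}$$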

Here the market is seen as unfavourable so the only thing to do is to change $mlim$ to $wth_{t+1}$ , so that it will be easier to redistribute some of the money later.
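The three terminal states can be summarised in a small classifier. This is a sketch under the conditions listed in this section ($mon_{t+1} > mlim$ for the first state; $mon_{t+1} < mlim$ together with the $wth$, $Q$ and RSI conditions for the second and third); the function names and the return convention are assumptions.

```python
def terminal_state(mon_next, wth_next, q_value, rsi_next, mlim, mlimn):
    """Return 1, 2 or 3 for the three terminal states, or None otherwise."""
    if mon_next > mlim:
        return 1
    if mon_next < mlim:
        if wth_next < mlimn and q_value > 0 and rsi_next > 70:
            return 2  # favourable market, agent low on money
        if wth_next >= mlimn and q_value < 0 and rsi_next < 30:
            return 3  # unfavourable market
    return None


def after_second_terminal(mon_next, res, mlimn):
    """Release half of res into the money pool and reset mlim."""
    released = res / 2
    mlim = max(mlimn, mon_next + released)
    return mon_next + released, res - released, mlim
```

In the third state nothing is moved between pools; only $mlim$ is set to $wth_{t+1}$.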

2.1.4 Policy and hyperparameters

As is standard we use an $\varepsilon$ -greedy policy. It picks actions based on the value of the average of the two $Q$ -functions in a given state. In general the choice of $\varepsilon$ is not a trivial task and the performance of the algorithm can vary greatly depending on this choice. There are many suggestions on how to successfully manage this balance of exploration and exploitation (using for example decay of $\varepsilon$ [8], change point detection [13], adaptation based on value differences [14], etc.). What we want here is to be able to explore sufficiently when the market conditions change, which is very important for a fully online algorithm. In line with this we use a simple decay of $\varepsilon$ , but mixed with a probabilistic reset to a larger value. More precisely, first initialise a counter $i_{\varepsilon}$ to [math]. Before each choice of an action $i_{\varepsilon}$ is either incremented by $1$ or with probability $prob_{\varepsilon}$ it is reset to $\left\lceil\frac{e^{5}-2}{5}\right\rceil$ , if $i_{\varepsilon}\geq\left\lceil\frac{e^{5}-2}{5}\right\rceil$ (this resets $\varepsilon$ to about $0.2$ ). Afterwards, $\varepsilon$ is calculated according to the formula

$$\varepsilon = \frac{1}{\ln(5 i_{\varepsilon} + 2)}.$$
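The decay-with-reset schedule $\varepsilon = 1/\ln(5 i_{\varepsilon} + 2)$ can be sketched as below. Clipping $\varepsilon$ to $1$ is an assumption added here so that the value can be used as a probability for small counters, and the names `RESET_INDEX`, `epsilon` and `step_counter` are illustrative.

```python
import math
import random

# Counter value to which i_eps is reset; it gives epsilon of about 0.2.
RESET_INDEX = math.ceil((math.exp(5) - 2) / 5)


def epsilon(i_eps):
    """epsilon = 1 / ln(5 * i_eps + 2), clipped to 1 (assumed clipping)."""
    return min(1.0, 1.0 / math.log(5 * i_eps + 2))


def step_counter(i_eps, prob_eps=1e-4, rng=random):
    """Advance the counter before an action choice, with an occasional reset."""
    if i_eps >= RESET_INDEX and rng.random() < prob_eps:
        return RESET_INDEX
    return i_eps + 1
```

Because the decay is logarithmic, $\varepsilon$ falls slowly, and a reset brings exploration back up to roughly one random action in five.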

Next, we want to choose a learning rate $\alpha$ . It is well-known that the learning rate in gradient descent methods greatly affects the performance of a neural network (or of the RL algorithm using it) [15]. To avoid fixing the learning rate manually, we choose to use a cyclical one [16, 17]. More precisely, a counter $i_{\alpha}$ is initialised to [math]. Then, before taking the previously chosen action $\alpha$ is calculated using the formula [17]

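$$\alpha = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})\left(1 + \cos\left(\frac{i_{\alpha}}{T_{\alpha}}\pi\right)\right),$$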

after which $i_{\alpha}$ is incremented by $1$ . This means that $\alpha$ varies between $\alpha_{max}$ and $\alpha_{min}$ with a period of $2T_{\alpha}$ steps.
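The cosine schedule for the learning rate, $\alpha = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\pi i_{\alpha}/T_{\alpha}))$, is short enough to sketch in full. The default arguments are the hyperparameter values quoted in section 3.1; the function name is illustrative.

```python
import math


def learning_rate(i_alpha, alpha_min=1e-3, alpha_max=1.0, t_alpha=1000):
    """Cyclical learning rate with period 2 * t_alpha steps."""
    phase = math.pi * i_alpha / t_alpha
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(phase))
```

The rate starts at $\alpha_{max}$, reaches $\alpha_{min}$ after $T_{\alpha}$ steps and returns to $\alpha_{max}$ after $2T_{\alpha}$ steps.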

2.2 Fast learning network

2.2.1 Preliminaries

Now we need to describe the neural network, that will approximate the $Q$ -function, namely FLN. FLNs use a parallel connection of two feedforward neural networks – one has a single hidden layer, while the other has none [7]. The hidden layer weights are random and fixed and only the output weights are learned. If, in addition, we choose the output neurons’ activation function to be the identity function and fix all the biases to zero, this effectively means that the approximating function is linear in the feature vector with additional fixed nonlinear terms from the hidden layer.

More precisely, we can denote the input and output as $X$ and $Y$ , respectively:

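$$X = \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}, \quad Y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{pmatrix},$$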

where $n$ and $m$ are the respective sizes of the input and the output. Also, for shortness of notation, we denote the hidden layer output as

$$g(Z) = \begin{pmatrix} g(z_{1}) \\ g(z_{2}) \\ \vdots \\ g(z_{r}) \end{pmatrix}.$$

Here $g$ is the activation function of the hidden layer and $Z$ is the input to the hidden layer. $r$ is the hidden layer size.

The weights are denoted by $W^{\mathrm{oi}}$ (input to output layers), $W^{\mathrm{hi}}$ (input to hidden layers) and $W^{\mathrm{oh}}$ (hidden to output layers):

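$$W^{\mathrm{oi}} = \begin{pmatrix} w_{11}^{\mathrm{oi}} & \cdots & w_{1n}^{\mathrm{oi}} \\ \vdots & \ddots & \vdots \\ w_{m1}^{\mathrm{oi}} & \cdots & w_{mn}^{\mathrm{oi}} \end{pmatrix}, \quad W^{\mathrm{hi}} = \begin{pmatrix} w_{11}^{\mathrm{hi}} & \cdots & w_{1n}^{\mathrm{hi}} \\ \vdots & \ddots & \vdots \\ w_{r1}^{\mathrm{hi}} & \cdots & w_{rn}^{\mathrm{hi}} \end{pmatrix}, \quad W^{\mathrm{oh}} = \begin{pmatrix} w_{11}^{\mathrm{oh}} & \cdots & w_{1r}^{\mathrm{oh}} \\ \vdots & \ddots & \vdots \\ w_{m1}^{\mathrm{oh}} & \cdots & w_{mr}^{\mathrm{oh}} \end{pmatrix}.$$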

Now the $k$ -th component of the output vector is calculated by the following formula [7]:

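$$y_{k} = \sum_{s=1}^{n} w_{ks}^{\mathrm{oi}} x_{s} + \sum_{l=1}^{r} w_{kl}^{\mathrm{oh}}\, g\left(\sum_{t=1}^{n} w_{lt}^{\mathrm{hi}} x_{t}\right).$$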

We can shorten the above to

$$Y = W^{\mathrm{oi}} X + W^{\mathrm{oh}} g(W^{\mathrm{hi}} X).$$

The optimisation is then performed only with respect to $W^{\mathrm{oi}}$ and $W^{\mathrm{oh}}$ .
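A forward pass through this structure, $Y = W^{\mathrm{oi}} X + W^{\mathrm{oh}} g(W^{\mathrm{hi}} X)$ with zero biases and an identity output activation, can be sketched in plain Python. Matrices are stored as lists of rows, the logistic function is used for $g$ as chosen later in the paper, and the helper names are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(w, x):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def fln_forward(w_oi, w_hi, w_oh, x):
    """Return (Y, g(Z)) for input x; only w_oi and w_oh are ever learned."""
    hidden = [logistic(z) for z in matvec(w_hi, x)]  # g(W_hi X), fixed weights
    direct = matvec(w_oi, x)                         # linear input-to-output path
    via_hidden = matvec(w_oh, hidden)                # hidden-to-output path
    return [d + h for d, h in zip(direct, via_hidden)], hidden
```

Returning the hidden output alongside $Y$ is a convenience: the concatenation of $X$ and $g(Z)$ is exactly the vector that multiplies the learned weights.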

2.2.2 Weight renormalisation

One common problem that we can encounter is that the weights in the neural network may diverge. This is especially true when the learning rate is large (but a large learning rate might help with adaptation). The above is a problem, since it is generally accepted that very large weights correlate with overfitting the training set and poor generalisation [18]. There are many ways to try to deal with this, but one simple method is to just renormalise the weight vector [19]. We do something similar with the output weights $W^{\mathrm{oi}}$ and $W^{\mathrm{oh}}$ .

More precisely, consider the output $y_{k}$ . It is obtained by scalar multiplication of the weight vector

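$$(w_{k1}^{\mathrm{oi}}, w_{k2}^{\mathrm{oi}}, \ldots, w_{kn}^{\mathrm{oi}}, w_{k1}^{\mathrm{oh}}, w_{k2}^{\mathrm{oh}}, \ldots, w_{kr}^{\mathrm{oh}})^{T} \tag{2.25}$$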

with the concatenation of the input $X$ and the hidden layer output $g(Z)$ . The vector (2.25) itself is the concatenation of the $k$ -th row of $W^{\mathrm{oi}}$ and the $k$ -th row of $W^{\mathrm{oh}}$ and it is learned by stochastic gradient descent. A record of the maximal value of its norm $maxw$ is kept. If the weight vector is longer than $1$ after an update, it is rescaled by a factor of $\frac{1}{maxw}$ . This keeps the weights from diverging and allows us to use large learning rates (the exact values of $\alpha_{min}$ and $\alpha_{max}$ are hyperparameters and will be specified later, but them being larger should help the agent adapt quickly).
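One possible reading of the renormalisation step is sketched below. The text leaves some room for interpretation, so two choices here are assumptions: the running maximum `maxw` is updated with the current norm before rescaling, and the rescaling is applied whenever the current norm exceeds $1$.

```python
import math


def renormalise(weights, maxw):
    """Rescale one action's weight vector; returns (weights, maxw)."""
    norm = math.sqrt(sum(w * w for w in weights))
    maxw = max(maxw, norm)  # running record of the largest norm seen so far
    if norm > 1.0:
        weights = [w / maxw for w in weights]
    return weights, maxw
```

Under this reading a rescaled vector always has norm at most $1$, since `maxw` is never smaller than the current norm.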

3 Implementation details and testing

3.1 Observing, trading, hyperparameters

Before implementing the algorithm we need to consider a few points, namely how to record prices, how to trade, how exactly to structure the FLN and the values of the hyperparameters. The first question that needs to be answered is how often to record a price (and volume) for the feature vector. In principle the intervals can be of any length. One advantage of automated trading is that it can react quickly to the market. In line with this we want the intervals to be short, e.g. $1$ minute (more precisely the price is recorded in the beginning of the $1$ -minute interval). However, a trade occurs right after observing $5$ prices, which in practice means that trades are performed every $5$ minutes. This means that, depending on the volatility of the market, consecutive trades might happen on similar (often the same) prices and the profit from this is very small. Possibly too small to compensate for the trading fee. To counter this the observed price passes through a filter before being recorded, such that the relative change between two prices is greater than $0.01$ .
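The price filter can be sketched as follows. The text only states that the relative change between two recorded prices must exceed $0.01$; measuring that change against the last recorded price, and always keeping the first observation, are assumptions of this sketch.

```python
def filter_prices(prices, threshold=0.01):
    """Keep a price only if it moved by more than `threshold` (relative)
    with respect to the last price that was kept."""
    recorded = []
    for p in prices:
        if not recorded or abs(p - recorded[-1]) / recorded[-1] > threshold:
            recorded.append(p)
    return recorded
```

With this filter the five prices of a state are no longer evenly spaced in time: in a quiet market a single state can span far more than five minutes.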

The next question is how exactly to trade. In testing we just assume that the trade occurs at the last recorded price $pr_{5}$. To ensure this in practice one should use limit orders instead of market orders to avoid slippage. However, this poses the problem that the trade might not be executed at all (or at least not before $5$ new prices are recorded and it is time to trade again). To ensure that it has up-to-date information the trader should cancel the order before the next $5$ prices are recorded, e.g. after recording $pr_{4}$. Additionally, in such cases one can include the same negative reward as for trade failure due to insufficient funds to try to discourage orders that are later cancelled.

Next, we need to describe the neural network in more detail. Its input is the feature vector (2.5), representing the state, while in its output we include one node for each action, i.e.

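$$X = feat, \quad Y = \begin{pmatrix} Q(s, a_{1}) \\ Q(s, a_{2}) \\ \vdots \\ Q(s, a_{19}) \end{pmatrix}.$$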

This means that there are $27$ input nodes and $19$ output nodes. The size of the hidden layer is a hyperparameter of the algorithm and it is fixed to $r=50$. Then the weight matrices $W^{\mathrm{oi}}$, $W^{\mathrm{hi}}$ and $W^{\mathrm{oh}}$ are $19\times 27$, $50\times 27$ and $19\times 50$, respectively, while the weight vector (2.25) has $27+50=77$ components.

From the above we see that there is a separate weight vector (2.25) for each action $a_{k}$ . After choosing and taking the action $a_{k}$ only the respective weight vector should be updated. So the gradient of $Q(s,a_{k})$ with respect to the weights is just the concatenation of $X$ and $g(Z)$ and it is the same for all actions.
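Because the gradient is just the concatenation of $X$ and $g(Z)$, one learning step reduces to a scaled addition to a single weight vector. The paper does not spell out the update rule in this section, so the sketch below assumes the standard double $Q$-learning step from [8] with a linear-in-weights approximator; `phi` stands for the concatenated vector, and all names are hypothetical.

```python
def q_value(weights, phi, k):
    """Q(s, a_k) as the dot product of action k's weight vector with phi."""
    return sum(w * f for w, f in zip(weights[k], phi))


def double_q_update(w_a, w_b, phi, action, rew, phi_next,
                    alpha, gamma=0.05, terminal=False):
    """Update network A's weight vector for `action` in place.

    The greedy next action is chosen with network A and evaluated with
    network B. Returns the temporal-difference error.
    """
    if terminal:
        target = rew
    else:
        best = max(range(len(w_a)), key=lambda k: q_value(w_a, phi_next, k))
        target = rew + gamma * q_value(w_b, phi_next, best)
    delta = target - q_value(w_a, phi, action)
    w_a[action] = [w + alpha * delta * f for w, f in zip(w_a[action], phi)]
    return delta
```

In the usual double $Q$-learning scheme each step would randomly pick which of the two networks plays the role of `w_a`, and only the row belonging to the action actually taken changes.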

For the neuron activation function we choose to use the logistic function, i.e.

$$g(z) = \frac{1}{1 + e^{-z}},$$

and the feature vector (2.5) is scaled, so that its norm is $6$ , before feeding it into the neural network.

Finally, we need to fix the rest of the hyperparameters (in addition to the hidden layer size). For the discount factor we choose $\gamma=0.05$ . The lowest possible value of $mlim$ is fixed to $mlimn=75$ . While we have eliminated the need to choose $\varepsilon$ , there is still a hyperparameter to fix and it is the probability for a reset of $\varepsilon$ . We choose this to be $prob_{\varepsilon}=10^{-4}$ . Likewise, we are not choosing the learning rate $\alpha$ . Nevertheless, there are still hyperparameters to fix there also, namely $\alpha_{min}$ , $\alpha_{max}$ and $T_{\alpha}$ . We choose the following values: $\alpha_{min}=10^{-3}$ , $\alpha_{max}=1$ and $T_{\alpha}=10^{3}$ .

3.2 Testing

With the above considerations in mind here we present the results of testing the algorithm (the whole code for which is written in Mathematica and is included in appendix A) on historical market data. Because we are using previously recorded prices, as already mentioned, there are a few things we can’t account for, one of which is that in testing the order and the trade are the same, i.e. the order is always fully fulfilled at exactly the recorded price. Also, the precision is much higher when testing as we may use numbers with many digits. In real world applications one needs to round appropriately (e.g. when using part of $res$ , when placing an order, etc.).

In principle nothing stops us from using the trader in any market, but our tests are performed on historical data for the ADA/USDT cryptocurrency pair on Binance from the pair listing on $17.04.2018$ to $06.08.2021$. We first pass the data through a filter as described in section 3.1. Then we use $4$ subsets of the filtered data. One is the whole dataset (figure 1(a)), while the other three are attempts to capture different market conditions: a “bearish” (figure 1(b)), a “bullish” (figure 1(c)) and a “mixed” (figure 1(d)) market.

For each dataset we perform $1000$ runs of the algorithm. Each run starts with $mon=100$ , $cns=0$ , $sav=0$ , $res=0$ and $mlim=mon$ and we record the performance in terms of the sum of the wealth (2.6) and the $sav$ and $res$ pools, i.e.

$$twth = wth + sav + res,$$

as well as the value of $sav$ alone, at the end of the run. In order to evaluate the effectiveness of the algorithm we also perform $1000$ runs with randomly selected actions on each time step. In this case $twth=wth$ , as the savings mechanism depends on multiple reinforcement learning ingredients.

After this we arrange the data in histograms (figures 2, 3, 4 and 5), which also include the sample minimum and maximum, and calculate the sample means, medians, standard deviations, as well as the empirical probabilities for finishing a run with $twth\leq 100$ . The last is included as a measure of the risk of losing money after a run. All of these are arranged in tables 1, 2, 3 and 4. As can be seen, our algorithm performs better than random in all datasets.

In particular, for the full dataset we observe an increase in the mean value of $twth$ of about $39\%$ when taking non-random actions versus random ones. The median also increases – by $58\%$ . Additionally, the probability for finishing a run with $twth\leq 100$ (i.e. for losing money) is $86\%$ smaller when taking non-random actions.

Similar calculations can be made for the rest of the datasets. In all of the cases the algorithm has higher mean and median values of $twth$ when compared to random, as well as lower values of $P(twth\leq 100)$ , i.e. it makes more money on average and has a lower chance to lose money. Probably the most interesting case is the “bearish” market one, as there it is the hardest to make a profit. This is reflected in our results as the differences with random are the smallest. More specifically, the mean and median of $twth$ are about $4\%$ larger and the probability of $twth\leq 100$ – about $5\%$ smaller.

For completeness we also include the relative performance in the other two datasets. In terms of the mean of $twth$ our algorithm performs about $5\%$ better in the “bullish” dataset and $12\%$ better in the “mixed” dataset. In terms of the median the increases are $6\%$ and $14\%$ for the “bullish” and “mixed” cases, respectively. Finally, in terms of $P(twth\leq 100)$ the decreases are $74\%$ and $26\%$ , respectively, for the two cases.
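The percentages quoted in this section follow directly from the values in tables 1 to 4. As a quick check, the sketch below recomputes the relative change of the non-random agent with respect to the random one; the dictionary layout and the function name are illustrative.

```python
# (non-random, random) pairs copied from tables 1-4.
TABLES = {
    "full":    {"mean": (263.928, 189.703), "median": (241.903, 153.028), "p_loss": (0.031, 0.226)},
    "bearish": {"mean": (78.999, 76.139),   "median": (77.538, 74.905),   "p_loss": (0.884, 0.926)},
    "bullish": {"mean": (152.267, 145.245), "median": (150.601, 142.256), "p_loss": (0.007, 0.027)},
    "mixed":   {"mean": (104.338, 93.202),  "median": (100.545, 88.569),  "p_loss": (0.491, 0.664)},
}


def relative_change_percent(non_random, random_value):
    """Relative change of the non-random result with respect to random, in %."""
    return 100 * (non_random - random_value) / random_value


for name, stats in TABLES.items():
    row = {key: round(relative_change_percent(*pair)) for key, pair in stats.items()}
    print(name, row)
```

A negative value for `p_loss` means a lower empirical probability of finishing a run with $twth \leq 100$.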

4 Conclusion

In this paper we attempted to tackle the challenging problem of market prediction using machine learning. More specifically, we introduced a deep reinforcement learning agent that is meant to adapt to market conditions and trade fully online. The main components of our algorithm are a more or less standard Double $Q$ -learning framework coupled with a Fast Learning Network, used to approximate the $Q$ -functions. On top of that we added a mechanism, which takes money out of the trading pool, both as a means to take profit and to boost performance by reusing some of it at a more favourable moment.

After this we tested the algorithm on historical market data, which was chosen so that it captures different market conditions. We observed that our agent performs better than random on all datasets, both in terms of profit and probability of loss at the end of a run through the data. Furthermore, it did so even in a “bearish” market, when the overall market trend is downward and, most importantly, without any prior learning on a large offline dataset. We can view the latter as the main strength of our algorithm.

Appendix A Mathematica code

CloseKernels[];
LaunchKernels[];
ClearAll["Global`…

Bibliography

[1] P. Treleaven, M. Galas, V. Lalchand, Algorithmic trading review, Communications of the ACM 56(11), 76-85 (2013).
[2] D. Silver, A. Huang, C. J. Maddison et al., Mastering the game of Go with deep neural networks and tree search, Nature 529, 484-489 (2016).
[3] O. Vinyals, I. Babuschkin, W. M. Czarnecki et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature 575, 350-354 (2019).
[4] A. Millea, Deep Reinforcement Learning for Trading - A Critical Survey, Data 6(11), 119 (2021).
[5] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, 2004 IEEE International Joint Conference on Neural Networks, Budapest, Hungary, Volume 2, 985-990 (2004).
[6] S. Ding, X. Xu, R. Nie, Extreme learning machine and its applications, Neural Computing and Applications 25, 549-556 (2014).
[7] G. Li, P. Niu, X. Duan, X. Zhang, Fast learning network: a novel artificial neural network with a fast learning speed, Neural Computing and Applications 24, 1683-1695 (2014).
[8] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Ed., The MIT Press, Cambridge, MA (2018).