Reinforcement Learning with Depreciating Assets

Taylor Dohmen; Ashutosh Trivedi

arXiv:2302.14176·cs.AI·March 1, 2023


TL;DR

This paper explores reinforcement learning where rewards, termed assets, depreciate over time, introducing a new framework that accounts for reward decay and proposing a model-free method to find optimal policies.

Contribution

It introduces the concept of depreciating assets in reinforcement learning, extending traditional models to include reward decay over time with a Bellman-style optimality framework.

Findings

01. Developed a Bellman-style equation for assets with decay.

02. Proposed a model-free reinforcement learning algorithm for this setting.

03. Demonstrated the effectiveness of the approach in theoretical scenarios.


Equations

$$\left\langle\sum_{k=1}^{n}r_{k}\gamma^{n-k}\right\rangle_{n=1}^{\infty}$$

$$3,\ (3\gamma+4),\ (3\gamma^{2}+4\gamma+5),\ (3\gamma^{3}+4\gamma^{2}+5\gamma+3),\ (3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4),\ \dots$$

$$\begin{aligned}
&3+\lambda(3\gamma+4)+\lambda^{2}(3\gamma^{2}+4\gamma+5)+\lambda^{3}(3\gamma^{3}+4\gamma^{2}+5\gamma+3)+\lambda^{4}(3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4)+\cdots\\
&\quad=(3+3\lambda\gamma+3\lambda^{2}\gamma^{2}+\cdots)+(4\lambda+4\lambda^{2}\gamma+4\lambda^{3}\gamma^{2}+\cdots)+(5\lambda^{2}+5\lambda^{3}\gamma+5\lambda^{4}\gamma^{2}+\cdots)+(3\lambda^{3}+3\lambda^{4}\gamma+3\lambda^{5}\gamma^{2}+\cdots)+\cdots\\
&\quad=3(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+4\lambda(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+5\lambda^{2}(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+3\lambda^{3}(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+\cdots\\
&\quad=\frac{3+4\lambda+5\lambda^{2}+3\lambda^{3}+\cdots}{1-\lambda\gamma}\\
&\quad=\frac{3+4\lambda+5\lambda^{2}}{(1-\lambda\gamma)(1-\lambda^{3})}.
\end{aligned}$$

$$3,\ \frac{3\gamma+4}{2},\ \frac{3\gamma^{2}+4\gamma+5}{3},\ \frac{3\gamma^{3}+4\gamma^{2}+5\gamma+3}{4},\ \frac{3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4}{5},\ \dots$$

$$\lim_{\lambda\to 1}(1-\lambda)\frac{3+4\lambda+5\lambda^{2}}{(1-\lambda\gamma)(1-\lambda^{3})}=\frac{3+4+5}{3(1-\gamma)}.$$

$$\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\quad\text{and}\quad\liminf_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}R(s_{k},a_{k}).$$

$$V_{\lambda}(s)$$

$$V(s)$$

$$V_{\lambda}(s)=\max_{a\in A}\left\{R(s,a)+\lambda\mathbb{E}_{T}\left[V_{\lambda}(t)\mid s,a\right]\right\},$$

$$Q_{\lambda}(s,a)=R(s,a)+\lambda\mathbb{E}_{T}\left[V_{\lambda}(t)\mid s,a\right].$$

$$Q_{\lambda}^{n+1}(s,a)\leftarrow Q_{\lambda}^{n}(s,a)+\alpha_{n}\left(R(s,a)+\lambda V_{\lambda}^{n}(t)-Q_{\lambda}^{n}(s,a)\right),$$

$$\left\langle\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k}\right\rangle_{n=1}^{\infty}$$

$$\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k},$$

$$V_{\lambda}^{\gamma}(s)=\sup_{\pi\in\Pi^{M}}\mathbb{E}_{s}^{\pi}\left[\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k}\right].$$

$$V_{\lambda}^{\gamma}(s)=\max_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[V_{\lambda}^{\gamma}(t)\mid s,a\right]\right\}.$$

$$\sum_{n=1}^{\infty}\sum_{k=1}^{n}\lambda^{k-1}R(s_{k},a_{k})\lambda^{n-k}\gamma^{n-k}.$$

$$\left(\sum_{n=1}^{\infty}x_{n}\right)\left(\sum_{n=1}^{\infty}y_{n}\right)=\sum_{n=1}^{\infty}\sum_{k=1}^{n}x_{k}y_{n-k+1}=XY.$$

$$\left(\sum_{n=1}^{\infty}(\lambda\gamma)^{n-1}\right)\left(\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\right),$$

$$\frac{1}{1-\lambda\gamma}\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n}).$$

$$V_{\lambda}^{\gamma}(s)=\frac{1}{1-\lambda\gamma}\sup_{\pi\in\Pi^{M}}\mathbb{E}_{s}^{\pi}\left[\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\right]=\frac{V_{\lambda}(s)}{1-\lambda\gamma}.$$

$$V_{\lambda}^{\gamma}(s)=\max_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[V_{\lambda}^{\gamma}(t)\mid s,a\right]\right\}.$$

$$\text{minimize}\ \sum_{s\in S}x_{s}v_{s}\quad\text{subject to}\quad\frac{R(s,a)}{1-\lambda\gamma}\leq\sum_{t\in S}v_{t}\left(\delta_{s,t}-\lambda T(t\mid s,a)\right)\quad\text{for all } s\in S,\ a\in A.$$

$$\pi(s)=\operatorname*{arg\,max}_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[v_{t}^{*}\mid s,a\right]\right\}.$$


Taxonomy

Topics: Economic theories and models

Full text

Reinforcement Learning with Depreciating Assets

Taylor Dohmen

University of Colorado Boulder

Ashutosh Trivedi

University of Colorado Boulder

Abstract

A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate a Bellman-style equational characterization of optimality in this context and develop a model-free reinforcement learning approach to obtain optimal policies.

1 Introduction

Time preference [Loewenstein and Jon, 1992; Frederick et al., 2002] refers to the tendency of rational agents to value potential desirable outcomes in proportion to the expected time before such an outcome is realized. In other words, agents prefer to get a future reward sooner rather than later, all else being equal, and similarly, agents prefer to experience negative outcomes later rather than sooner. This phenomenon is typically codified in mathematical models in terms of discounting [Shapley, 1953] and has been applied to a diverse array of disciplines concerned with optimization such as economics [Heal, 2007; Philibert, 1999], game theory [Filar and Vrieze, 1996], control theory [Puterman, 1994], and reinforcement learning [Sutton and Barto, 2018]. These models focus on the situation in which an agent moves through a stochastic environment in discrete time by selecting an action to perform at each time step and receiving an immediate reward based on the selected action and environmental state. In particular, we consider exponential discounting, as introduced by Shapley [1953], in which the agent carries this process on ad infinitum to generate an infinite sequence of rewards $\left\langle r_{n}\right\rangle^{\infty}_{n=1}$ with the goal of maximizing, with respect to a discount factor $\lambda\in(0,1)$ , the discounted sum $\sum^{\infty}_{n=1}\lambda^{n-1}r_{n}$ . The discount factor is selected as a parameter and quantifies the magnitude of the agent’s time preference.

A notable characteristic of the aforementioned discounted optimization framework is an implicit assumption that the utility of a reward remains constant once it is obtained by a learning agent. While this seemingly innocuous supposition simplifies the model and helps to make it amenable to analysis, there are a number of scenarios where such an assumption is not appropriate. Consider, for instance, the most basic and ubiquitous of rewards used to incentivize human behaviors: money. The value of money tends to decay with time according to the rate of inflation, and the consequences of this decay are a topic of widespread interest and intense study [Hulten and Wykoff, 1980; Comley, 2015; Beckerman, 1991; Fergusson, 2010]. Recognizing the fundamental role such decay has in influencing the dynamics of economic systems throughout the world, we consider its implications with respect to optimization and reinforcement learning in Markov decision processes.

1.1 Asset Depreciation

When discussing a situation with decaying reward values, it is useful to distinguish between potential future rewards and actual rewards that have been obtained. As such, we introduce the term asset to refer to a reward that has been obtained by an agent at a previous moment in time. Using this terminology, the present work may be described as an inquiry into optimization and learning under the assumption that assets depreciate. Depreciation, a term borrowed from the field of finance and accounting [Wright, 1964; Burt, 1972], describes exactly the phenomenon where the value of something decays with time.

We propose a notion of depreciation that is inspired by traditional discounting and is based on applying the same basic principle of time preference to an agent’s history in addition to its future. More precisely, we consider the situation in which an agent’s behavior is evaluated with respect to an infinite sequence of cumulative accrued assets, each of which is discounted in proportion to how long ago it was obtained. That is, we propose evaluating the agent in terms of functions on the sequence of assets

$$\left\langle\sum_{k=1}^{n}r_{k}\gamma^{n-k}\right\rangle_{n=1}^{\infty},$$

where $\gamma\in(0,1)$ is a discount factor, rather than on the sequence of rewards $\left\langle r_{n}\right\rangle^{\infty}_{n=1}$ . To motivate the study of depreciation and illustrate its naturalness, we examine the following hypothetical case-study.

Example 1 (Used Car Dealership).

Consider a used car dealership with a business model involving purchasing used cars in locations with favorable regional markets, driving them back to their shop, and selling them for profit in their local market. Suppose that our optimizing agent is an employee of this dealership, tasked with managing capital acquisition. More specifically, this employee’s job is to decide the destination from which the next car should be purchased, whenever such a choice arises. The objective of the agent is to maximize the sum of the values of all vehicles in stock at the dealership over a discounted time-horizon for some discount factor $\lambda\in(0,1)$ . Note that the discounted time-horizon problem is equivalent to the problem of maximizing expected terminal payoff of the process given a constant probability $(1-\lambda)$ of terminating operations at any point.

It has long been known [Wykoff, 1970; Ackerman, 1973] that cars tend to continually depreciate in value after being sold as new, and so any reasonable model for the value of all vehicles in the inventory should incorporate some notion of asset depreciation. Suppose that another discount factor $\gamma\in(0,1)$ captures the rate at which automobiles lose value per unit of time. Considering $\gamma$ -depreciated rewards and $\lambda$ -discounted horizon, the goal of our agent can be defined as a discounted depreciating optimization problem. Alternatively, one may seek to optimize the long run average (mean payoff) of $\gamma$ -depreciated rewards.

1.2 Discounted Depreciating Payoff

Consider the sequence $x=\left\langle 3,4,5,3,4,5,\ldots\right\rangle$ of (absolute) rewards accumulated by the agent. In the presence of depreciation, the cumulative asset values at various points in time follow the sequence

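$$3,\ (3\gamma+4),\ (3\gamma^{2}+4\gamma+5),\ (3\gamma^{3}+4\gamma^{2}+5\gamma+3),\ (3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4),\ \dots$$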

For the $\lambda$ -discounted time horizon, the value of the assets can be computed as follows:

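$$\begin{aligned}
&3+\lambda(3\gamma+4)+\lambda^{2}(3\gamma^{2}+4\gamma+5)+\lambda^{3}(3\gamma^{3}+4\gamma^{2}+5\gamma+3)+\lambda^{4}(3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4)+\cdots\\
&\quad=(3+3\lambda\gamma+3\lambda^{2}\gamma^{2}+\cdots)+(4\lambda+4\lambda^{2}\gamma+4\lambda^{3}\gamma^{2}+\cdots)+(5\lambda^{2}+5\lambda^{3}\gamma+5\lambda^{4}\gamma^{2}+\cdots)+(3\lambda^{3}+3\lambda^{4}\gamma+3\lambda^{5}\gamma^{2}+\cdots)+\cdots\\
&\quad=3(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+4\lambda(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+5\lambda^{2}(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+3\lambda^{3}(1+\lambda\gamma+\lambda^{2}\gamma^{2}+\cdots)+\cdots\\
&\quad=\frac{3+4\lambda+5\lambda^{2}+3\lambda^{3}+\cdots}{1-\lambda\gamma}\\
&\quad=\frac{3+4\lambda+5\lambda^{2}}{(1-\lambda\gamma)(1-\lambda^{3})}.
\end{aligned}$$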

Notice that this $\gamma$ -depreciated sum is equal to the $\lambda$ -discounted sum when immediate rewards are scaled by a factor $\frac{1}{1-\lambda\gamma}$ . We show that this is not a mere coincidence, and prove that this equality holds also for general MDPs.
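The closed form for the periodic stream $\left\langle 3,4,5,3,4,5,\ldots\right\rangle$ can be checked numerically. The following is a minimal sketch, not part of the original development: the function names and the values $\lambda=0.9$, $\gamma=0.5$ are illustrative assumptions, and the infinite series is truncated to a finite prefix.

```python
def depreciated_asset(rewards, n, gamma):
    """Asset value after n steps: sum_{k=1}^{n} r_k * gamma**(n - k)."""
    return sum(rewards[k - 1] * gamma ** (n - k) for k in range(1, n + 1))


def discounted_depreciating_payoff(rewards, lam, gamma):
    """Truncated series sum_{n>=1} lam**(n - 1) * (asset value after n steps)."""
    return sum(lam ** (n - 1) * depreciated_asset(rewards, n, gamma)
               for n in range(1, len(rewards) + 1))


lam, gamma = 0.9, 0.5        # illustrative discount factors (assumed values)
rewards = [3, 4, 5] * 200    # finite prefix of the periodic reward stream

truncated = discounted_depreciating_payoff(rewards, lam, gamma)
closed_form = (3 + 4 * lam + 5 * lam ** 2) / ((1 - lam * gamma) * (1 - lam ** 3))

# The two numbers should agree up to the (tiny) truncation error.
print(truncated, closed_form)
```

The first function follows the definition of the asset sequence literally, so the agreement of the two printed numbers is a check on the algebra rather than on a particular implementation shortcut.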

1.3 Average Depreciating Payoff

Next, consider the long-run average of the depreciating asset values as the limit inferior of the sequence

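$$3,\ \frac{3\gamma+4}{2},\ \frac{3\gamma^{2}+4\gamma+5}{3},\ \frac{3\gamma^{3}+4\gamma^{2}+5\gamma+3}{4},\ \frac{3\gamma^{4}+4\gamma^{3}+5\gamma^{2}+3\gamma+4}{5},\ \dots$$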

Based on classical Tauberian results [Bewley and Kohlberg, 1976], it is tempting to conjecture that the $\lambda$ -discounted, $\gamma$ -depreciating value converges to this mean as $\lambda\to 1$ , e.g.

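$$\lim_{\lambda\to 1}(1-\lambda)\frac{3+4\lambda+5\lambda^{2}}{(1-\lambda\gamma)(1-\lambda^{3})}=\frac{3+4+5}{3(1-\gamma)}.$$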

Indeed, we prove that this conjecture holds.
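A quick numerical illustration of this limit for the periodic stream $\left\langle 3,4,5,\ldots\right\rangle$ is sketched below. It rests on two assumptions that are ours rather than the text's: the long-run average is read as the running mean $\frac{1}{n}\sum_{k\leq n}$ of the depreciated asset values, and $\gamma=0.5$ is an arbitrary illustrative choice. All function names are hypothetical.

```python
def running_mean_of_assets(rewards, gamma):
    """Mean of the asset values A_1..A_n, where A_n = gamma * A_{n-1} + r_n."""
    asset = 0.0   # current depreciated asset value A_n
    total = 0.0   # A_1 + ... + A_n
    for r in rewards:
        asset = gamma * asset + r
        total += asset
    return total / len(rewards)


def scaled_discounted_value(lam, gamma):
    """(1 - lam) times the closed-form discounted depreciating sum for 3,4,5,..."""
    return (1 - lam) * (3 + 4 * lam + 5 * lam ** 2) / ((1 - lam * gamma) * (1 - lam ** 3))


gamma = 0.5                                  # illustrative value (assumed)
limit = (3 + 4 + 5) / (3 * (1 - gamma))      # right-hand side of the limit above
mean_value = running_mean_of_assets([3, 4, 5] * 100_000, gamma)
near_one = scaled_discounted_value(0.9999, gamma)

# Both quantities should be close to the stated limit.
print(limit, mean_value, near_one)
```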

Contributions.

The highlights of this paper are given below.

$\blacktriangleright$ We initiate the study of discounted and average payoff optimization in the presence of depreciation dynamics.

$\blacktriangleright$ We characterize the optimal value of the discounted depreciating payoff via Bellman-style optimality equations and use them to show that stationary deterministic policies are sufficient for achieving optimality. Moreover, our characterization enables computing the optimal value and an optimal policy in polynomial time in the planning setting.

$\blacktriangleright$ The optimality equation also facilitates a formulation of a variant of Q-learning that is compatible with asset depreciation, thereby providing a model-free reinforcement learning approach to obtain optimal policies in the learning setting.

$\blacktriangleright$ We show the classical Tauberian theorem relating discounted and average objectives can be extended to the depreciating reward setting. This result allows us to establish the sufficiency of stationary deterministic policies for optimality with respect to the average depreciating payoffs.

Organization.

We begin by introducing necessary notation and reviewing the relevant technical background. Section 3 develops results on discounted depreciating payoff, while Section 4 develops results for the average depreciating objective. We discuss some closely related work in Section 5 and recap our contributions in the concluding section.

2 Preliminaries

Let $\mathbb{R}$ be the set of real numbers and $\mathbb{N}$ the set of natural numbers. For a set $X$ , we write $\left|X\right|$ to denote its cardinality and $\mathsf{Dist}\left(X\right)$ for the set of all probability distributions over $X$ . A point distribution over $X$ is one that assigns probability 1 to a unique element of $X$ and probability 0 to all others.

The technical portions of the paper are carried out within the standard mathematical framework of asymptotic optimization and learning in environments modeled as finite Markov decision processes. Our presentation follows the conventions set in the standard textbooks on optimization and learning [Puterman, 1994; Filar and Vrieze, 1996; Sutton and Barto, 1998; Feinberg and Shwartz, 2012].

2.1 Markov Decision Processes

A (finite) Markov decision process (MDP) $M$ is a tuple $(S,A,T,R)$ in which $S$ is a finite set of states, $A$ is a finite set of actions, $T:\left(S\times A\right)\to\mathsf{Dist}\left(S\right)$ is a stochastic transition function specifying, for any $s,t\in S$ and $a\in A$ the conditional probability $T(t\mid s,a)$ of moving to state $t$ given that the current state is $s$ and that action $a$ has been chosen, and $R:\left(S\times A\right)\to\mathbb{R}$ is a real-valued reward function mapping each state-action pair to a numerical valuation. For any function $f:S\to\mathbb{R}$ , i.e. any random variable on the state space of the MDP, we write $\mathbb{E}_{T}\left[f(t)\mid s,a\right]$ to denote the conditional expectation $\sum_{t\in S}f(t)T(t\mid s,a)$ of $f$ on the successor state, given that the agent has selected action $a$ from state $s$ . A path in $M$ is a sequence $s_{1}a_{1}s_{2}\cdots a_{n}s_{n+1}$ of alternating states and actions such that $0<T(s_{k+1}\mid s_{k},a_{k})$ at every index. Let $\mathcal{F}(M)$ denote the set of all finite paths in $M$ and $\mathcal{I}(M)$ denote the set of all infinite paths in $M$ .

Payoffs, Policies, and Optimality.

We focus on infinite duration quantitative optimization problems where an outcome may be concretized as an infinite path in the MDP. Such an outcome is evaluated relative to some mapping into the real numbers $\mathcal{I}(M)\to\mathbb{R}$ called a payoff. A policy on $M$ is a function $\pi:\mathcal{F}(M)\to\mathsf{Dist}\left(A\right)$ that chooses a distribution over the action set, given a finite path in $M$ . Fixing a policy $\pi$ induces, for each state $s$ , a unique probability measure $\mathbb{P}^{\pi}_{s}$ on the probability space over the Borel subsets of $\mathcal{I}(M)$ . This enables the evaluation of a policy, modulo a payoff and initial state $s$ , in expectation $\mathbb{E}^{\pi}_{s}$ . Let $\Pi^{M}$ be the set of all policies on the MDP $M$ . A policy is optimal for a payoff if it maximizes, amongst all other policies, the expected value of that payoff, and this maximal expectation is called the value of the payoff on $M$ .

Strategic Complexity.

The strategic complexity of a payoff characterizes the necessary structure required for a policy to be optimal. A qualitative aspect of strategic complexity is based on whether or not there exist environments for which optimal policies are necessarily probabilistic (mixed). A policy is deterministic (pure) if it returns a point distribution for every input. A policy is stationary if $\pi(s_{1}a_{1}\cdots a_{n-1}s_{n})=\pi(s_{n})$ holds at every time $n$ . The class of deterministic stationary policies is of special interest since there are finitely many such policies on any finite MDP; we consider these policies as functions $S\to A$ .

2.2 Discounted and Average Payoffs

Given a path $s_{1}a_{1}s_{2}\cdots$ in an MDP, two well-studied objectives are the discounted payoff, relative to a discount factor $\lambda\in(0,1)$ , and the average payoff, defined as

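$$\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\quad\text{and}\quad\liminf_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}R(s_{k},a_{k}).$$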

The discounted value and average value functions are defined

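$$V_{\lambda}(s)=\sup_{\pi\in\Pi^{M}}\mathbb{E}^{\pi}_{s}\left[\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\right]\quad\text{and}\quad V(s)=\sup_{\pi\in\Pi^{M}}\mathbb{E}^{\pi}_{s}\left[\liminf_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}R(s_{k},a_{k})\right].$$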

A stronger notion of optimality, specific to the discounted payoff, is Blackwell optimality. A policy $\pi$ is Blackwell optimal if there exists a discount factor $\lambda_{0}\in(0,1)$ such that $\pi$ is optimal for the discounted payoff with any discount factor in the interval $[\lambda_{0},1)$ .

An alternative characterization of the discounted value is as the unique solution to the optimality equation

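$$V_{\lambda}(s)=\max_{a\in A}\left\{R(s,a)+\lambda\mathbb{E}_{T}\left[V_{\lambda}(t)\mid s,a\right]\right\},$$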

which is the starting point for establishing the following result on the complexity of discounted and average payoffs [Puterman, 1994; Feinberg and Shwartz, 2012; Filar and Vrieze, 1996].

Theorem 1.

Both discounted and average payoffs permit deterministic stationary optimal policies. Moreover, optimal values for both payoffs can be computed in polynomial time.

2.3 Reinforcement Learning

Reinforcement learning (RL) [Sutton and Barto, 2018] is a sampling-based optimization paradigm based on the feedback received from the environment in the form of scalar rewards. The standard RL scenario assumes a discounted payoff, and model-free approaches typically leverage the state-action value, or Q-value, which is defined as the optimal value from state $s$ given that action $a$ has been selected, and which is the solution of the equation

$$Q_{\lambda}(s,a)=R(s,a)+\lambda\mathbb{E}_{T}\left[V_{\lambda}(t)\mid s,a\right].$$

The Q-value provides the foundation for the classic Q-Learning algorithm [Watkins and Dayan, 1992], which learns an optimal policy by approximating $Q_{\lambda}$ with a sequence $Q^{n}_{\lambda}$ of maps which asymptotically converge to $Q_{\lambda}$ . In particular, $Q^{1}_{\lambda}$ is initialized arbitrarily and then the agent explores the environment by selecting action $a=\operatorname*{\arg\max}_{a\in A}Q^{n}_{\lambda}(s,a)$ from the current state $s$ and performing the update

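$$Q_{\lambda}^{n+1}(s,a)\leftarrow Q_{\lambda}^{n}(s,a)+\alpha_{n}\left(R(s,a)+\lambda V_{\lambda}^{n}(t)-Q_{\lambda}^{n}(s,a)\right),\tag{1}$$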

in which $t$ is the next state as determined by the outcome of sampling the conditional distribution $T(\cdot\mid s,a)$ , the family of $\alpha_{n}\in(0,1)$ are time-dependent parameters called learning rates, and $V^{n}_{\lambda}(t)=\max_{a\in A}Q^{n}_{\lambda}(t,a)$ . The following theorem gives a sufficient condition for asymptotic convergence of the $Q$ -learning algorithm.

Theorem 2 (Watkins and Dayan [1992]).

If every state-action pair in the environmental decision process is encountered infinitely often and the learning rates $0\leq\alpha_{n}<1$ satisfy the Robbins-Monro conditions $\sum_{n=1}^{\infty}\alpha_{n}=\infty$ and $\sum_{n=1}^{\infty}\alpha_{n}^{2}<\infty$ , then $Q^{n}_{\lambda}(s,a)\to Q_{\lambda}(s,a)$ almost surely as $n\to\infty$ .

2.4 Depreciating Assets

We define variations on the discounted and average payoffs based on the idea that the value of an asset decays geometrically in proportion with the amount of time elapsed since it was obtained as a reward. That is, we consider the situation in which a payoff is determined not as a function of the sequence $\left\langle R(s_{n},a_{n})\right\rangle^{\infty}_{n=1}$ , but rather of the sequence

$$\left\langle\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k}\right\rangle_{n=1}^{\infty}$$

of exponential recency-weighted averages of the agent’s assets, where $\gamma\in(0,1)$ is a discount factor.
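The asset sequence also satisfies the one-step recursion $A_{n}=\gamma A_{n-1}+R(s_{n},a_{n})$ with $A_{0}=0$, which follows directly from the definition by factoring $\gamma$ out of the first $n-1$ terms. The sketch below compares the two formulations on a short reward prefix; the symbol $A_{n}$, the function names, and the sample values are assumptions introduced only for illustration.

```python
def assets_by_definition(rewards, gamma):
    """A_n = sum_{k=1}^{n} r_k * gamma**(n - k), evaluated term by term."""
    return [sum(rewards[k - 1] * gamma ** (n - k) for k in range(1, n + 1))
            for n in range(1, len(rewards) + 1)]


def assets_by_recursion(rewards, gamma):
    """Same sequence via A_n = gamma * A_{n-1} + r_n, starting from A_0 = 0."""
    values, current = [], 0.0
    for r in rewards:
        current = gamma * current + r   # old assets depreciate, new reward is added
        values.append(current)
    return values


rewards = [3, 4, 5, 3]   # prefix of the running example
gamma = 0.5              # illustrative depreciation factor (assumed)

explicit = assets_by_definition(rewards, gamma)
recursive = assets_by_recursion(rewards, gamma)
print(explicit)
print(recursive)
```

The recursive form needs only the current asset value and the incoming reward, so an agent can track its depreciated assets with constant memory.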

3 Discounted Depreciating Payoff

In this section, we study discounted optimization, for $\lambda\in(0,1)$ , under depreciating asset dynamics. The payoff in this setting is captured by the expression

$$\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k},$$

which has a corresponding value function

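$$V_{\lambda}^{\gamma}(s)=\sup_{\pi\in\Pi^{M}}\mathbb{E}_{s}^{\pi}\left[\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}R(s_{k},a_{k})\gamma^{n-k}\right].$$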

Let us now return to the used car dealership example.

Example 2 (Used Car Dealership Cont.).

Recognizing that cars depreciate continually after their first purchase, the employee realizes that their model should incorporate a notion of asset depreciation. After a bit of market research, the employee selects another discount factor $\gamma\in(0,1)$ to capture the rate at which automobiles typically lose value over a given time step. Using both discount factors $\lambda$ and $\gamma$ , the employee can model the scenario as a discounted depreciating optimization problem.

For the sake of simplicity, suppose that there are only two locations $s_{1}$ and $s_{2}$ from which to choose the next target market, and that the only point where the employee has more than one possible action is at the dealership $s_{d}$ (from where they can choose action $a_{1}$ to go to $s_{1}$ or $a_{2}$ to go to $s_{2}$ ). Realizing that it is unreasonable to plan without expecting unforeseen delays, the employee also introduces two parameters $\rho_{1}$ and $\rho_{2}$ , which are success rates for buying a desired vehicle in $s_{1}$ and $s_{2}$ respectively. Given that the agent is in location $s_{i}$ , the rate $\rho_{i}$ is interpreted as the probability that they find a seller and purchase a vehicle before the end of the day and thus $1-\rho_{i}$ is the probability that they fail to do so. This situation is represented graphically as a finite MDP in Figure 1, where actions are displayed in red, transition probabilities in blue, and immediate rewards (i.e. car values when they are stocked) in green. If an action is omitted from an edge label, then there is only one action $a$ available. If a transition probability is omitted, then the transition is deterministic, i.e. occurs with probability 1. If a reward value is omitted, then the reward obtained is 0.

In traditional discounted optimization, the discount factor $\lambda$ imposes a certain type of trade-off. Suppose, for instance, that $\rho_{1}$ is large while $r_{1}$ is small and that $\rho_{2}$ is small while $r_{2}$ is large. Then a small discount factor indicates that it may pay off more to take action $a_{1}$ since it is likely that taking $a_{2}$ will result in significant delays and thus diminish the value of the eventual reward $r_{2}$ . On the other hand, if the discount factor is close to 1, then it may be worth it for the agent to accept the high probability of delay since the eventual discounted value will be closer to $r_{2}$ .

Adding in the depreciation dynamics with discount factor $\gamma$ , the trade-off remains, but to what extent depreciation alters the dynamics of a given environment and policy is unclear. Intuition may suggest that introducing depreciation to discounted optimization should only make the risk-reward trade-off sharper, and one might further conjecture that when $\gamma$ is close to 0, the higher decay rate of cumulative asset value should drive an agent towards riskier behavior. On the other hand, it is plausible that a depreciation factor close to one might embolden the agent towards similar risky actions because the opportunity cost of such behavior diminishes as assets are accumulated in greater quantities. As we proceed with our analysis of the discounted depreciating payoff we attempt to shed light on questions like this and get to the core of what depreciation entails in this context.

Our first main result establishes a Bellman-type equational characterization of the discounted depreciating value.

Theorem 3 (Optimality Equation).

The discounted depreciating value is the unique solution of the equation

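$$V_{\lambda}^{\gamma}(s)=\max_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[V_{\lambda}^{\gamma}(t)\mid s,a\right]\right\}.$$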

Proof.

By splitting the term $\lambda^{n-1}$ occurring in the definition of the discounted depreciating payoff into the product $\lambda^{n-k}\lambda^{k-1}$ and distributing these factors into the inner summation, we obtain the expression

$$\sum_{n=1}^{\infty}\sum_{k=1}^{n}\lambda^{k-1}R(s_{k},a_{k})\lambda^{n-k}\gamma^{n-k}.\tag{3}$$

The next step of the proof relies on the following classical result of real analysis (cf. Theorem 3.50 of Rudin [1976]).


Mertens’ Theorem.

Let $\sum^{\infty}_{n=1}x_{n}=X$ and $\sum^{\infty}_{n=1}y_{n}=Y$ be two convergent series of real numbers. If at least one of the given series converges absolutely, then their Cauchy product converges to the product of their limits:

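$$\left(\sum_{n=1}^{\infty}x_{n}\right)\left(\sum_{n=1}^{\infty}y_{n}\right)=\sum_{n=1}^{\infty}\sum_{k=1}^{n}x_{k}y_{n-k+1}=XY.$$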

The series (3) may be factored into the Cauchy product

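$$\left(\sum_{n=1}^{\infty}(\lambda\gamma)^{n-1}\right)\left(\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\right),\tag{4}$$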

and since both terms in this Cauchy product converge absolutely, Mertens’ theorem applies. Thus, noticing that the left-hand series is geometric, the expression (4) is equivalent to

$$\frac{1}{1-\lambda\gamma}\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n}).$$

Consequently, the discounted depreciating value may be written as

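$$V_{\lambda}^{\gamma}(s)=\frac{1}{1-\lambda\gamma}\sup_{\pi\in\Pi^{M}}\mathbb{E}_{s}^{\pi}\left[\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{n},a_{n})\right]=\frac{V_{\lambda}(s)}{1-\lambda\gamma}.\tag{5}$$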

The equational characterization of the discounted value $V_{\lambda}$ now facilitates the derivation of the desired equational characterization of the discounted depreciating value $V_{\lambda}^{\gamma}$ as

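$$V_{\lambda}^{\gamma}(s)=\frac{1}{1-\lambda\gamma}\max_{a\in A}\left\{R(s,a)+\lambda\mathbb{E}_{T}\left[V_{\lambda}(t)\mid s,a\right]\right\}=\max_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[V_{\lambda}^{\gamma}(t)\mid s,a\right]\right\}.\tag{6}$$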

∎
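The identity $V_{\lambda}^{\gamma}=\frac{V_{\lambda}}{1-\lambda\gamma}$ and the optimality equation of Theorem 3 can be illustrated with value iteration on a toy MDP. The sketch below is only an illustration under stated assumptions: the two-state, two-action MDP, the discount factors, and the helper `value_iteration` are all hypothetical and do not come from the paper. The depreciating case is obtained simply by scaling immediate rewards by $\frac{1}{1-\lambda\gamma}$.

```python
def value_iteration(T, R, lam, reward_scale=1.0, sweeps=2000):
    """Iterate v(s) <- max_a [ reward_scale * R(s,a) + lam * E_T[v(t) | s,a] ]."""
    v = [0.0] * len(R)
    for _ in range(sweeps):
        v = [max(reward_scale * R[s][a]
                 + lam * sum(p * v[t] for t, p in enumerate(T[s][a]))
                 for a in range(len(R[s])))
             for s in range(len(R))]
    return v


# Hypothetical MDP: T[s][a][t] is the probability of moving from s to t under a.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
lam, gamma = 0.9, 0.5   # illustrative discount and depreciation factors

v_std = value_iteration(T, R, lam)                                      # V_lambda
v_dep = value_iteration(T, R, lam, reward_scale=1 / (1 - lam * gamma))  # V_lambda^gamma

for s in range(len(R)):
    # The last two columns should coincide.
    print(s, v_std[s], v_dep[s], v_std[s] / (1 - lam * gamma))
```

Because the update is a contraction with modulus $\lambda$ regardless of the reward scale, the same stopping rule works for both value functions.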

An immediate consequence of Theorem 3 is a characterization of the strategic complexity of discounted depreciating payoffs.

Corollary 1 (Strategic Complexity).

For any discounted depreciating payoff over any finite MDP, there exists an optimal policy that is stationary and deterministic.

Theorem 3 enables a number of extensively studied algorithmic techniques to be adapted for use under the discounted depreciating payoff. In particular, the equational characterization of the discounted depreciating value implies that it is the unique fixed point of a contraction mapping [Banach, 1922], which in turn facilitates the formulation of suitable variants of planning algorithms based on foundational methods such as value iteration and linear programming. This allows us to bound the computational complexity of determining discounted depreciating values in terms of the size of the environmental MDP and the given discount factors.

Theorem 4 (Computational Complexity).

The discounted depreciating value and a corresponding optimal policy are computable in polynomial time.

Proof.

Let $\delta_{i,j}=\begin{cases}1&\textnormal{if }i=j\\ 0&\textnormal{otherwise}\end{cases}$ be the Kronecker delta. Suppose that, for each state $s$ in the environment $M$ , we have an associated real number $0<x_{s}$ , chosen arbitrarily. The unique solution to the following linear program is the vector of values from each state of $M$ .

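$$\text{minimize}\ \sum_{s\in S}x_{s}v_{s}\quad\text{subject to}\quad\frac{R(s,a)}{1-\lambda\gamma}\leq\sum_{t\in S}v_{t}\left(\delta_{s,t}-\lambda T(t\mid s,a)\right)\quad\text{for all } s\in S,\ a\in A.\tag{7}$$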

From a solution $v^{*}$ to (7), an optimal policy can be obtained as

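$$\pi(s)=\operatorname*{arg\,max}_{a\in A}\left\{\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[v_{t}^{*}\mid s,a\right]\right\}.$$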

Alternatively, an optimal policy may be derived from the solution to the dual linear program given as follows.

[TABLE]

In particular, if $y^{*}$ is a solution to (8), then any policy $\pi$ for which the inequality $0<y^{*}_{s,\pi(s)}$ holds at every state is optimal. The correctness of these linear programs follows from the proof of Theorem 3. Since linear programs can be solved in polynomial time, the theorem follows. ∎
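Corollary 1 implies that, on a small MDP, an optimal policy can also be found by brute force over the finitely many deterministic stationary policies. The sketch below does this and then checks that the resulting values satisfy the inequality constraints implied by the optimality equation of Theorem 3, namely $\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[v_{t}\mid s,a\right]\leq v_{s}$ for every state-action pair. It is not a linear-programming solver and not the paper's algorithm; the toy MDP, parameters, and helper names are assumptions made for illustration.

```python
from itertools import product


def evaluate(policy, T, R, lam, gamma, sweeps=2000):
    """Discounted depreciating value of a deterministic stationary policy."""
    scale = 1.0 / (1.0 - lam * gamma)
    v = [0.0] * len(R)
    for _ in range(sweeps):
        v = [scale * R[s][policy[s]]
             + lam * sum(p * v[t] for t, p in enumerate(T[s][policy[s]]))
             for s in range(len(R))]
    return v


def constraints_hold(v, T, R, lam, gamma, tol=1e-9):
    """Check R(s,a)/(1 - lam*gamma) + lam * E_T[v(t) | s,a] <= v(s) for all s, a."""
    scale = 1.0 / (1.0 - lam * gamma)
    return all(scale * R[s][a] + lam * sum(p * v[t] for t, p in enumerate(T[s][a]))
               <= v[s] + tol
               for s in range(len(R)) for a in range(len(R[s])))


# Hypothetical two-state, two-action MDP; T[s][a][t] is a transition probability.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
lam, gamma = 0.9, 0.5

# An optimal policy maximizes the value at every state, hence also the sum of values.
policies = list(product(range(2), repeat=2))
best = max(policies, key=lambda pi: sum(evaluate(pi, T, R, lam, gamma)))
v_best = evaluate(best, T, R, lam, gamma)
feasible = constraints_hold(v_best, T, R, lam, gamma)
print(best, v_best, feasible)
```

Enumeration is exponential in the number of states, so it is only a sanity check; the polynomial-time bound in the theorem relies on linear programming.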

Theorem 3 allows the formulation of an associated Q-value

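$$Q_{\lambda}^{\gamma}(s,a)=\frac{R(s,a)}{1-\lambda\gamma}+\lambda\mathbb{E}_{T}\left[V_{\lambda}^{\gamma}(t)\mid s,a\right],$$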

which may be used to construct a Q-learning iteration scheme for discounted depreciating payoffs as

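$$Q_{\lambda}^{\gamma,n+1}(s,a)\leftarrow Q_{\lambda}^{\gamma,n}(s,a)+\alpha_{n}\left(\frac{R(s,a)}{1-\lambda\gamma}+\lambda V_{\lambda}^{\gamma,n}(t)-Q_{\lambda}^{\gamma,n}(s,a)\right),\quad\text{where}\quad V_{\lambda}^{\gamma,n}(t)=\max_{a\in A}Q_{\lambda}^{\gamma,n}(t,a).\tag{9}$$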

Theorem 5.

If each state-action pair of the environment is encountered infinitely often and the learning rates satisfy the Robbins-Monro convergence criteria

$$\sum_{n=1}^{\infty}\alpha_{n}=\infty\quad\text{and}\quad\sum_{n=1}^{\infty}\alpha_{n}^{2}<\infty,$$

then iterating (9) converges almost surely to the discounted depreciating Q-value as $n\to\infty$ :

[TABLE]

Proof.

Equations (5) and (6) show that the optimality equation for the discounted depreciating value reduces to the optimality equation for the discounted value, modulo a multiplicative factor dependent on $\lambda$ and $\gamma$ . It therefore follows that discounted depreciating Q-learning, via iteration of (9), converges in the limit to the optimal $Q^{\gamma}_{\lambda}$ under the same conditions that standard discounted Q-learning, via iteration of (1), converges in the limit to the optimal $Q_{\lambda}$ . Hence, we conclude that discounted depreciating Q-learning asymptotically converges given that each state-action pair is encountered infinitely often and that the convergence conditions in the theorem statement are satisfied by the learning rates. ∎
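The reduction used in this proof suggests a simple tabular implementation: run ordinary Q-learning, but scale each sampled reward by $\frac{1}{1-\lambda\gamma}$. The following is a minimal sketch under several assumptions of ours: a hypothetical two-state MDP, uniformly random exploration so that every state-action pair keeps being visited, and per-pair learning rates $(1+\text{visits})^{-0.7}$, which satisfy the Robbins-Monro conditions. None of these choices is prescribed by the text.

```python
import random


def depreciating_q_learning(T, R, lam, gamma, steps=200_000, seed=0):
    """Tabular Q-learning on rewards scaled by 1 / (1 - lam * gamma)."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - lam * gamma)
    n_states, n_actions = len(R), len(R[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        a = rng.randrange(n_actions)                           # uniform exploration (assumption)
        t = rng.choices(range(n_states), weights=T[s][a])[0]   # sample the successor state
        visits[s][a] += 1
        alpha = (1 + visits[s][a]) ** -0.7                     # Robbins-Monro learning rate
        target = scale * R[s][a] + lam * max(q[t])             # scaled reward plus bootstrap
        q[s][a] += alpha * (target - q[s][a])
        s = t
    return q


# Hypothetical MDP: T[s][a][t] is the probability of moving from s to t under a.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
lam, gamma = 0.5, 0.5   # illustrative discount and depreciation factors

q = depreciating_q_learning(T, R, lam, gamma)
greedy_values = [max(row) for row in q]
greedy_policy = [row.index(max(row)) for row in q]
print(greedy_values, greedy_policy)
```

For this toy MDP the greedy values should approach $\frac{V_{\lambda}(s)}{1-\lambda\gamma}$; the behavior policy here is purely exploratory, so the learned greedy policy is read off only at the end.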

3.1 Discussion

Besides the technical implications of Theorem 3, its proof provides some insight about the interplay between discounting and depreciation. A foundational result [Bewley and Kohlberg, 1976] in the theory of infinite-horizon optimization establishes that over a common MDP the discounted value asymptotically approaches the average value, up to a multiplicative factor of $(1-\lambda)$, as $\lambda$ approaches 1 from below:

$$\lim_{\lambda\to 1^{-}}(1-\lambda)V_{\lambda}=V.\tag{10}$$

Following this approach, we consider the asymptotic behavior of the discounted depreciating value when taking similar limits of the discount factors. Using the identity $V_{\lambda}^{\gamma}=\frac{V_{\lambda}}{1-\lambda\gamma}$ from equation (5) as the starting point for taking these limits yields the equations

[TABLE]

The relationships described by equations (11) and (12), illustrated by Figure 2, are justified conceptually by a simple interpretation that is helpful for building intuition around the behavior of the discounted depreciating payoff. One can think of the standard discounted payoff as a special case of the discounted depreciating payoff where $\gamma=0$. That is, the optimizing agent working towards maximizing a discounted payoff does not consider the value of their assets whatsoever at any point in time; the only quantity of concern from their perspective is the incoming stream of rewards. Interpreting $\gamma$ as a measure of the agent’s memory of past outcomes, it follows naturally that the discounted depreciating payoff reduces to the discounted payoff when the agent has no recollection whatsoever. Connecting this notion back to depreciation, it can be argued that, from the agent’s perspective, externally driven depreciation of assets is morally equivalent to an internally driven perception of depreciation based on an imperfect recollection of past events.

Conversely, an agent with a perfect memory operating under a discounted payoff would end up maximizing this payoff on the sequence of cumulative assets $\left\langle\sum^{n}_{k=1}R(s_{k},a_{k})\right\rangle^{\infty}_{n=1}$ rather than the sequence $\left\langle R(s_{n},a_{n})\right\rangle_{n=1}^{\infty}$ of immediate rewards. Assuming positive immediate rewards, this results in a greater value than would be obtained on the reward sequence itself, as evidenced by the plot in Figure 2. As a consequence of the contraction property resulting from the standard discounting, the overall sum converges in spite of the fact that the cumulative asset stream may not be bounded.
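The two extremes discussed above can be checked numerically on a long finite prefix of a reward stream. The following is a rough sketch: the helper names are hypothetical, the periodic reward stream 3, 4, 5 is an arbitrary choice, and truncating the infinite sums at a finite horizon is an approximation that is only accurate because $\lambda^{n}$ becomes negligible.

```python
def discounted_payoff(stream, lam):
    """Truncated discounted sum: sum over n of lam^(n-1) * stream_n."""
    return sum(x * lam**n for n, x in enumerate(stream))


def asset_stream(rewards, gamma):
    """Asset value after each step: a_n = gamma * a_(n-1) + r_n."""
    assets, current = [], 0.0
    for r in rewards:
        current = gamma * current + r
        assets.append(current)
    return assets


def discounted_depreciating_payoff(rewards, lam, gamma):
    """Discounted payoff evaluated on the depreciating asset stream."""
    return discounted_payoff(asset_stream(rewards, gamma), lam)


rewards = [3, 4, 5] * 200  # hypothetical periodic reward stream
lam = 0.9
baseline = discounted_payoff(rewards, lam)

# gamma = 0: no memory, so the payoff coincides with the discounted payoff.
no_memory = discounted_depreciating_payoff(rewards, lam, 0.0)
# gamma = 1: perfect memory, so the payoff is taken over cumulative assets.
perfect_memory = discounted_depreciating_payoff(rewards, lam, 1.0)
# Intermediate gamma: compare against baseline / (1 - lam * gamma).
partial_memory = discounted_depreciating_payoff(rewards, lam, 0.5)
```

Under these assumptions, `no_memory` agrees with `baseline`, while `partial_memory` and `perfect_memory` agree with `baseline / (1 - lam * gamma)` for $\gamma=0.5$ and $\gamma=1$ respectively, up to truncation and rounding error.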

4 Average Depreciating Payoff

Let us now consider the asymptotic average evaluation criterion, given that assets depreciate. The payoff of an outcome in this context is defined as

[TABLE]

and the associated average depreciating value function is

[TABLE]

Our main result in this section asymptotically relates the average depreciating value and the discounted depreciating value.

Theorem 6 (Tauberian Theorem).

The discounted depreciating value, scaled by $(1-\lambda)$, converges to the average depreciating value as $\lambda\to 1$ from below:

$$\lim_{\lambda\to 1^{-}}(1-\lambda)V_{\lambda}^{\gamma}=V^{\gamma}.$$

The proof of Theorem 6 uses the following pair of lemmas.

Lemma 1.

For any finite path in the environmental MDP,

[TABLE]

Proof.

We proceed by induction on $n$.

Base case.

Suppose that $n=1$. Then both expressions occurring in (13) evaluate to $R(s_{1},a_{1})$.

Inductive case.

Suppose that (13) holds for $n-1$. By splitting the summation on the left-hand side of (13), we obtain the expression

[TABLE]

Factoring $\frac{n-1}{n}$ from the double summation in this expression yields

[TABLE]

Now, applying the inductive hypothesis, this may be rewritten as

[TABLE]

Factoring out $\frac{1}{n(1-\gamma)}$ from the entire expression, we get

[TABLE]

Distributing through the numerator results in the expression

[TABLE]

and removing those terms that cancel additively yields

[TABLE]

Finally, we obtain (13) by factoring the numerator one last time:

[TABLE]

thereby proving that if (13) holds for paths of length $n-1$, then it also holds for paths of length $n$. ∎
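The double summation manipulated in this proof can also be collapsed directly by exchanging the order of summation. The following is a sketch under the assumption that the asset value at step $k$ is $\sum_{i=1}^{k}\gamma^{k-i}R(s_{i},a_{i})$ with $0\leq\gamma<1$, in line with the description of depreciation in the abstract; the shorthand $r_{i}=R(s_{i},a_{i})$ is introduced here only for brevity.

```latex
\begin{align*}
\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{k}\gamma^{k-i}r_{i}
  &= \frac{1}{n}\sum_{i=1}^{n}r_{i}\sum_{k=i}^{n}\gamma^{k-i}
  && \text{(each $r_{i}$ appears in the assets of steps $k=i,\dots,n$)} \\
  &= \frac{1}{n}\sum_{i=1}^{n}r_{i}\,\frac{1-\gamma^{\,n-i+1}}{1-\gamma}
  && \text{(finite geometric series)}.
\end{align*}
```

For $n=1$ both sides reduce to $r_{1}$, which matches the base case of the induction.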

Lemma 2.

For any infinite path in the environmental MDP,

[TABLE]

Proof.

Factoring out the constant term in the denominator of the left-hand side of the claimed equation, we obtain the equivalent expression

[TABLE]

Since the environmental MDP is assumed to be finite, there are finitely many possible reward values and we can bound the summation in the above expression as

[TABLE]

where $r_{\downarrow}=\min_{(s,a)\in S\times A}R(s,a)$ and $r_{\uparrow}=\max_{(s,a)\in S\times A}R(s,a)$. Lastly, noticing that

[TABLE]

it follows that

[TABLE]

∎

Now we are in a position to prove Theorem 6.

Proof of Theorem 6.

In light of equation (10), it is sufficient to prove the identity $V^{\gamma}=\frac{V}{1-\gamma}$. Applying Lemma 1, the average depreciating payoff may be rewritten as

[TABLE]

Distributing the product in the numerator and then breaking the summation into a difference of summations yields the expression

[TABLE]

By Lemma 2, the right-hand term in this difference tends to 0 as $n\to\infty$, and so the above expression is equivalent to

[TABLE]

Factoring out the constant term in the denominator, the remaining limit term is exactly the definition of the average payoff, and thus we conclude, for any state $s$, that

$$V^{\gamma}(s)=\frac{V(s)}{1-\gamma}.$$

∎
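The identity $V^{\gamma}=\frac{V}{1-\gamma}$ at the heart of this proof can be sanity-checked on a single reward stream. The sketch below uses finite-horizon averages as a stand-in for the limit; the function names and the periodic stream 3, 4, 5 are hypothetical choices, and the asset recursion $a_{k}=\gamma a_{k-1}+r_{k}$ is assumed to be the depreciation rule.

```python
def average_payoff(rewards):
    """Finite-horizon average of the immediate rewards."""
    return sum(rewards) / len(rewards)


def average_depreciating_payoff(rewards, gamma):
    """Finite-horizon average of asset values a_k = gamma * a_(k-1) + r_k."""
    assets, total = 0.0, 0.0
    for r in rewards:
        assets = gamma * assets + r  # depreciate old assets, add new reward
        total += assets
    return total / len(rewards)


rewards = [3, 4, 5] * 10_000  # hypothetical periodic reward stream
gamma = 0.5

plain = average_payoff(rewards)
depreciating = average_depreciating_payoff(rewards, gamma)
predicted = plain / (1 - gamma)
```

The gap between `depreciating` and `predicted` is the term that Lemma 2 shows to vanish; for a bounded reward stream it shrinks on the order of $\frac{1}{n}$ as the horizon $n$ grows.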

As a direct consequence of Theorem 6, there exists a Blackwell optimal policy, that is, a policy that is optimal for $V_{\lambda}^{\gamma}$ whenever $\lambda$ is sufficiently close to 1 and that is also optimal for $V^{\gamma}$.

Corollary 2.

There exists a discount factor $\lambda_{0}\in(0,1)$ and a policy $\pi$ such that, for all $\lambda\in[\lambda_{0},1)$ and every state $s$, it holds that

[TABLE]

In turn, this implies the following result on the strategic complexity for the average depreciating payoff.

Corollary 3 (Strategic Complexity).

For any average depreciating payoff over any finite MDP, there exists an optimal policy that is stationary and deterministic.

5 Related Work

Discounted and average payoffs have played central roles in the theory of optimal control and reinforcement learning. A multitude of deep results exist connecting these objectives [Bewley and Kohlberg, 1976, 1978; Mertens and Neyman, 1981; Andersson and Miltersen, 2009; Chatterjee et al., 2011; Chatterjee and Majumdar, 2012; Ziliotto, 2016a, b, 2018] in addition to an extensive body of work on algorithms for related optimization problems and their complexity [Filar and Schultz, 1986; Raghavan and Filar, 1991; Raghavan and Syed, 2003; Chatterjee et al., 2008; Chatterjee and Ibsen-Jensen, 2015].

The value of depreciating assets is defined as a past-discounted sum of rewards. Past-discounted sums for finite sequences were studied in the context of optimization [Alur et al., 2012] and are closely related to the exponential recency-weighted average, a technique used in nonstationary multi-armed bandit problems [Sutton and Barto, 2018] to estimate the average reward of different actions by giving more weight to recent outcomes. However, to the best of our knowledge, depreciating assets have not been formally studied as a payoff function.

Discounted objectives have found significant applications in areas of program verification and synthesis [de Alfaro et al., 2003; Cerný et al., 2011]. Although the idea of past operators is quite old [Lichtenstein et al., 1985], only relatively recently have a number of classical formalisms, including temporal logics such as LTL and CTL and the modal $\mu$-calculus, been extended with past-tense operators and with discounted quantitative semantics [de Alfaro et al., 2005; Almagor et al., 2014, 2016; Littman et al., 2017]. A particularly significant result [Markey, 2003] around LTL with classical Boolean semantics is that, while LTL with past operators is no more expressive than standard LTL, it is exponentially more succinct. It remains open whether this type of relationship holds for other logics and their extensions by past operators when interpreted with discounted quantitative semantics [Almagor et al., 2016].

6 Conclusion

In stochastic optimal control and reinforcement learning, agents select their actions to maximize a discounted payoff associated with the resulting sequence of scalar rewards. This interaction models the way dopamine-driven organisms maximize their reward sequence based on their capability to delay gratification (discounting). While this paradigm provides a natural model in the context of streams of immediate rewards, when the valuations and objectives are defined in terms of assets that depreciate, the problem cannot be directly modeled in the classic framework. We initiated the study of optimization and learning for depreciating assets, and showed a surprising connection between these problems and traditional discounted problems. Our result enables solving optimization problems under depreciation dynamics by adapting the algorithmic infrastructure that has been developed extensively over the last several decades for classic optimization problems.

We believe that depreciating assets may provide a useful abstraction to a number of related problems. The following points sketch some of these directions and state several problems that remain open.

$\blacktriangleright$ Regret minimization [Cesa-Bianchi and Lugosi, 2006] is a popular criterion in the setting of online learning, where a decision-maker chooses her actions so as to minimize the average regret, that is, the difference between the realized reward and the reward that could have been achieved. We posit that imperfect decision makers may view their regret in a depreciated sense, since a suboptimal action in the recent past tends to cause more regret than an equally suboptimal action in the distant past. We hope that the results of this work spur further interest in developing foundations of past-discounted characterizations of regret in online learning and optimization.

$\blacktriangleright$ In solving multi-agent optimization problems, a practical assumption involves bounding the capability of any adversary by assuming that they have a limited memory of the history of interaction, and this can be modeled via a discounting of past outcomes. From our results it follows that two-player zero-sum games with depreciation dynamics under both discounted and average payoffs can be reduced to classic optimization games modulo some scaling of the immediate rewards.

$\blacktriangleright$ The notion of state-based discount factors has been studied in the context of classic optimization and learning. Is it possible to extend the results of this paper to the setting with state-dependent depreciation factors? Such an extension does not directly follow from the tools developed in this paper, and it remains an open problem.

$\blacktriangleright$ Continuous-time MDPs (CTMDPs) provide a dense-time analog of discrete-time MDPs, and optimization and RL algorithms for such systems are well understood. Is it possible to solve optimization and learning for CTMDPs with depreciating assets?

Bibliography

Ackerman [1973] Susan Rose Ackerman. Used cars as a depreciating asset. Economic Inquiry, 11(4):463, 1973.
Almagor et al. [2014] Shaull Almagor, Udi Boker, and Orna Kupferman. Discounting in LTL. In Tools and Algorithms for the Construction and Analysis of Systems, TACAS, volume 8413 of LNCS, pages 424–439. Springer, 2014. URL https://doi.org/10.1007/978-3-642-54862-8_37.
Almagor et al. [2016] Shaull Almagor, Udi Boker, and Orna Kupferman. Formally reasoning about quality. J. ACM, 63(3):24:1–24:56, 2016. URL https://doi.org/10.1145/2875421.
Alur et al. [2012] Rajeev Alur, Loris D’Antoni, Jyotirmoy V. Deshmukh, Mukund Raghothaman, and Yifei Yuan. Regular functions, cost register automata, and generalized min-cost problems, 2012.
Andersson and Miltersen [2009] Daniel Andersson and Peter Bro Miltersen. The complexity of solving stochastic games on graphs. In Algorithms and Computation ISAAC, volume 5878 of LNCS, pages 112–121. Springer, 2009. URL https://doi.org/10.1007/978-3-642-10631-6_13.
Banach [1922] Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fund. math, 3(1):133–181, 1922.
Beckerman [1991] Paul Beckerman. The economics of high inflation. Springer, 1991.
Bewley and Kohlberg [1976] Truman Bewley and Elon Kohlberg. The asymptotic theory of stochastic games. Mathematics of Operations Research, 1(3):197–208, 1976.
