Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function

Akihiro Tamatsukuri; Tatsuji Takahashi

arXiv:1812.05795·cs.AI·April 16, 2025


TL;DR

This paper introduces a risk-sensitive satisficing (RS) model for reinforcement learning that is guaranteed to find a satisfactory action and, when the aspiration level is set optimally, has finite regret in bandit problems, offering a practical alternative to optimality-focused methods.

Contribution

The paper presents the RS model that guarantees satisficing solutions and finite regret, with theoretical proofs and empirical validation in bandit tasks.

Findings

1. RS guarantees finding an action above the aspiration level.

2. Expected regret of RS is finite under optimal aspiration levels.

3. Numerical simulations confirm theoretical results and compare favorably with other algorithms.


Equations

$\mathrm{regret}(n)=\sum_{i=1}^{K}(p_{i^{\ast}}-p_{i})\,E[n_{i}(n)]$

$E_{i}=n_{i}^{1}/(n_{i}^{1}+n_{i}^{0})$

$\delta_{i}=E_{i}-R$

$RS_{i}=n_{i}\delta_{i}=n_{i}(E_{i}-R)$

$R=(p_{\mathrm{1st}}+p_{\mathrm{2nd}})/2$

$E(a_{i},s)=n_{i}^{1}(s)/n_{i}(s)$

$RS(a_{i},s)=n_{i}(s)\cdot\bigl(E(a_{i},s)-R\bigr)$

$P\Bigl(\operatorname*{arg\,max}_{a_{i}}RS(a_{i},s)\in A_{U}\Bigr)=1\quad(s\rightarrow\infty)$

$\forall\,i\in I_{L},\quad P\bigl(\#N_{i}=\infty\Leftrightarrow RS(a_{i},s)\rightarrow-\infty\ (s\rightarrow\infty)\bigr)=1$

$RS(a_{i},s)<n_{i}(s)\cdot\bigl(p_{i}+\frac{R-p_{i}}{2}-R\bigr)=n_{i}(s)\cdot\frac{p_{i}-R}{2}<0$

$\exists\,i\in I_{U},\quad P(\#N_{i}=\infty)=1$

$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\bigm|\forall i\in I_{U},\ \#N_{i}<\infty\bigr)=1$

$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\ \text{and}\ \forall i\in I_{U},\ \#N_{i}<\infty\bigr)=0$

$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\ \text{and}\ \forall i\in I_{U},\ \#N_{i}<\infty\bigr)=P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\bigm|\forall i\in I_{U},\ \#N_{i}<\infty\bigr)\,P(\forall i\in I_{U},\ \#N_{i}<\infty)$

$RS(a_{k},s)>n_{k}(s)\cdot\bigl(p_{k}+\frac{R-p_{k}}{2}-R\bigr)=n_{k}(s)\cdot\frac{p_{k}-R}{2}>0$

$RS(a_{i},s)=n_{i}^{1}(s)-n_{i}(s)R=(X_{i,1}-R)+(X_{i,2}-R)+\dots+(X_{i,n_{i}(s)}-R)$

$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{i})/2\}\,(n_{1}(s)+n_{i}(s))+\{(p_{1}+p_{i})/2-R\}\,(n_{1}(s)-n_{i}(s))$

$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{i})/2\}\,(n_{1}(s)+n_{i}(s))+\{(p_{i}-p_{2})/2\}\,(n_{1}(s)-n_{i}(s))$

$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{2})/2\}\,s$

$\sigma_{1,i}=\max(\sigma_{1},\sigma_{i})$

$P[s=n+1,I=i]\leq P[\Delta RS_{i}(n)\leq 0]=Q\biggl(\frac{E[\Delta RS_{i}(n)]}{\sqrt{V[\Delta RS_{i}(n)]}}\biggr)\leq Q\biggl(\frac{(p_{1}-p_{2})\sqrt{n}}{2\sigma_{1,i}}\biggr)=Q(\phi_{i}\sqrt{n})$


Full text

Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function

Akihiro Tamatsukuri

Graduate School of Advanced Science and Engineering, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama 350-0394, Japan

Tatsuji Takahashi

School of Science and Engineering, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama 350-0394, Japan

Dwango Artificial Intelligence Laboratory, 5-24-5 Hongo, Bunkyo, Tokyo 113-0033, Japan


Abstract

As reinforcement learning algorithms are being applied to increasingly complicated and realistic tasks, it is becoming increasingly difficult to solve such problems within a practical time frame. Hence, we focus on a satisficing strategy that looks for an action whose value is above the aspiration level (analogous to the break-even point), rather than the optimal action. In this paper, we introduce a simple mathematical model called risk-sensitive satisficing ( $RS$ ) that implements a satisficing strategy by integrating risk-averse and risk-prone attitudes under the greedy policy. We apply the proposed model to the $K$ -armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. The first is that $RS$ is guaranteed to find an action whose value is above the aspiration level. The second is that the regret (expected loss) of $RS$ is upper bounded by a finite value, given that the aspiration level is set to an “optimal level” so that satisficing implies optimizing. We confirm the results through numerical simulations and compare the performance of $RS$ with that of other representative algorithms for the $K$ -armed bandit problems.

Introduction

Reinforcement learning (RL), a framework for learning and control in which agents search for proper actions in an environment through trial and error, has witnessed rapid development in recent years, as evidenced by the super-human performances of deep Q-networks (DQN) [1] in video game playing and AlphaGo [2] in the game of Go. Moreover, the application range of RL extends not only to more complicated tasks on computers but also to the control of robots [3] and unmanned aerial vehicles (UAVs) [4] in the real world.

As RL algorithms are being applied to increasingly complicated and realistic tasks, the limits of sensors, processors, and actuators of agents are posing serious obstacles for conventional optimization algorithms. Simon proposed the notion of bounded rationality as the principle underlying agents’ behavior under resource limits [5]. A bounded rational agent may appear to behave irrationally, but by considering the limits and constraints, the agent’s behavior can be understood as rational. Bounded rationality has attracted considerable attention in recent years. Computational rationality [6], which has been claimed to integrate the three fields of neuroscience (brain), cognitive science (mind), and artificial intelligence (machine) [7], is an updated form of bounded rationality. Further, it has been proposed that abstraction and hierarchy, which have been considered to enable flexible and efficient cognition of humans [8], result from the above-mentioned limitations and are bounded rational [9].

The representative decision making policy in the theory of bounded rationality is satisficing [10, 11]. Satisficing agents do not keep searching for the optimal action; instead, they stop searching when an action whose quality is above a certain level (aspiration) is found. The satisficing strategy has not attracted much attention in reinforcement learning, except for a few studies [12, 13] (to be discussed later). In previous studies [14, 15], one of the authors proposed a simple satisficing value function called risk-sensitive satisficing ( $RS$ ) and empirically validated its effectiveness through numerical simulations of reinforcement learning tasks.

In this paper, we apply $RS$ to the $K$ -armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. First, we prove that $RS$ is guaranteed to find a satisfactory action: if the $RS$ agent chooses an action in each trial and the number of trials is sufficient, the agent can stably choose an action whose value is above the aspiration level. Second, we prove the finiteness of the regret of $RS$ . In general, the performance of algorithms in the $K$ -armed bandit problems is measured by how small their regret (expected loss) is. It is known that the regret increases at least in the logarithmic order with the number of trials [16]. Therefore, the regret increases infinitely as the trials are repeated. However, we prove that if a small amount of information on the reward distributions is available so that the aspiration level is set to an “optimal level” (hence, satisficing entails optimizing), then the regret of $RS$ is upper bounded by a finite value. We confirm these results by numerical simulations and compare the performance of $RS$ with that of other representative algorithms for the $K$ -armed bandit problems. Finally, we conclude the paper with a discussion on the possible applications of $RS$ and the theoretical significance of this work.

Methods

$K$ -armed Bandit Problems

The $K$ -armed bandit problems that we deal with in this paper are as follows. Let there be $K$ actions $\{a_{1},a_{2},\dotsc,a_{K}\}$ that lead to a reward of 1 or 0 according to the reward probabilities $\{p_{1},p_{2},\dotsc,p_{K}\}$, which are unknown to the agent. If the agent chooses action $a_{i}$, it acquires a reward of 1 with probability $p_{i}$ or a reward of 0 with probability $1-p_{i}$. The goal of the repeated choices is to maximize the expected accumulated reward, which is equivalent to minimizing the regret (the expected cumulative loss). Let $a_{i^{\ast}}$ denote the action with the maximal reward probability (i.e., $p_{i^{\ast}}=\max_{i}p_{i}$ ). The regret at the end of the $n$ -th step (one step means one trial) is defined as follows.

$\mathrm{regret}(n)=\sum_{i=1}^{K}(p_{i^{\ast}}-p_{i})\,E[n_{i}(n)],$

where $n_{i}(n)$ is the number of times action $a_{i}$ is chosen from the first to the $n$ -th step (simply written as $n_{i}$ when the number of steps is not explicitly indicated) and $E[\,\cdot\,]$ is the expectation. Regret represents the expected loss, i.e., how inferior the cumulative expected reward from the actually chosen actions is to the cumulative expected reward obtained when the optimal action is chosen from the first step onward. The smaller the regret, the better the performance of the algorithm. The minimum value of the regret is zero, attained when the optimal action is chosen in all the steps. It has been proven that the regret grows at least logarithmically in the number of steps $n$ [16].
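As a small illustration of the regret definition, the following Python sketch computes the expected loss from the reward probabilities and the expected number of times each action is chosen. The function name `regret` and the example numbers are assumptions made for illustration only; they are not values from the paper.

```python
def regret(probs, expected_counts):
    """Regret = sum_i (p_best - p_i) * E[n_i(n)]."""
    p_best = max(probs)
    return sum((p_best - p) * n for p, n in zip(probs, expected_counts))

# Hypothetical example: three arms and 100 steps in total.
# Only the pulls of the two suboptimal arms contribute to the regret.
example = regret([0.7, 0.5, 0.2], [80, 15, 5])
print(example)  # approximately 0.2 * 15 + 0.5 * 5
```

The optimal arm contributes nothing to the sum, so the regret is zero exactly when every pull goes to the optimal arm.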

As for action selection by the agent, the basic policy is to take the action with the highest value (the greedy method). The basic valuation of action $a_{i}$ is based on its mean reward:

$E_{i}=\dfrac{n_{i}^{1}}{n_{i}^{1}+n_{i}^{0}},$

where $n_{i}^{r}$ is the number of times $a_{i}$ is chosen and the reward $r$ is acquired. $n_{i}$ , i.e., the number of times the action $a_{i}$ is chosen, satisfies $n_{i}=n_{i}^{1}+n_{i}^{0}$ and $n=\sum_{i=1}^{K}n_{i}$ . Under the greedy method with the mean reward valuation, if there is a non-optimal action $a_{i}\,\,\,(i\neq i^{\ast})$ that has a high value in early trials, there is a risk of $a_{i}$ being chosen all along. Each of the other actions must be tried for an appropriate number of times so that the optimal action is found in a timely manner. Merely choosing the action with the highest value based on the accumulated knowledge (exploitation) does not suffice, and various actions must be tried (exploration). Various algorithms have been proposed to balance exploitation and exploration.

Models of Satisficing

We introduce two models of satisficing at the levels of policy and value function. The policy model follows the standard description of satisficing. The second model is the risk-sensitive value function that we analyze and test in this paper. The former is tested through simulations for comparison with the latter.

Policy Satisficing ( $PS$ ) Model

A standard definition of satisficing is to keep exploring until an action whose value is above the aspiration level $R$ is found and to then stop searching and keep choosing the action (exploit). Satisficing, unlike optimization, can reduce the search cost because it does not involve searching for all actions and deciding on the optimal action. This is formulated as a policy (of reinforcement learning) as follows. If there exists at least one action whose mean reward is above the aspiration level $R$ , exploitation (following the greedy method) is executed. Otherwise, when the mean reward of all the actions is below the aspiration level $R$ , an action is randomly chosen. We refer to this algorithm as policy satisficing ( $PS$ ).
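The $PS$ rule described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `ps_choose` is hypothetical, and treating a mean reward exactly equal to $R$ as unsatisfactory is an assumption, since the text does not specify the tie case.

```python
import random

def ps_choose(means, R, rng=random):
    """Policy satisficing: exploit if any mean reward exceeds R, else explore."""
    best = max(range(len(means)), key=lambda i: means[i])
    if means[best] > R:
        return best                   # greedy exploitation of a satisfactory action
    return rng.randrange(len(means))  # nothing satisfactory yet: choose at random

# Arm 1 is the only arm whose mean reward exceeds R = 0.6, so it is exploited.
print(ps_choose([0.2, 0.8, 0.5], R=0.6))
```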

Risk-sensitive Satisficing ( $RS$ ) Value Function

One of the authors has proposed a value function called risk-sensitive satisficing ( $RS$ ) that realizes satisficing action selection behavior when operated under the greedy policy [14, 15] (see Supplementary Information for its relationship with other models). Before introducing the model, we first define the difference between the mean reward $E_{i}$ of action $a_{i}$ and the aspiration level $R$ :

$\delta_{i}=E_{i}-R.$

If there exists a positive $\delta_{i}$ , then the agent will choose such $a_{i}$ and be satisfied; otherwise, it will be unsatisfied. $RS$ is defined as follows [14]:

$RS_{i}=n_{i}\delta_{i}=n_{i}(E_{i}-R).$

This value is used under the greedy policy: the agent chooses the action $a_{i}$ with the maximal $RS_{i}$ value.

$RS$ integrates two risk-sensitive satisficing behaviors. When unsatisfied, $RS$ is risk-seeking, leading to optimistic exploration. If $\delta_{i}<0$ for all $i$, then actions with smaller $n_{i}$ are prioritized. Let $R=0.7$ and let there be two unsatisfactory actions $a_{1}$ and $a_{2}$ with $E_{1}=0.4<E_{2}=0.6$ and $n_{1}=7,n_{2}=2$. Then, $RS_{1}=-2.1<RS_{2}=-0.2$; hence, $a_{2}$ is chosen. This preference for a less tried action can be interpreted as the optimistic expectation that the action's actual reward probability $p_{i}$ may lie above $R$. There might be some $p_{i}>R$ even though, thus far, $E_{i}<R$ for all the actions. In terms of looking for a satisfactory action, it is rational to try actions with smaller $n_{i}$. This accords with the motto “optimism in the face of uncertainty,” which is considered a general and rational exploration strategy in reinforcement learning [17]. The UCB model described later implements this idea [18].

When satisfied, $RS$ is risk-averse, performing pessimistic exploitation. If there is only one $a_{i}$ for which $\delta_{i}$ is positive, the agent will keep choosing it. If there are multiple actions with positive $\delta_{i}$, then the actions with larger $n_{i}$ are prioritized. Let $R=0.3$, and let there be two satisfactory actions $a_{1}$ and $a_{2}$ with $E_{1}=0.4<E_{2}=0.6$ and $n_{1}=7,n_{2}=2$, the same values as in the example above. Then, $RS_{1}=0.7>RS_{2}=0.6$; hence, $a_{1}$ is chosen. In this case, a more tried action is preferred. This can be interpreted as the pessimistic expectation that the action's actual reward probability $p_{i}$ may lie below $R$. It is possible that $a_{i}$ is a spuriously satisfactory action, with $E_{i}>R$ but $p_{i}<R$. In terms of looking for a truly satisfactory action and avoiding spuriously satisfactory ones, it is rational to prefer actions for which $E_{i}>R$ is supported by a larger $n_{i}$.
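The two worked examples (unsatisfied with $R=0.7$, satisfied with $R=0.3$) can be reproduced with a short Python sketch. The helper names `rs_value` and `rs_choose` are hypothetical; the numbers are the ones used in the text.

```python
def rs_value(n_i, e_i, R):
    """RS_i = n_i * (E_i - R)."""
    return n_i * (e_i - R)

def rs_choose(counts, means, R):
    """Greedy choice: index of the action with the largest RS value."""
    values = [rs_value(n, e, R) for n, e in zip(counts, means)]
    return max(range(len(values)), key=lambda i: values[i])

counts, means = [7, 2], [0.4, 0.6]

# Unsatisfied case: both RS values are negative, the less tried a_2 (index 1) wins.
unsatisfied = rs_choose(counts, means, R=0.7)

# Satisfied case: both RS values are positive, the more tried a_1 (index 0) wins.
satisfied = rs_choose(counts, means, R=0.3)

print(unsatisfied, satisfied)
```

The same counts and mean rewards lead to opposite choices, which is the switch between risk-seeking and risk-averse behavior controlled by the sign of $\delta_{i}$.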

Setting of the Aspiration Level

The aspiration level $R$ defines the boundary between satisfactory and unsatisfactory, analogous to the break-even point between gain and loss or the neutral reference outcome in prospect theory [19]. It can be set according to the agent's internal needs or its knowledge of the environment. As an ecological example, let the agent be an animal, and let the rewards 1 and 0 represent the presence and absence of food. If the action is to look for food at one of multiple feeding grounds and the agent has to obtain food about once every two days for survival, then $R$ would be $0.5$ or higher.

Optimization can be viewed as a special case of satisficing. If $R$ lies between the two reward probabilities of the optimal and second-optimal actions, then satisficing above $R$ means optimizing. Let us call such $R$ “an optimal aspiration level”. Let the highest reward probability be $p_{\mathrm{1st}}$ and the second-highest one be $p_{\mathrm{2nd}}$ . $R$ can be set optimally as follows:

$R=\dfrac{p_{\mathrm{1st}}+p_{\mathrm{2nd}}}{2}.$

It is known that the regret increases at least in $\mathcal{O}(\log n)$ with the number of steps $n$ [16]. This is the result of assuming no knowledge of the agent on the reward distribution. By relaxing this assumption and allowing $R$ to be set as in Eq. 5, it will be shown that the regret is upper bounded by a finite value as in Proposition 2 described later.

Note that having an optimal aspiration level does not make a $K$ -armed bandit problem trivial. Even if we know a point between the optimal and second-optimal actions, we do not know exactly which action is optimal. Efficient identification of such an action is not trivial. In the next section, $RS$ will be compared in terms of its performance with other algorithms, one of which needs some similar information on the reward distribution to be optimal.
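For simulation purposes, where the reward probabilities are known to the experimenter (though not to the agent), an optimal aspiration level can be computed as the midpoint between the two highest reward probabilities. The helper name `optimal_aspiration` is an assumption for illustration.

```python
def optimal_aspiration(probs):
    """Midpoint between the highest and second-highest reward probabilities."""
    first, second = sorted(probs, reverse=True)[:2]
    return (first + second) / 2.0

# Hypothetical arms: the level lands between 0.6 and 0.8, so only the best arm is satisfactory.
R = optimal_aspiration([0.2, 0.8, 0.6])
print(R)
```

Any other point strictly between the two highest probabilities would also make satisficing coincide with optimizing; the midpoint is simply the choice used in the text.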

Results

Analysis

We perform theoretical analysis of the basic satisficing and optimizing properties of $RS$ . First, in Proposition 1, we prove that $RS$ can stably choose actions above the aspiration level after a sufficient number of steps. Second, in Proposition 2, we prove that the regret of $RS$ is upper bounded when an optimal aspiration level is given and satisficing becomes optimizing.

Guarantee of Satisficing

In the proof of Proposition 1, we adopt symbols clearly indicating the step number ( $s$ ) and the chosen action ( $a_{i}$ ) as follows. Both of the following represent values after $s$ steps: the mean reward

$E(a_{i},s)=\dfrac{n_{i}^{1}(s)}{n_{i}(s)}$

and the $RS$ value

$RS(a_{i},s)=n_{i}(s)\cdot\bigl(E(a_{i},s)-R\bigr).$

Proposition 1 (Theoretical Guarantee of Satisficing).

Let $p_{i}$ be the reward probability of action $a_{i}$ $(i=1,2,\dotsc,K)$ . Let $A_{U}$ be the set of actions whose reward probability is greater than the aspiration level $R$ , and let $A_{L}$ be the set of actions whose reward probability is smaller than $R$ . Let $I_{U}=\{i\mid p_{i}>R\}$ , $I_{L}=\{i\mid p_{i}<R\}$ and $A_{U}=\{a_{i}\mid i\in I_{U}\}$ , $A_{L}=\{a_{i}\mid i\in I_{L}\}$ , where $A_{U}$ is supposed to be a non-empty set. Then, the following holds for $RS$ .

After a sufficient number of steps, a satisfactory action $a_{i}$ with $p_{i}>R$ will always be chosen, and this state is stable.

In other words, by letting $P(A)$ be the probability that event $A$ will occur,

$P\Bigl(\operatorname*{arg\,max}_{a_{i}}RS(a_{i},s)\in A_{U}\Bigr)=1\quad(s\rightarrow\infty).$

Subsequently, by $N_{j}=\Bigl\{s\Bigm|\operatorname*{arg\,max}\limits_{a}RS(a,s)=a_{j}\Bigr\}$, we denote the set of steps in which action $a_{j}$ is chosen. Let $\#N$ be the number of elements in set $N$. First, we prove two claims.

Claim A.

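$\forall\,i\in I_{L},\quad P\bigl(\#N_{i}=\infty\Leftrightarrow RS(a_{i},s)\rightarrow-\infty\ (s\rightarrow\infty)\bigr)=1.$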

Proof.

(Claim A) ( $\Leftarrow$ ) Suppose that $i\in I_{L}$ and $RS(a_{i},s)\rightarrow-\infty\,\,\,(s\rightarrow\infty)$ . If $\#N_{i}<\infty$ , $RS(a_{i},s)$ is constant for $s$ greater than or equal to some number. This is a contradiction; hence, we have $\#N_{i}=\infty$ . ( $\Rightarrow$ ) Suppose that $i\in I_{L}$ and $\#N_{i}=\infty$ . By the law of large numbers, for any positive number $\epsilon$ , there exists some $S$ such that we have $P\bigl{(}|E(a_{i},s)-p_{i}|<(R-p_{i})/2\bigr{)}>1-\epsilon$ for any integer $s\in N_{i}$ greater than $S$ . Now, if $|E(a_{i},s)-p_{i}|<(R-p_{i})/2$ , we have

$RS(a_{i},s)<n_{i}(s)\cdot\bigl(p_{i}+\frac{R-p_{i}}{2}-R\bigr)=n_{i}(s)\cdot\frac{p_{i}-R}{2}<0.$

As $s\rightarrow\infty$ , we have $n_{i}(s)(p_{i}-R)/2\rightarrow-\infty$ ; hence, $RS(a_{i},s)\rightarrow-\infty$ . Therefore, $P\bigl{(}RS(a_{i},s)\rightarrow-\infty\bigm{|}\#N_{i}=\infty\bigr{)}>1-\epsilon$ . Since $\epsilon$ is arbitrary, we obtain $P\bigl{(}RS(a_{i},s)\rightarrow-\infty\bigm{|}\#N_{i}=\infty\bigr{)}=1$ . ∎

Claim B.

$\exists\,i\in I_{U},\quad P(\#N_{i}=\infty)=1.$

Proof.

(Claim B) We assume that for any $i\in I_{U}$ , $\#N_{i}<\infty$ . Then, for any $i\in I_{U}$ , $RS(a_{i},s)$ is constant for any $s$ greater than or equal to some number. Furthermore, for some $j\in I_{L}$ , we have $\#N_{j}=\infty$ . Hence, by Claim A, we have

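$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\bigm|\forall i\in I_{U},\ \#N_{i}<\infty\bigr)=1.$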

However, the following statements contradict each other: (i) $RS(a_{j},s)\rightarrow-\infty$ , (ii) $\forall i\in I_{U}$ , $RS(a_{i},s)=\mathrm{const.}$ for any $s$ greater than or equal to some number. Hence, we obtain

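$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\ \text{and}\ \forall i\in I_{U},\ \#N_{i}<\infty\bigr)=0.$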

Now, the following formula holds.

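$P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\ \text{and}\ \forall i\in I_{U},\ \#N_{i}<\infty\bigr)=P\bigl(\exists j\in I_{L},\ RS(a_{j},s)\rightarrow-\infty\bigm|\forall i\in I_{U},\ \#N_{i}<\infty\bigr)\,P(\forall i\in I_{U},\ \#N_{i}<\infty).$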

Therefore, we must have $P(\forall i\in I_{U},\,\,\,\#N_{i}<\infty)=0$ . ∎

Proposition 1 (again).

$P\Bigl(\operatorname*{arg\,max}_{a_{i}}RS(a_{i},s)\in A_{U}\Bigr)=1\quad(s\rightarrow\infty).$

Proof.

(Proposition 1) By Claim B, we have $\exists k\in I_{U}$, $\#N_{k}=\infty$. By the law of large numbers, for any positive number $\epsilon$, there exists some $S$ such that we have $P\bigl(|E(a_{k},s)-p_{k}|<(p_{k}-R)/2\bigr)>1-\epsilon$ for any integer $s\in N_{k}$ greater than $S$. Now, if $|E(a_{k},s)-p_{k}|<(p_{k}-R)/2$, we have

$RS(a_{k},s)>n_{k}(s)\cdot\bigl(p_{k}+\frac{R-p_{k}}{2}-R\bigr)=n_{k}(s)\cdot\frac{p_{k}-R}{2}>0.$

Hence, we have $P\bigl{(}\text{for sufficiently large }s,\,\,\,RS(a_{k},s)>0\bigr{)}>1-\epsilon$ . Since $\epsilon$ is arbitrary, we obtain $P\bigl{(}\text{for sufficiently large }s,\,\,\,\allowbreak RS(a_{k},s)>0\bigr{)}=1$ .

Here, we assume that there exists $i\in I_{L}$ such that $\#N_{i}=\infty$. Then, by Claim A, $RS(a_{i},s)\rightarrow-\infty$ with probability 1. On the other hand, because $RS(a_{k},s)>0$ for any sufficiently large $s$, $RS(a_{i},s)\rightarrow-\infty$ implies that $a_{i}$ is eventually never chosen, i.e., $\#N_{i}<\infty$. However, $\#N_{i}=\infty$ and $\#N_{i}<\infty$ contradict each other, which means that the initial assumption must be false. Hence, for any $i\in I_{L}$, $P(\#N_{i}<\infty)=1$ holds. In summary, $\exists k\in I_{U}$, $P(\#N_{k}=\infty)=1$ and $\forall i\in I_{L}$, $P(\#N_{i}<\infty)=1$. From these results, it follows immediately that $\displaystyle P\Bigl(\operatorname*{arg\,max}_{a_{i}}RS(a_{i},s)\in A_{U}\Bigr)=1\,\,\,(s\rightarrow\infty)$. ∎
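The behavior guaranteed by Proposition 1 can be observed in a small simulation. The following Python sketch runs the greedy $RS$ rule on a Bernoulli bandit; the function name `run_rs`, the reward probabilities, the seed, and the tie-breaking rule (lowest index wins) are all assumptions made for illustration, not the authors' experimental setup.

```python
import random

def run_rs(probs, R, steps, seed=0):
    """Simulate greedy RS on a Bernoulli bandit and return the pull counts."""
    rng = random.Random(seed)
    K = len(probs)
    n = [0] * K      # n_i: number of times each arm was chosen
    wins = [0] * K   # n_i^1: number of times reward 1 was obtained
    for _ in range(steps):
        # RS_i = n_i * (E_i - R) = n_i^1 - n_i * R, which is 0 for untried arms
        rs = [wins[i] - n[i] * R for i in range(K)]
        i = max(range(K), key=lambda k: rs[k])  # greedy; ties go to the lowest index
        n[i] += 1
        if rng.random() < probs[i]:
            wins[i] += 1
    return n

# Only the last arm has a reward probability above the aspiration level R = 0.65.
counts = run_rs([0.2, 0.4, 0.9], R=0.65, steps=10000, seed=0)
print(counts)
```

With these settings the unsatisfactory arms accumulate strongly negative $RS$ values after a few pulls, so almost all pulls end up on the single arm in $A_{U}$.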

Theoretical Analysis of Regret

We prove that the regret of $RS$ is upper bounded by a finite value when the level $R$ is set to the optimal aspiration level.

Proposition 2 (Finiteness of Regret of $RS$ ).

Let the highest reward probability of all the actions be $p_{1}$ and the second-highest reward probability be $p_{2}$ . Further, we set $R$ as $R=(p_{1}+p_{2})/2$ (an optimal aspiration level). Then, the following holds for $RS$ :

“There exists a monotonically increasing function $f(s)$ of the step number $s$ such that $\mathrm{regret}(s)<f(s)$ and $f(s)\rightarrow M\,\,\,(s\rightarrow\infty)$, where $M$ is a constant. Thus, $\mathrm{regret}(s)<M$.”

We conceived the following proof by referring to the papers [20, 21, 22] on the tug-of-war (TOW) dynamics model (hereinafter simply referred to as TOW). TOW is similar to $RS$ (see Supplementary Information for the similarities and differences between $RS$ and TOW). However, in those papers, the analysis of the finiteness of the regret of TOW was strictly limited to cases in which there are only two actions and the variances of the reward distributions are equal. In the case of bandit problems with rewards following Bernoulli distributions, equal variance implies $p_{1}=p_{2}$ or $p_{2}=1-p_{1}$. (Let $V_{i}$ be the variance of the reward of action $a_{i}$. $V_{1}=V_{2}\Leftrightarrow p_{1}(1-p_{1})=p_{2}(1-p_{2})\Leftrightarrow(p_{1}-p_{2})\{1-(p_{1}+p_{2})\}=0\Leftrightarrow p_{1}=p_{2}$ or $p_{2}=1-p_{1}$.) Thus, equal variance is a strong assumption. Here, we generalize the proof to show finite regret with $K$ arms ( $K\geq 2$ ) and without assuming equal variance.

Proof.

(Proposition 2) Suppose that $p_{1}>p_{2}>p_{i}\,\,\,(i\neq 1,2)$. Let $RS(a_{i},s)=n_{i}(s)\cdot\bigl(E(a_{i},s)-R\bigr)\,\,\,(i=1,2,\dotsc,K)$. The expectation $E$ and the variance $V$ of $RS(a_{i},s)$ are $E[RS(a_{i},s)]=n_{i}(s)\,(p_{i}-R)$ and $V[RS(a_{i},s)]=n_{i}(s)\sigma_{i}^{2}$, respectively, where $\sigma_{i}^{2}=p_{i}(1-p_{i})$.

Note that

$RS(a_{i},s)=n_{i}^{1}(s)-n_{i}(s)R=(X_{i,1}-R)+(X_{i,2}-R)+\dots+(X_{i,n_{i}(s)}-R)$

holds, where $X_{i,j}=1\text{\,or \,}0$ , indicating the reward when action $a_{i}$ was chosen in the $j$ -th time. Let $\Delta RS_{i}(s)=RS(a_{1},s)-RS(a_{i},s)\,\,\,(i\neq 1)$ . Then,

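$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{i})/2\}\,(n_{1}(s)+n_{i}(s))+\{(p_{1}+p_{i})/2-R\}\,(n_{1}(s)-n_{i}(s)),$

$V[\Delta RS_{i}(s)]=n_{1}(s)\sigma_{1}^{2}+n_{i}(s)\sigma_{i}^{2}.$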

Since $(p_{1}+p_{2})/2=R$ ,

$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{i})/2\}\,(n_{1}(s)+n_{i}(s))+\{(p_{i}-p_{2})/2\}\,(n_{1}(s)-n_{i}(s)).$

By Proposition 1, if the step number $s$ is sufficiently large, then $n_{1}(s)\rightarrow s$ with probability 1. (Footnote 1: This is an approximation. Also, it is not mathematically strict to fix $n_{i}(s)$ when calculating the expected value and the variance of $RS(a_{i},s)$, and to assume that the trials are independent, when applying the central limit theorem. It is possible that the calculated upper bound of the regret is not accurate due to the errors resulting from the approximation and/or the above-mentioned assumption. However, the validity of the upper bound is empirically confirmed as shown in Figs. 1 and 2.)

Hence,

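$E[\Delta RS_{i}(s)]=\{(p_{1}-p_{2})/2\}\,s,$

$V[\Delta RS_{i}(s)]\leq\sigma_{1,i}^{2}\,s,$

where $\sigma_{1,i}=\max(\sigma_{1},\sigma_{i})$.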

By Eq. (16) and the central limit theorem, $\Delta RS_{i}(s)$ follows the normal distribution with expectation $E[\Delta RS_{i}(s)]$ and variance $V[\Delta RS_{i}(s)]$ . The probability that $\Delta RS_{i}(s)<0$ is $Q(E[\Delta RS_{i}(s)]/\sqrt{V[\Delta RS_{i}(s)]})$ . Here, $Q(x)$ is the $Q$ -function, which represents the tail distribution function of the standard normal distribution. Thus, $Q(x)=(1/\sqrt{2\pi})\cdot\int_{x}^{\infty}\exp(-t^{2}/2)\,dt$ . Let $P[s=n+1,I=i]$ be the probability that action $a_{i}$ is chosen in the $(n+1)$ -th step.
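The $Q$ -function defined above has no closed form, but it can be evaluated numerically through the complementary error function, using the identity $Q(x)=\tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. The following Python sketch also compares it with the exponential bound $Q(x)\leq\tfrac{1}{2}\exp(-x^{2}/2)$ for $x\geq 0$ that the proof uses later. The function names are assumptions for illustration.

```python
import math

def q_function(x):
    """Tail probability of the standard normal: Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def chernoff_bound(x):
    """Upper bound 0.5 * exp(-x^2 / 2) on Q(x), valid for x >= 0."""
    return 0.5 * math.exp(-x * x / 2.0)

for x in (0.0, 0.5, 1.0, 2.0, 3.0):
    # The bound is tight at x = 0 and becomes looser, yet still exponentially small, as x grows.
    print(x, q_function(x), chernoff_bound(x))
```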

Then, $P[s=n+1,I=i]$ is given by

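$P[s=n+1,I=i]\leq P[\Delta RS_{i}(n)\leq 0]=Q\biggl(\frac{E[\Delta RS_{i}(n)]}{\sqrt{V[\Delta RS_{i}(n)]}}\biggr)\leq Q\biggl(\frac{(p_{1}-p_{2})\sqrt{n}}{2\sigma_{1,i}}\biggr)=Q(\phi_{i}\sqrt{n}),$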

where we set $\phi_{i}=(p_{1}-p_{2})/(2\sigma_{1,i})$ .

By using the Chernoff bound $Q(x)\leq(1/2)\exp(-x^{2}/2)$ , we evaluate the upper bound of the regret.

[TABLE]

Therefore,

[TABLE]

This concludes the proof.

∎
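The finiteness of the bound can be made concrete with a short calculation. The following is a sketch under the same approximations as the proof (in particular, that $P[s=m+1,I=i]\leq Q(\phi_{i}\sqrt{m})$ holds for every $m\geq 1$), with the very first step bounded trivially by 1. The resulting constant is an illustration and is not necessarily the one obtained in the paper.

```latex
\begin{align*}
E[n_{i}(n)] &= \sum_{t=1}^{n} P[s=t,\,I=i]
  \leq 1+\sum_{m=1}^{n-1} Q(\phi_{i}\sqrt{m})
  \leq 1+\sum_{m=1}^{n-1}\tfrac{1}{2}\exp\!\left(-\frac{\phi_{i}^{2}m}{2}\right) \\
 &< 1+\frac{1}{2}\sum_{m=1}^{\infty}\exp\!\left(-\frac{\phi_{i}^{2}m}{2}\right)
  = 1+\frac{1}{2\bigl(\exp(\phi_{i}^{2}/2)-1\bigr)}, \\
\mathrm{regret}(n) &= \sum_{i\neq 1}(p_{1}-p_{i})\,E[n_{i}(n)]
  < \sum_{i\neq 1}(p_{1}-p_{i})\left(1+\frac{1}{2\bigl(\exp(\phi_{i}^{2}/2)-1\bigr)}\right).
\end{align*}
```

The geometric series converges because $\phi_{i}>0$ whenever $p_{1}>p_{2}$, and the right-hand side does not depend on $n$. It grows as $p_{1}-p_{2}$ shrinks, since $\phi_{i}$ then becomes small.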

When the Aspiration Level $R$ is Variable

Both Propositions 1 and 2 assume that the aspiration level $R$ is constant. When $R$ is variable or stochastic, similar propositions can be established by slightly modifying the previous proofs, assuming that $R$ stays within a certain range. See Supplementary Information C for the modifications. The generalization ensures that the upper bound of the regret stays finite even when $R$ does not initially satisfy $p_{2}<R<p_{1}$ but converges into the interval $p_{2}<R<p_{1}$ after some finite number of steps.

Empirical Verification

We verify the proven properties through simulations. As in Proposition 2, $R=(p_{1}+p_{2})/2$ , where $p_{1}>p_{2}>p_{i}\,\,\,(i\neq 1,2)$ . All the results below are the averaged results of 1,000 simulations. As an additional performance index, we consider accuracy, which is the proportion of the simulations in which the algorithm chose the optimal action in each step. Thus, the accuracy in the $t$ -th step is as follows.

accuracy = (Number of times action $a_{1}$ with the highest reward probability $p_{1}$ is chosen in the $t$ -th step) / (Total number of simulations).
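The accuracy index can be computed from a table of recorded choices. In the following Python sketch, `choices[r][t]` is assumed to hold the index of the arm chosen at step `t` of simulation run `r`, with index 0 standing for the optimal action $a_{1}$; this layout and the function name are assumptions for illustration.

```python
def accuracy_per_step(choices, optimal=0):
    """Fraction of simulation runs that chose the optimal arm at each step."""
    runs = len(choices)
    steps = len(choices[0])
    return [
        sum(1 for r in range(runs) if choices[r][t] == optimal) / runs
        for t in range(steps)
    ]

# Hypothetical record of four runs over three steps.
choices = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
]
print(accuracy_per_step(choices))  # [0.75, 0.25, 1.0]
```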

First, we test whether the difference in reward probabilities can be detected, even if the difference is small, when the optimal aspiration level is set for $RS$ . We test it with $K=2$ where $(p_{1},p_{2})=(0.51,0.49),\,\,\,(0.501,0.499)$ . The result is shown in Fig. 1. The dotted line at the top in Fig. 1 (b) represents the upper bound of the regret shown by Proposition 2. We see that the accuracy nearly reaches 1 after $10^{6}$ steps, even if the difference is only 0.002 as in $(0.501,0.499)$ . Moreover, we see that the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2.

Next, we conduct simulations to confirm the propositions with $K=10$ . The reward probability of each action is generated uniformly randomly from $[0,1]$ . The result is shown in Fig. 2. We can see that the accuracy converges to 1 and the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2. Here, the calculated upper bound of the regret for $K=10$ is considerably higher than the actual regret compared with the case of $K=2$ . As we evaluate the probability of choosing action $a_{i}$ only by comparing $a_{i}$ with action $a_{1}$ having the highest reward probability as shown in Eq. (22) in the proof of Proposition 2, the probability of choosing $a_{i}$ is increasingly overestimated as the number of actions increases.

Comparison with Other Algorithms

Here, we clarify the performance and properties of $RS$ by comparing it with some representative algorithms for the $K$ -armed bandit problems, namely UCB1-Tuned and $\epsilon_{n}\text{-greedy}$ [18].

UCB1-Tuned

Upper confidence bound (UCB) is an algorithm based on the idea that the value of relatively less tried actions (more uncertain) is potentially high, similar to $RS$ ’s risk-seeking evaluation when unsatisfied [18]. The regret of UCB is guaranteed to increase in the logarithmic order, which is the theoretical limit [16]. We include the result of UCB1-Tuned (hereinafter referred to as UCB1T), which shows better performance compared to UCB1.

$\mathrm{UCB1T}_{i}=E_{i}+\sqrt{\dfrac{\ln n}{n_{i}}\min\bigl\{1/4,\;V_{i}(n_{i})\bigr\}}.$

Here, $V_{i}(n_{i})=v_{i}+\sqrt{2\ln{n}/n_{i}}$, and $v_{i}$ is the variance of the reward from choosing action $a_{i}$. Further, 1/4 is the upper bound of the variance of a random variable following the Bernoulli distribution. In the algorithm, the action with the highest UCB1T value is chosen (the greedy method). The first term $E_{i}$ of UCB1T, which is the mean reward, represents the already acquired knowledge (and its exploitation), whereas the second term, which decreases as action $a_{i}$ is tried more, expresses the (un-)reliability of $E_{i}$ (which leads to exploration). When $n_{i}=0$, the second term cannot be calculated, but in the first $K$ steps, each action is chosen once so that the value of the second term for all the actions is subsequently finite.
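The index described above can be sketched in Python as follows. This assumes the index takes the standard UCB1-Tuned form given in [18], namely the mean reward plus $\sqrt{(\ln n/n_{i})\min\{1/4,V_{i}(n_{i})\}}$, and that `var` is the empirical variance of the observed rewards (for 0/1 rewards this equals `mean * (1 - mean)`). The function name is hypothetical.

```python
import math

def ucb1_tuned_index(mean, var, n_i, n):
    """UCB1-Tuned index of one arm after n_i pulls out of n total steps (n_i >= 1)."""
    v = var + math.sqrt(2.0 * math.log(n) / n_i)   # V_i(n_i)
    bonus = math.sqrt((math.log(n) / n_i) * min(0.25, v))
    return mean + bonus

# With the same mean reward, the less tried arm receives the larger exploration bonus.
rarely_tried = ucb1_tuned_index(mean=0.5, var=0.25, n_i=10, n=100)
often_tried = ucb1_tuned_index(mean=0.5, var=0.25, n_i=50, n=100)
print(rarely_tried, often_tried)
```

The function requires `n_i >= 1`, which matches the convention in the text that each action is chosen once during the first $K$ steps.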

$\epsilon_{n}\text{-greedy}$

To set the level $R$ such that satisficing implies optimization, it is necessary to have some point in the interval between the highest and second-highest reward probabilities, usually unknown to the agent. Thus, having such “optimal” $R$ is a type of “cheating”. However, when such information is available, it should be utilized well, and $RS$ does so.

Furthermore, there is another algorithm, namely $\epsilon_{n}\text{-greedy}$ [18], which requires similar information for optimal performance. In this algorithm, the probability of random action selection, $\epsilon_{n}$ , is gradually reduced by annealing so that the regret of $\epsilon_{n}\text{-greedy}$ is guaranteed to be of the logarithmic order. It starts with maximal exploration (random action selection) and then gradually shifts to more exploitation as the information of the environment gets accumulated. In $\epsilon_{n}\text{-greedy}$ , there are two parameters $c$ and $d$ that are set as $c>0$ and $0<d<1$ . When there are $K$ arms, the stepwise decreasing sequence $\epsilon_{n}\in(0,1],\,\,\,n=1,2,\dotsc$ is defined as follows:

[TABLE]

The agent chooses action $a_{i}$ with the highest mean reward with probability $1-\epsilon_{n}$ , and it chooses a random action with probability $\epsilon_{n}$ for $n=1,2,\dotsc$ . Let $p_{\mathrm{1st}}$ be the highest reward probability, and define $\Delta_{i}=p_{\mathrm{1st}}-p_{i}$ . Then, the parameter $d$ needs to satisfy

[TABLE]

Further, $\min\Delta_{i}=p_{\mathrm{1st}}-p_{\mathrm{2nd}}$ needs to be known in advance. Thus, some information about the reward probabilities is required, as in the case of $RS$ with the optimal aspiration level. In addition, the performance of $\epsilon_{n}\text{-greedy}$ is sensitive to the value of the parameter $c>0$ , and it is difficult to find the optimal value of $c$ [18].
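The annealing schedule and the resulting action choice can be sketched as follows. The sketch assumes the schedule of Auer et al. [18], $\epsilon_{n}=\min\{1,cK/(d^{2}n)\}$ . The function names `epsilon_n` and `choose_action` are hypothetical and used only for illustration.

```python
import random

def epsilon_n(n, c, d, K):
    """Exploration probability at step n = 1, 2, ...: min(1, cK / (d^2 n))."""
    return min(1.0, (c * K) / (d * d * n))

def choose_action(means, n, c, d, rng=random):
    """Random action with probability epsilon_n, else the best empirical mean."""
    K = len(means)
    if rng.random() < epsilon_n(n, c, d, K):
        return rng.randrange(K)                     # explore
    return max(range(K), key=lambda i: means[i])    # exploit
```

With a small $c$ the schedule leaves the maximal-exploration phase early, and with a large $c$ it stays there for many steps, which illustrates why performance is sensitive to this parameter.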

On the other hand, determining the optimal aspiration level $R$ for $RS$ may be easier. It does not require a parameter like $c$ , and $(p_{\mathrm{1st}}+p_{\mathrm{2nd}})/2$ is sufficient. More generally, it is sufficient to obtain the interval $[p_{\mathrm{2nd}},p_{\mathrm{1st}}]$ or the value of any point within the interval.

Existing Satisficing Models

Here, we introduce the existing satisficing models and briefly explain the difference between those models and $RS$ . First, the framework that is the closest to ours is that of Bendor et al. on the heuristics of satisficing [12], which analyzes the two-armed bandit problems when the rewards are Bernoulli distributed. They mainly analyzed the limiting behavior of a policy model similar to $PS$ . Their model differs from $PS$ in that it has a parameter specifying the probability of switching actions when unsatisfied, so that the agent switches only with a certain probability rather than always. Therefore, the performance of their model is lower than that of $PS$ .

The most recent and comprehensive study was conducted by Reverdy et al. [13]. They decomposed satisficing into “satisfy” and “suffice” (from which the word “satisfice” is formed) and presented general problem settings that include the standard bandit problems and algorithms with optimal order. As their algorithm is an adaptation of the standard UCB [18], the difference between $RS$ and their algorithm is similar to the difference between $RS$ and UCB as described above. Furthermore, their analysis is limited to the bandit problems where the reward distributions are Gaussian. In their study, they extended the concept of regret and developed an algorithm that searches for actions that exceed the aspiration level with probability $(1-\delta)$ . They proved the finiteness of the regret for their algorithm when $\delta>0$ .

However, it should be noted that in their study, the definition of regret is changed. Specifically, the regret of their algorithm is calculated according to whether or not the expected reward exceeds the aspiration level with probability $(1-\delta)$ , and the definition that regards the regret occurring with probability $\delta$ as zero is adopted. If $\delta=0$ , their regret is calculated according to whether the expected reward always exceeds the aspiration level or not; therefore, it becomes the same framework as that of the ordinary bandit problems. In such cases, the regret of their algorithm increases in the logarithmic order, which is the theoretical limit, and it does not become finite. On the other hand, $RS$ can achieve the finite regret without changing the definition of regret. Therefore, the purposes and problem settings are different in our study and their study.

According to the above-mentioned discussion, it is difficult to compare our study with other satisficing algorithms for reinforcement learning proposed in previous studies because the purposes and frameworks are different. It is sufficient to compare our approach with $PS$ and UCB1. Accordingly, the other algorithms will not be handled directly hereafter.

Performance Comparison

We compare the performance of UCB1T, $PS$ , $\epsilon_{n}\text{-greedy}$ , and $RS$ with $K=100$ through numerical simulations. The reward probabilities are drawn uniformly at random from $[0,1]$ , and the results are averaged over 1,000 simulations. As mentioned above, it is difficult to determine the parameter $c$ of $\epsilon_{n}\text{-greedy}$ . In this simulation, the regret of $\epsilon_{n}\text{-greedy}$ at the 10,000-th step is taken as a reference. A long parameter sweep showed empirically that this regret is minimized at around $c=1\times 10^{-5}$ . Hence, the results of $c=1\times 10^{-6},1\times 10^{-5},1\times 10^{-4}$ are shown as comparison targets. We set $d$ as $d=p_{\mathrm{1st}}-p_{\mathrm{2nd}}$ . As for $RS$ and $PS$ , we set the aspiration level $R$ to an optimal level, $R=(p_{\mathrm{1st}}+p_{\mathrm{2nd}})/2$ , so that we can evaluate the efficiency when satisficing implies optimization.

The results are shown in Fig. 3. As for accuracy, $RS$ approaches 1 the fastest among these algorithms. As for regret, that of $PS$ increases rapidly because $PS$ chooses actions randomly until an action whose reward is above $R$ is found. The regret of $RS$ remains small (and finitely bounded), whereas the regrets of UCB1T and $\epsilon_{n}\text{-greedy}$ diverge at a logarithmic order. In summary, we can see that $RS$ with the optimal aspiration level $R$ shows better performance than UCB1T, $PS$ , and $\epsilon_{n}\text{-greedy}$ .

Analysis of the Expected Change in Value Functions

Here, we qualitatively consider why $RS$ with the optimal aspiration level $R$ performs better than the other algorithms. Let us consider how the value of $RS$ in the $n$ -th step changes when action $a_{i}$ is chosen in the $(n+1)$ -th step. In the following $RS$ formula,

[TABLE]

$n_{i}^{1}(n)$ is the number of times a reward of 1 is obtained in the choice of action $a_{i}$ from the first to the $n$ -th step. In the $(n+1)$ -th step, the value of $RS$ changes with probability $p_{i}$ to

[TABLE]

whereas it otherwise changes with probability $(1-p_{i})$ to

[TABLE]

Let $\Delta RS(a_{i},n)=RS(a_{i},n+1)-RS(a_{i},n)$ . Then, the expected value of the change, $E[\Delta RS(a_{i},n)]$ , is as follows:

[TABLE]

Thus, we see that the following relationships hold in any step:

[TABLE]

Let $R$ be set to an optimal level. Then, relationship 35 means that once the optimal action $a_{i}$ is chosen, $RS(a_{i})$ will keep increasing on average, and it will continue to be chosen. On the other hand, relationship 36 means that if a non-optimal action $a_{j}$ has the highest $RS$ value and continues to be chosen for a while, then its value keeps decreasing on average, while the values of the other actions remain unchanged. Therefore, at some point, an action other than $a_{j}$ will start to be chosen. Further, note that the $RS$ value of $a_{j}$ changes by $p_{j}-R<0$ on average each time $a_{j}$ is chosen. Therefore, on average, the lower the reward probability of an action, the faster the action will stop being chosen, and another action will start being chosen.
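The expected change can be checked in one line. The following is a sketch of the calculation, assuming the form $RS(a_{i},n)=n_{i}(E_{i}-R)=n_{i}^{1}-n_{i}R$ (the arguments $(n)$ of the counts are omitted for brevity):

```latex
\begin{align*}
E[\Delta RS(a_i,n)]
  &= p_i\bigl[(n_i^{1}+1)-(n_i+1)R\bigr]
   + (1-p_i)\bigl[n_i^{1}-(n_i+1)R\bigr]
   - \bigl[n_i^{1}-n_iR\bigr] \\
  &= (n_i^{1}+p_i) - (n_i+1)R - n_i^{1} + n_iR \\
  &= p_i - R .
\end{align*}
```

The result depends on neither $n$ nor $n_{i}$ , so its sign is fixed by whether $p_{i}$ lies above or below $R$ .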

To clarify the idiosyncrasies of $RS$ , we carry out similar analyses for other value functions. First, let us analyze the mean reward. The value function is $Q(a_{i},n)=n_{i}^{1}/n_{i}(=E_{i})$ . When action $a_{i}$ is chosen, $E[\Delta Q(a_{i},n)]$ is given by

[TABLE]

whereas the values for other actions do not change. Further, $E[\Delta Q(a_{i},n)]$ is positive if $p_{i}>E_{i}$ and negative if $p_{i}<E_{i}$ , and both cases may occur regardless of the reward probability $p_{i}$ because $E_{i}$ is a variable, in contrast to the constant $R$ for $RS$ . If action $a_{i}$ is chosen a sufficient number of times, $p_{i}\approx E_{i}$ holds. This leads to $E[\Delta Q(a_{i},n)]\approx 0$ , and $Q(a_{i},n)$ remains nearly unchanged. This implies that a non-highest action may continue to be chosen (the agent is trapped in a local optimum). Let us consider the simplest example where there are only two actions (with $p_{1}>p_{2}$ ), and the optimal action $a_{1}$ happens to yield few rewards at first, leading to $E_{1}<p_{2}$ and $E_{1}<E_{2}$ . As $n_{2}$ increases, $E_{2}$ converges to $E_{2}\approx p_{2}$ , and the relationship of $E_{1}<E_{2}$ becomes fixed because of $E_{1}<p_{2}$ . This leads to $a_{2}$ being chosen constantly. To avoid such local optima, $\epsilon_{n}\text{-greedy}$ prevents a non-highest action from being continuously chosen by randomly choosing actions with probability $\epsilon_{n}$ . With the mean reward, unlike $RS$ , we cannot say that the smaller the reward probability of a chosen action, the faster on average the agent switches to another action.
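For reference, the sign condition stated above can be verified directly from the definition $Q(a_{i},n)=n_{i}^{1}/n_{i}$ . A sketch of the calculation (arguments $(n)$ omitted):

```latex
\begin{align*}
E[\Delta Q(a_i,n)]
  &= p_i\,\frac{n_i^{1}+1}{n_i+1}
   + (1-p_i)\,\frac{n_i^{1}}{n_i+1}
   - \frac{n_i^{1}}{n_i} \\
  &= \frac{n_i^{1}+p_i}{n_i+1} - \frac{n_i^{1}}{n_i}
   = \frac{n_i p_i - n_i^{1}}{n_i(n_i+1)}
   = \frac{p_i - E_i}{n_i+1} .
\end{align*}
```

In contrast to $RS$ , the sign is governed by the fluctuating estimate $E_{i}$ rather than by a fixed level, and the magnitude shrinks as $n_{i}$ grows.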

Next, let us analyze UCB1, which is the simplest algorithm in the UCB family.

[TABLE]

When action $a_{i}$ is chosen, the expected change in the UCB1 value is

[TABLE]

whereas the expected change of non-chosen action $a_{j}$ is as follows:

[TABLE]

In Eq. (39), the first term is the same as that in Eq. (37). In Eq. (39), the second and third terms approach zero if action $a_{i}$ continues to be chosen. Hence, if we consider only Eq. (39), there is a possibility that the non-highest action continues to be chosen, as with Eq. (37). However, in UCB1, the value function of non-chosen action $a_{j}$ also changes, as in Eq. (40). Moreover, we can see that the value of the non-chosen action increases infinitely because of the second term of Eq. (38). As a result, a non-highest action does not continue to be chosen.

In Eq. (39), the first term is positive if $p_{i}>E_{i}$ and negative if $p_{i}<E_{i}$ , and both cases may occur regardless of the reward probability $p_{i}$ because $E_{i}$ is a variable, as it is for $Q$ above. On the other hand, the second term between the parentheses is negative if $n\geq 3$ , which results from the fact that $f(x)=(\ln x)/x$ monotonically decreases with $x>e\,(>2)$ . As a result, $E[\Delta\mathrm{UCB1}(a_{i},n)]$ may be positive or negative, regardless of the reward probability. Therefore, UCB1 does not have the property of $RS$ whereby the action with a lower reward probability will be switched from earlier.

Based on the analyses presented above, let us reconsider the form of $RS_{i}=n_{i}\delta_{i}=n_{i}(E_{i}-R)$ . Starting from the most basic value function of the mean reward, $E_{i}$ , $RS$ is formed through two operations, $(\cdot)-R$ and $n_{i}(\cdot)$ . If it is merely $\delta_{i}$ , the value function $\delta_{i}$ works exactly as the original $E_{i}$ under the greedy policy. On the other hand, if only $n_{i}(\cdot)$ is applied, the value function is $n_{i}E_{i}=n_{i}^{1}$ , and it is a special case of $RS$ with $R=0$ where any action is satisfactory. With $n_{i}E_{i}$ , the agent will continue to choose the first action that gives a reward of 1. By applying the two operations, we acquire the property of $E[\Delta RS(a_{i},n)]=p_{i}-R$ , the constant change in the $RS$ value, regardless of the step number $n$ . Therefore, the $RS$ value of an unsatisfactory action (with the reward probability below the aspiration level) constantly decreases on average; as a result, the action will cease to be chosen at some point. Furthermore, we can say that the smaller the reward probability of the action chosen once, the faster on average is the switching of the $RS$ agent to the choice of other actions. As shown above, UCB and $\epsilon_{n}\text{-greedy}$ have no such property. Therefore, this property is considered to be one of the reasons why the performance of $RS$ using the optimal aspiration level is superior to that of other basic algorithms.

Discussion

In this paper, we introduced a simple model called $RS$ that implements a satisficing strategy for the $K$ -armed bandit problems, which constitute one of the most basic classes of reinforcement learning tasks. We proved two propositions. One is that $RS$ is guaranteed to find a satisfactory action with the reward probability above the aspiration level. The other is that the regret (expected loss) of $RS$ is upper bounded by a finite value when an optimal aspiration level (where satisficing implies optimizing) is given. Then, we confirmed the results through numerical simulations and compared the performance of $RS$ with that of other representative algorithms for the $K$ -armed bandit problems. In addition, we analyzed the property of $RS$ relative to other algorithms and validated why $RS$ has its own form.

Except in Proposition 1, we assumed that we can set the aspiration level $R$ to an optimal level. As the optimal aspiration is not always available to the agent, a future research direction would be to develop an algorithm that can learn an optimal aspiration level $R$ online. As a preliminary result, an algorithm that exploits the properties of $RS$ has shown performance comparable to that of Thompson sampling [23], although it has not been theoretically guaranteed thus far [24].

There are many other advantages of $RS$ besides those mentioned in this paper. For example, the satisficing behavior is scalable in the sense that its performance does not depend on the scale of the problems, such as the number of actions, but rather on the proportion of satisfactory actions, unlike optimization algorithms [15]. In addition, as $RS$ is a simple value function without assumptions such as the family of reward probability distributions, it can be applied to other reinforcement learning tasks through some straightforward generalization. In fact, it has been shown that the generalized $RS$ can conduct autonomous and efficient searches in a robotic motion learning task in which a robot learns to perform giant swings (acrobot) [14].

One of the computational advantages of satisficing, compared to optimization, is that it can convert an optimization problem into a decision problem. With $RS$ , the guaranteed satisficing algorithm, and $R$ at a certain level, we can efficiently determine whether there is an action whose value is above $R$ . The decision framework is especially useful when a certain level of reward, rather than the optimal level, is necessary. It also facilitates parallelization. For example, we can set the aspiration levels $R_{1},R_{2},\dotsc,R_{N}$ to $N$ agents in ascending order, respectively, and make the agents execute a certain task in parallel. If the task succeeds at the level $R_{i}$ and fails at the level $R_{i+1}$ , we can see that the optimal solution exists somewhere in $[R_{i},R_{i+1}]$ , and the interval may be incrementally narrowed down. This is somewhat close to human learning for solving a task. When trying to solve a task, we usually do not randomly try and err in a purely bottom-up manner. Instead, we tend to adopt a top-down constraint in our trials, such as trying to run one mile in four minutes. Guaranteed satisficing may lead to reinforcement learning methods that solve tasks somewhat similarly to humans.
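The bracketing idea above can be sketched as follows. The callback `satisfied` is hypothetical: it stands for running a satisficing agent at aspiration level $R$ and reporting whether it found a satisfactory action. The function name `bracket_optimum` is likewise an assumption of this sketch.

```python
def bracket_optimum(levels, satisfied):
    """Return (R_i, R_next) where level R_i succeeds and R_next fails, else None."""
    levels = sorted(levels)
    # Each evaluation is independent, so the runs could be executed in parallel.
    results = [satisfied(R) for R in levels]
    for i in range(len(levels) - 1):
        if results[i] and not results[i + 1]:
            return levels[i], levels[i + 1]
    return None  # all levels succeeded or all failed: no bracket found
```

For example, if the best achievable value were 0.62, `bracket_optimum([0.2, 0.4, 0.6, 0.8], lambda R: 0.62 > R)` would return the interval between 0.6 and 0.8, which could then be subdivided with a finer set of levels.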

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Number 17H04696.

Author contributions statement

T.T. and A.T. conceived the analyses. A.T. conducted the proofs and experiments. Both authors wrote and reviewed the manuscript.

Additional information

Competing interests: The authors declare no competing interests.

Supplementary information

In this supplementary material, two distinctive aspects of $RS$ and a generalization of the two propositions in the main text are discussed. First, we show that $RS$ can be considered as a generalization of another model, $S0$ [25]. $S0$ in the bandit setting is based on the premise that high performance can be achieved through competitive evaluation of actions. However, our generalization from $S0$ to $RS$ shows that competitive evaluation appears only in the two-armed settings, and in general (in the $K$ -armed settings), what is fundamental is the risk-sensitive satisficing behavior. Second, we compare $RS$ with the Tug-of-war (TOW) dynamics models [20, 21, 22], which were referred to in the proof of Proposition 2. TOW is based on the notion of conservation of physical quantities, and it leads to competitive evaluation. We show that, under certain conditions, $RS$ has the same mathematical form as a part of the recent TOW dynamics models. In addition, $S0$ and TOW are both limited in terms of their application to the evaluation of only two actions (or two classes of actions). On the other hand, as materialized in $RS$ , the notion of risk-sensitive satisficing enables generalization (to an arbitrary number of actions), simplification, conceptual clarity, and high performance in terms of satisficing, as suggested in the main text of the paper. Third, we slightly generalize Propositions 1 and 2 and their proofs in the main text assuming that the aspiration level $R$ is within a certain range.

A. $RS$ as a generalization of $S0$

First, we show that $RS$ is a generalization of another value function $S0$ , from the number of actions $K=2$ to arbitrary $K\geq 2$ and from constant aspiration level 0.5 to variable $R\in[0,1]$ . $RS$ discussed in this paper was formerly called reference satisficing [14, 15]. It was subsequently renamed as risk-sensitive satisficing to characterize it more specifically, with the abbreviation $RS$ left unchanged. $RS$ contains the $S0$ model as a special case; $S0$ was first introduced by Shinohara et al. as a causal reasoning model [25]. The $S0$ model was later termed the $RS$ (rigidly symmetric) model [26], and was then used as a value function [27] in the bandit problems. Subsequent studies applied $S0$ in the two-armed bandit problems, and the performance of $S0$ was found to be similar to that of $LS$ [27], which is a more complicated model. An analysis of these behaviors from a satisficing perspective was first published in 2013 [28, 29]. The aspiration level for satisficing was made variable in 2011 [30]. Subsequently, in 2012 [31], a generalization of the model from two to an arbitrary number of actions was proposed. However, $LS$ is much more complicated than $RS$ , and the analysis was rather indirect. Hereafter, we show the equivalence of $RS$ and $S0$ under certain conditions (for two actions with $R=0.5$ ).

Let $A$ and $B$ be actions in a two-armed bandit problem. Let $a_{X}^{1}$ be the number of times the choice of action $X\in\{A,B\}$ has given reward 1, and let $a_{X}^{0}$ be the number of times the choice of action $X$ has given reward 0 (no reward). Thus, the mean reward is $a_{X}^{1}/(a_{X}^{1}+a_{X}^{0})$ . Here, $S0$ defines the values of actions $A$ and $B$ as follows:

[TABLE]

These comparative evaluations treat obtaining a reward from action $A$ and not obtaining a reward from action $B$ as equivalent. Hence, $S0(B)=1-S0(A)$ holds. Because the denominator is common, the comparison of the two values results in the selection of action $A$ if the following inequality holds; if the inequality does not hold, action $B$ is selected:

[TABLE]

From the above inequality, we can see that transitivity holds when a third action $C$ is added. That is, let the $S0$ evaluation of $A$ in comparison with $B$ be represented as $S0_{AB}(A)$ . If $S0_{AB}(A)<S0_{BA}(B)$ and $S0_{BC}(B)<S0_{CB}(C)$ , then $S0_{AC}(A)<S0_{CA}(C)$ . Thus, we see that the number of actions that can be compared is not necessarily limited to $K=2$ . The inequality (43) can be expressed as

[TABLE]

Using the notation presented in this paper, $a_{X}^{1}=n_{X}E_{X}$ and $a_{X}^{0}=n_{X}(1-E_{X})$ hold. Then,

[TABLE]

It can be seen that both sides of inequality (47) are identical to the form of $RS$ (equation (4) in the main article) with $R=0.5$ . Because the value of a set of arbitrary actions can be totally ordered thanks to the property of transitivity, it is only necessary to calculate the $RS$ value for each action, independently of all the other actions, and choose the action with the maximum value.
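The equivalence can also be checked numerically. The sketch below assumes that $S0(A)=(a_{A}^{1}+a_{B}^{0})/(a_{A}^{1}+a_{A}^{0}+a_{B}^{1}+a_{B}^{0})$ and symmetrically for $B$ , which matches the description above (common denominator, $S0(B)=1-S0(A)$ ). The function names are hypothetical. $RS$ is evaluated as $n(E-R)=a^{1}-nR$ so that untried actions need no special case.

```python
from itertools import product

def s0_prefers_a(a1, a0, b1, b0):
    """S0 comparison under the assumed form; the common denominator cancels."""
    return a1 + b0 > b1 + a0

def rs_value(wins, losses, R=0.5):
    """RS = n (E - R), computed as wins - n R with n = wins + losses."""
    return wins - (wins + losses) * R

def agree_everywhere(max_count=6):
    """Check that S0 and RS (R = 0.5) give the same strict preference on a grid."""
    for a1, a0, b1, b0 in product(range(max_count + 1), repeat=4):
        if s0_prefers_a(a1, a0, b1, b0) != (rs_value(a1, a0) > rs_value(b1, b0)):
            return False
    return True
```

With $R=0.5$ the $RS$ value reduces to $(a^{1}-a^{0})/2$ , which is why the two comparisons coincide on every combination of counts.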

B. Comparison of $RS$ and TOW

We referred to the TOW dynamics model [20, 21, 22] (hereafter simply referred to as TOW) in the proof of Proposition 2. Here, we compare $RS$ and TOW, and describe the relative advantages of $RS$ over TOW. There are many variations of TOW, starting from around 2010 [32]. Here, we focus on recent papers [21, 22] where the proposed model of TOW is the closest to that of $RS$ . Let $X_{k,i}$ be a random variable representing the reward obtained by the $i$ -th choice of action $k$ . The quantity corresponding to the value of action $k$ in TOW can be expressed as

[TABLE]

where $K$ is a parameter.

Let $n_{k}$ be the number of times action $k$ is chosen, and $E_{k}$ be the average reward obtained by choosing action $k$ , such that $E_{k}=\sum_{i=1}^{n_{k}}X_{k,i}/n_{k}$ . Although the reward distributions were assumed to be Bernoulli in the main text of the paper, the distribution does not necessarily have to be Bernoulli here. The value function $RS_{k}$ of the action $k$ of $RS$ is equivalent to the following form, as given in the proof of Proposition 2:

[TABLE]

When the parameter $K$ in (48) is interpreted as the aspiration level $R$ in equation (49), $RS$ has the same mathematical form as a part of the recent TOW dynamics models under certain conditions. Hence, the regret calculation of TOW can be applied to $RS$ as well, and the regret of $RS$ is also upper bounded, as is that of TOW. In this work, we relaxed the assumption of equal variance in the proof for TOW.

However, there exist certain differences between $RS$ and TOW. The primary difference is that they model totally different phenomena. $RS$ is modeled on how humans make decisions (satisficing), while taking into account the associated risks. Moreover, as explained in Supplementary information A, $RS$ is also a generalization of the $S0$ model in causal reasoning. On the other hand, TOW is derived from physical laws such as volume conservation. An advantage of $RS$ over TOW lies in its simplicity, clarity, and generalizability. As regards clarity, $RS$ is the product of “reliability of obtained information” and “degree of satisficing,” and the parameter $R$ is associated with “aspiration.” On the other hand, the interpretation of the parameter $K$ of TOW, which corresponds to $R$ in $RS$ , is not necessarily clear. Therefore, through straightforward generalization of these two constituent concepts, $RS$ can be applied not only to the $K$ -armed bandit problems (instead of two-armed) but also generally to reinforcement learning settings [14].

C. Propositions 1 and 2 When the Aspiration Level $R$ is Variable

Propositions 1 and 2 both assume that the aspiration level $R$ is constant. When $R$ is variable or stochastic, similar propositions can be established just by slightly modifying the previous proofs, assuming that $R$ is within a certain range. We show only the changes made in Proposition 3 from Proposition 1, and in Proposition 4 from Proposition 2. In the proofs below, the symbols are the same as in Propositions 1 and 2 except for the ones specified below. Let the minimum and the maximum of the variable aspiration level $R$ be $R_{\min}$ and $R_{\max}$ , respectively.

Proposition 3 (Modified Proposition 1 for Variable Aspiration Level $R$ ).

We assume that both $A_{U}$ and $A_{L}$ are invariant even if the aspiration level $R$ changes temporally or stochastically. More specifically, we assume that $p_{l}<R_{\min}\leq R\leq R_{\max}<p_{u}$ holds, where $p_{l}$ and $p_{u}$ are the maximum of the reward probabilities in $A_{L}$ and the minimum of the reward probabilities in $A_{U}$ , respectively. Under this assumption, Proposition 1 holds as it is.

Proof.

The proof of Claim A for Proposition 1 needs to be changed as follows in the part where the law of large numbers is used. For any positive number $\epsilon$ , there exists some $S$ such that we have $P\bigl{(}|E(a_{i},s)-p_{i}|<(R_{\min}-p_{i})/2\bigr{)}>1-\epsilon$ for any integer $s\in N_{i}$ greater than $S$ . Now, if $|E(a_{i},s)-p_{i}|<(R_{\min}-p_{i})/2$ , we have $RS(a_{i},s)=n_{i}(s)\cdot\bigl{(}E(a_{i},s)-R\bigr{)}<n_{i}(s)\cdot\bigl{(}p_{i}+(R_{\min}-p_{i})/2-R_{\min}\bigr{)}=n_{i}(s)\cdot(p_{i}-R_{\min})/2<0.$ Hereafter the proof is the same as that of Claim A in Proposition 1. The proof of Claim B needs no change.

The proof of Proposition 1 needs to be similarly changed as follows. For any positive number $\epsilon$ , there exists some $S$ such that we have $P\bigl{(}|E(a_{k},s)-p_{k}|<(p_{k}-R_{\max})/2\bigr{)}>1-\epsilon$ for any integer $s\in N_{i}$ greater than $S$ . Now if $|E(a_{k},s)-p_{k}|<(p_{k}-R_{\max})/2$ , we have $RS(a_{k},s)=n_{k}(s)\cdot\bigl{(}E(a_{k},s)-R\bigr{)}>n_{k}(s)\cdot\bigl{(}p_{k}+(R_{\max}-p_{k})/2-R_{\max}\bigr{)}=n_{k}(s)\cdot(p_{k}-R_{\max})/2>0.$ Hereafter the proof is the same as that of Proposition 1. ∎

Proposition 4 (Modified Proposition 2 for the Variable Aspiration Level $R$ ).

We assume that the aspiration level $R$ satisfies $p_{2}<R_{\min}\leq R\leq R_{\max}<p_{1}$ , even if the aspiration level $R$ changes temporally or stochastically. Under this assumption, we can still prove that the regret is upper bounded by a finite value.

Proof.

Let $RS(a_{i},s)=n_{i}(s)\cdot\bigl{(}E(a_{i},s)-R\bigr{)}\,\,\,(i=1,2,\dotsc,K)$ . Here, we define $RS_{bd}(a_{1},s)$ as follows: $RS_{bd}(a_{1},s)=n_{1}(s)\cdot\bigl{(}E(a_{1},s)-R_{\max}\bigr{)}\leq n_{1}(s)\cdot\bigl{(}E(a_{1},s)-R\bigr{)}=RS(a_{1},s).$ Also, we define $RS_{bd}(a_{i},s)$ for $i\neq 1$ , as follows: $RS_{bd}(a_{i},s)=n_{i}(s)\cdot\bigl{(}E(a_{i},s)-R_{\min}\bigr{)}\geq n_{i}(s)\cdot\bigl{(}E(a_{i},s)-R\bigr{)}=RS(a_{i},s).$ The suffix $bd$ means using the boundary of the aspiration level $R$ . If we let $R_{i}=R_{\max}\,\,\,(i=1),R_{\min}\,\,\,(i\neq 1)$ , then, we have $RS_{bd}(a_{i},s)=n_{i}(s)\cdot\bigl{(}E(a_{i},s)-R_{i}\bigr{)}\,\,\,(i=1,2,\dotsc,K)$ .

The expectation $E$ and the variance $V$ of $RS_{bd}(a_{i},s)$ are $E[RS_{bd}(a_{i},s)]=n_{i}(s)(p_{i}-R_{i})$ and $V[RS_{bd}(a_{i},s)]=n_{i}(s)\sigma_{i}^{2}$ , respectively, where $\sigma_{i}^{2}=p_{i}(1-p_{i})$ . Let $\Delta RS_{i}(s)=RS(a_{1},s)-RS(a_{i},s)\,\,\,(i\neq 1)$ , and $\Delta RS_{bd,i}(s)=RS_{bd}(a_{1},s)-RS_{bd}(a_{i},s)\,\,\,(i\neq 1)$ . Note that $\Delta RS_{i}(s)\geq\Delta RS_{bd,i}(s)$ holds because $RS(a_{1},s)\geq RS_{bd}(a_{1},s)$ and $RS(a_{i},s)\leq RS_{bd}(a_{i},s)\,\,\,(i\neq 1)$ .

$E[\Delta RS_{bd,i}(s)]$ , which is the expectation of $\Delta RS_{bd,i}(s)$ , is evaluated as follows:

[TABLE]

By Proposition 3, if the step number $s$ is sufficiently large, then $n_{1}(s)\rightarrow s$ with probability 1 (the same approximation as in the proof of Proposition 2, hence the same note applies). Hence, $E[\Delta RS_{bd,i}(s)]\geq s\cdot\min\bigl{(}(p_{1}-R_{\max}),(R_{\min}-p_{2})\bigr{)}$ . Also, the variance $V[\Delta RS_{bd,i}(s)]$ of $\Delta RS_{bd,i}(s)$ is evaluated as follows: $V[\Delta RS_{bd,i}(s)]\leq(n_{1}(s)+n_{i}(s))\sigma_{1,i}^{2}\leq s\sigma_{1,i}^{2}$ , where $\sigma_{1,i}=\max(\sigma_{1},\sigma_{i})$ .

By the central limit theorem, $\Delta RS_{bd,i}(s)$ follows the normal distribution with expectation $E[\Delta RS_{bd,i}(s)]$ and variance $V[\Delta RS_{bd,i}(s)]$ . The probability that $\Delta RS_{bd,i}(s)\leq 0$ is $Q(E[\Delta RS_{bd,i}(s)]/\sqrt{V[\Delta RS_{bd,i}(s)]})$ . Then, $P[s=n+1;I=i]$ , which is the probability that action $a_{i}$ is chosen in the $(n+1)$ -th step, is given by

[TABLE]

where we set $\phi_{i}=\min\bigl{(}(p_{1}-R_{\max}),(R_{\min}-p_{2})\bigr{)}/(\sigma_{1,i})$ . Hereafter the proof is the same as that of Proposition 2. As a result, the upper bound of regret is obtained by replacing $\phi_{i}$ in Eq. (27) with $\phi_{i}$ set in Eq. (51). ∎

Bibliography


[1] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533, DOI: 10.1038/nature14236 (2015).
[2] Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489, DOI: 10.1038/nature16961 (2016).
[3] Muse, D. & Wermter, S. Actor-critic learning for platform-independent robot navigation. Cognitive Computation 1, 203–220, DOI: 10.1007/s12559-009-9021-z (2009).
[4] Zhao, F., Zeng, Y., Wang, G., Bai, J. & Xu, B. A brain-inspired decision making model based on top-down biasing of prefrontal cortex to basal ganglia and its application in autonomous uav explorations. Cognitive Computation 10, 296–306, DOI: 10.1007/s12559-017-9511-3 (2018).
[5] Simon, H. A. Models of Man: Social and Rational (John Wiley and Sons, Inc., New York, 1957).
[6] Lewis, R. L., Howes, A. & Singh, S. Computational rationality: Linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science 6, 279–311, DOI: 10.1111/tops.12086 (2014).
[7] Gershman, S. J., Horvitz, E. J. & Tenenbaum, J. B. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349, 273–278, DOI: 10.1126/science.aac6076 (2015).
[8] Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction. Science 331, 1279–1285, DOI: 10.1126/science.1192788 (2011).
