Modeling and Optimization of Human-machine Interaction Processes via the Maximum Entropy Principle

Jiaxiao Zheng; Gustavo de Veciana

arXiv:1903.07157·cs.LG·March 19, 2019

TL;DR

This paper introduces a maximum entropy-based framework and the AREA algorithm for modeling and optimizing human-machine interactions, addressing challenges like human behavior complexity and bias in data-driven inference.

Contribution

It presents a novel iterative approach combining behavior estimation and policy optimization using maximum entropy principles, with theoretical analysis and initial validation.

Findings

1. AREA algorithm converges under specified conditions
2. Effective modeling of complex human decision-making behaviors
3. Initial validation shows promising results on synthetic data

Table 1: Table of notations

Description	Notation
Sequence of human actions	$H^{t} = \{H_{1}, H_{2}, \dots, H_{t}\}, 0 \leq t \leq T$
Specific realization of human action sequence	$h^{t} = \{h_{1}, h_{2}, \dots, h_{t}\}, 0 \leq t \leq T$
Sequence of machine actions	$M^{t} = \{M_{1}, M_{2}, \dots, M_{t}\}, 0 \leq t \leq T$
Specific realization of machine action sequence	$m^{t} = \{m_{1}, m_{2}, \dots, m_{t}\}, 0 \leq t \leq T$
Joint PMF of $M^{T}, H^{T}$	$p_{H^{T}, M^{T}}(h^{T}, m^{T})$
Causally conditional distribution of human actions given machine actions	$p_{H^{T}\|M^{T}}(h^{T}\|m^{T})$ or $P(h^{T}\|m^{T})$
Causally conditional distribution of machine actions given human actions	$p_{M^{T}\|H^{T}}(m^{T}\|h^{T})$ or $Q(m^{T}\|h^{T})$
Joint PMF when the human model is $P(h^{T}\|m^{T})$ and machine model is $Q(m^{T}\|h^{T})$	$PQ(h^{T}, m^{T})$
Probability of event $A$ when the human model is $P(h^{T}\|m^{T})$ and machine model is $Q(m^{T}\|h^{T})$	$PQ(A)$
Expectation of function of interactions $f(H^{T}, M^{T})$ w.r.t. the joint PMF given by $PQ(h^{T}, m^{T})$	$E_{PQ}[f(H^{T}, M^{T})]$
The actual human behavior model	$P^{*}(h^{T}\|m^{T})$ or $P^{*}$
The estimated human behavior model if the machine model is $Q$	$\hat{P}(h^{T}\|m^{T}) = h(Q)$ or $\hat{P} = h(Q)$
The machine model if the estimated human model is $P$	$\hat{Q}(m^{T}\|h^{T}) = m(P)$ or $\hat{Q} = m(P)$

Equations

$$P(h^{T}\|m^{T}) := \prod_{t=1}^{T} P(h_{t}|h^{t-1},m^{t}) \quad\text{and}\quad Q(m^{T}\|h^{T}) := \prod_{t=1}^{T} Q(m_{t}|h^{t-1},m^{t-1}),$$

$$\mathcal{P}_{\mathcal{F},\mathcal{G}}^{Q} =$$

$$H_{PQ}(H^{T}\|M^{T}) := E_{PQ}\left[-\log P(H^{T}\|M^{T})\right] = \sum_{t=1}^{T} H_{PQ}(H_{t}|H^{t-1},M^{t}).$$

$$\max_{P(h^{T}\|m^{T})} \left\{ H_{PQ}(H^{T}\|M^{T}) \,\middle|\, P(h^{T}\|m^{T}) \in \mathcal{P}_{\mathcal{F},\mathcal{G}}^{Q} \right\}.$$

$$\max_{Q(m^{T}\|h^{T})} E_{\hat{P}Q}\left[r(H^{T},M^{T})\right].$$

$$\max_{Q(m^{T}\|h^{T})}$$

$$\min_{\boldsymbol{\lambda}=(\boldsymbol{\lambda}_{f},\boldsymbol{\lambda}_{g}),\ \boldsymbol{\lambda}_{g}\leq 0}\ \sum_{m_{1}} Q(m_{1}) \log Z_{\boldsymbol{\lambda}}(m_{1}) - \boldsymbol{\lambda}_{f}^{T}\mathbf{c}_{f} - \boldsymbol{\lambda}_{g}^{T}\mathbf{c}_{g}$$

$$Z_{\boldsymbol{\lambda}}(h^{t},m^{t+1}) = \sum_{h_{t+1}} Z_{\boldsymbol{\lambda}}(h_{t+1}|h^{t},m^{t+1}), \quad Z_{\boldsymbol{\lambda}}(m_{1}) = \sum_{h_{1}} Z_{\boldsymbol{\lambda}}(h_{1}|m_{1})$$

$$Z_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t}) = \begin{cases} e^{\sum_{m_{t+1}} Q(m_{t+1}|h^{t},m^{t}) \log Z_{\boldsymbol{\lambda}}(h^{t},m^{t+1})} & t<T \\ e^{\boldsymbol{\lambda}_{f}^{T}\mathbf{f}_{1}(h^{T},m^{T}) + \boldsymbol{\lambda}_{g}^{T}\mathbf{f}_{2}(h^{T},m^{T})} & t=T \end{cases},$$

$$\max_{\boldsymbol{\lambda}} E_{P^{*}Q}\left[\log P_{\boldsymbol{\lambda}}(H^{T}\|M^{T})\right]$$

$$\inf_{P(h^{T}\|m^{T})}\ \sup_{P^{*}(h^{T}\|m^{T})}$$

$$E_{PQ}\left[f(H^{T},M^{T})\right] = E_{P^{*}Q}\left[f(H^{T},M^{T})\right]$$

$$Y_{\gamma}(m_{t}|h^{t-1},m^{t-1}) = \begin{cases} e^{\sum_{h_{t}} \hat{P}(h_{t}|h^{t-1},m^{t}) \log Y_{\gamma}(h^{t},m^{t})} & t<T \\ e^{\gamma \sum_{h_{T}} \hat{P}(h_{T}|h^{T-1},m^{T})\, r(h^{T},m^{T})} & t=T \end{cases},$$

$$\mathcal{F}_{p}^{i} = \left\{ f^{i,t}(h^{T},m^{T}) \,\middle|\, f^{i,t}(h^{T},m^{T}) = 1_{\{(h^{t},m^{t}) = (\bar{h}^{i,t},\bar{m}^{i,t})\}},\ \text{for } t=1,\dots,T \right\},$$

$$\boldsymbol{\lambda}_{f}^{(n+1)} \leftarrow \boldsymbol{\lambda}_{f}^{(n)} - \eta^{(n)} \left( E_{P_{\boldsymbol{\lambda}}Q}\left[\mathbf{f}(H^{T},M^{T})\right] - \mathbf{c}_{f} \right),$$

$$\boldsymbol{\lambda}_{g}^{(n+1)} \leftarrow \max\left\{ 0,\ \boldsymbol{\lambda}_{g}^{(n)} - \eta^{(n)} \left( E_{P_{\boldsymbol{\lambda}}Q}\left[\mathbf{g}(H^{T},M^{T})\right] - \mathbf{c}_{g} \right) \right\},$$

$$\Lambda(Q,\beta) =$$

$$- \sum_{\substack{1\leq t\leq T \\ h^{t-1}\in\mathcal{H}^{t-1} \\ m^{t-1}\in\mathcal{M}^{t-1}}} \beta(h^{t-1},m^{t-1}) \left( 1 - \sum_{m_{t}} Q(m_{t}|h^{t-1},m^{t-1}) \right),$$

$$\nabla_{Q(m_{t}|h^{t-1},m^{t-1})} \Lambda(Q,\beta) =$$

$$H_{\hat{P}Q}(M^{T}\|H^{T} \mid h^{t-1},m^{t}) := E_{\hat{P}Q}\left[ -\log Q(M^{T}\|H^{T}) \,\middle|\, H^{t-1}=h^{t-1}, M^{t-1}=m^{t-1} \right].$$

$$Z_{\boldsymbol{\lambda}}(h_{t}|m_{t}) = \begin{cases} e^{(\boldsymbol{\lambda}_{f})^{T}\mathbf{f}_{t}(h_{t},m_{t}) + (\boldsymbol{\lambda}_{g})^{T}\mathbf{g}_{t}(h_{t},m_{t}) + \sum_{m_{t+1}} Q(m_{t+1}|h_{t},m_{t}) \log Z_{\boldsymbol{\lambda}}(m_{t+1})} & t<T \\ e^{(\boldsymbol{\lambda}_{f})^{T}\mathbf{f}_{T}(h_{T},m_{T}) + (\boldsymbol{\lambda}_{g})^{T}\mathbf{g}_{T}(h_{T},m_{T})} & t=T \end{cases},$$

$$Z_{\boldsymbol{\lambda}}(m_{t}) = \sum_{h_{t}} Z_{\boldsymbol{\lambda}}(h_{t}|m_{t}), \quad\text{and}\quad P_{\boldsymbol{\lambda}}(h_{t}|m_{t}) = \frac{Z_{\boldsymbol{\lambda}}(h_{t}|m_{t})}{Z_{\boldsymbol{\lambda}}(m_{t})}.$$

$$Z_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t}) =$$

$$\times\, e^{(\boldsymbol{\lambda}_{f})^{T}\mathbf{f}_{t}(h_{t},m_{t}) + (\boldsymbol{\lambda}_{g})^{T}\mathbf{g}_{t}(h_{t},m_{t}) + \sum_{m_{t+1}} Q(m_{t+1}|h_{t},m_{t}) \log Z_{\boldsymbol{\lambda}}(m_{t+1})}$$

$$P_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t})$$

$$Z_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t})$$

$$E_{P_{\boldsymbol{\lambda}}Q}\left[f^{i}(H^{T},M^{T})\right] = \sum_{t=1}^{T} E_{P_{\boldsymbol{\lambda}}Q}\left[f_{t}^{i}(H_{t},M_{t})\right].$$

$$E_{P_{\boldsymbol{\lambda}}Q}\left[f_{t}^{i}(H_{t},M_{t})\right] = \sum_{m_{t}\in\mathcal{M}} P_{\boldsymbol{\lambda}}Q(m_{t}) \sum_{h_{t}\in\mathcal{H}} P_{\boldsymbol{\lambda}}(h_{t}|m_{t})\, f_{t}(h_{t},m_{t}).$$

$$P_{\boldsymbol{\lambda}}Q(m_{t}) = \sum_{m_{t-1}\in\mathcal{M}} P_{\boldsymbol{\lambda}}Q(m_{t-1}) \sum_{h_{t-1}\in\mathcal{H}} P_{\boldsymbol{\lambda}}(h_{t-1}|m_{t-1})\, Q(m_{t}|h_{t-1},m_{t-1}).$$

$$f^{i}(h^{T},m^{T}) = c_{i} 1_{\{(h^{T},m^{T}) = (\bar{h}^{i,T},\bar{m}^{i,T})\}},\ i\in\mathcal{F}_{p}, \quad\text{and}\quad g^{i}(h^{T},m^{T}) = c_{i} 1_{\{(h^{T},m^{T}) = (\bar{h}^{i,T},\bar{m}^{i,T})\}},\ i\in\mathcal{G}_{p},$$

$$f^{i}(h^{T},m^{T}) = \sum_{t=1}^{T} f_{t}^{i}(h_{t},m_{t}),\ i\in\mathcal{F}_{d}, \quad\text{and}\quad g^{i}(h^{T},m^{T}) = \sum_{t=1}^{T} g_{t}^{i}(h_{t},m_{t}),\ i\in\mathcal{G}_{d}.$$

$$r(h^{T},m^{T}) = \sum_{i\in\mathcal{R}_{p}} r^{i,p}(h^{T},m^{T}) + \sum_{t=1}^{T} r_{t}^{d}(h_{t},m_{t}).$$

Full text

Modeling and Optimization of Human-Machine Interaction Processes via the Maximum Entropy Principle

Jiaxiao Zheng, Gustavo de Veciana

Department of Electrical and Computer Engineering

The University of Texas at Austin

Austin, TX 78712

[email protected]

Abstract

We propose a data-driven framework to enable the modeling and optimization of human-machine interaction processes, e.g., systems aimed at assisting humans in decision-making or learning, work-load allocation, and interactive advertising. This is a challenging problem for several reasons. First, humans’ behavior is hard to model or infer, as it may reflect biases, long term memory, and sensitivity to sequencing, i.e., transience and exponential complexity in the length of the interaction. Second, due to the interactive nature of such processes, the machine policy used to engage with a human may bias possible data-driven inferences. Finally, in choosing machine policies that optimize interaction rewards, one must, on the one hand, avoid being overly sensitive to error/variability in the estimated human model, and on the other, avoid being overly deterministic/predictable, which may result in poor human ‘engagement’ in the interaction. To meet these challenges, we propose a robust approach, based on the maximum entropy principle, which iteratively estimates human behavior and optimizes the machine policy: the Alternating Entropy-Reward Ascent (AREA) algorithm. We characterize AREA in terms of its space and time complexity and convergence. We also provide an initial validation based on synthetic data generated by an established noisy nonlinear model for human decision-making.

1 Introduction

Computing and information systems are increasingly prevalent in our daily lives, giving rise to a variety of human-in-the-loop systems. Many such systems are interactive in the sense that humans and machines take decisions/actions in response to each other, forming a sequence driven by unknown dynamics associated with human behavior. For instance, one can view web searches as an interactive process, where humans’ search history, attention, and eventual decisions reflect an interaction with the machine’s sequencing, placement and timing of advertisements. The industry refers to such interactive processes as ‘convergence paths’ and is increasingly interested in optimizing their outcomes HCE16. Such problems are usually studied in the context of Markov decision processes (MDPs) and their variants, see, e.g., Put94; Ber00; BaR11. However, the actual problem associated with interactive processes presents several challenges which remain unsolved, including the following.

Complexity of inferring interactive human behavior. In this paper we will focus on structured human-machine interactions where one has modeled both human and machine behaviour/choices over time, and the setting arises repeatedly, either with the same person or across a large population. The outcomes of such interactions can depend subtly on the history, so one can expect exponential complexity to be a challenge unless the underlying processes have a ‘nice’ structure. Such an assumption is essential for widely studied problems including MDPs Put94; Ber00; BaR11, reinforcement learning KLM96; SB98, and the multi-armed bandit problem BuC12, where human decision-making processes are assumed to be independent across time or to have a one-step Markov property. However, those assumptions are questionable according to studies on human cognition, see, e.g., MuB62.

One recent work considering long-term dependency is deep Q-learning VKD15, where the authors use a complex neural network to capture the potential value of a state-action pair, and the state may incorporate complex historical information. However, because most interactive processes are transient, as both human and machine accumulate a history of decisions over time, one might expect the data requirements of deep Q-learning to be quite high.

Biases in collecting data in interaction processes. Inferring a model for human behavior in the context of a human-machine interaction process is also challenging because, to collect data, one must choose a machine policy to ‘interact’ with humans. This may in turn lead to ‘biased’ inferences of human behavior. In particular, a machine policy that focuses on ‘rewarding’ actions may preclude exploration/observation of other interaction modalities. Similar considerations were explored for partially observable MDPs MeR05 to improve the efficiency of the solution, and for the multi-armed bandit problem BuC12; AHK14 to achieve a better exploration-exploitation tradeoff. Data-driven models and inferences should also respect the causal nature of human-machine interactions, yet how to model and promote randomness in a causal model remains open.

Robustness and exploration in optimizing machine interactions. A general data-driven framework for modeling and optimizing human-machine interaction processes can be viewed as involving two concerns: on the one hand, engaging humans in interaction to collect data from which to infer models of human behavior, and on the other, using those models to choose machine policies that optimize interaction ‘rewards,’ i.e., the effectiveness of the sequence of machine actions in nudging humans towards desirable outcomes. To that end, it is desirable to choose machine policies which are not overly sensitive to sampling noise in data collection and/or variability in human behavior. Also of interest are policies that are not overly deterministic/predictable, as in some settings such policies may be poor at keeping humans ‘engaged’, see, e.g., EpN15, and poor at eliciting rich human-machine data sets.

Contributions. In this paper, we propose a data-driven framework to jointly solve the estimation and optimization problems associated with human-machine interaction processes. We adopt an inference technique based on a constrained maximum entropy principle for interactive processes, see ZBD11; ZBD13. This allows one to incorporate prior knowledge of the (possibly) relevant features of human behavior, via moment constraints associated with interaction functions. We consider optimizing machine policies based on an interaction reward function with an entropy-based regularization term. This aims to find machine policies which maximize rewards, are robust to estimation noise, and maintain a degree of exploration when interacting with humans. Our proposed Alternating Reward-Entropy Ascent (AREA) algorithm alternates between data collection, estimation of human behavior, and optimization of the machine policy, with a view to reaching a consistent fixed point. We provide a characterization of various properties of AREA. In particular, for decomposable and/or path-based feature and reward functions, we devise a computationally efficient approach to the estimation and optimization steps. The approach takes advantage of defining a stopping time over the interaction and of the conditional Markov property of the estimated human model to significantly reduce space and time complexity. We provide a theoretical characterization of the AREA algorithm in terms of its convergence, along with simple preliminary evaluation results based on synthetic data obtained from a noisy nonlinear model for human decision-making. All proofs are included in the appendix.

2 Problem Formation

We shall consider a structured discrete time human-machine interaction process over a period of time $1,2,\dots,T$, which can be viewed as a pair of sequences of random variables: $(H_{1},\dots,H_{T})$, corresponding to human actions/responses if any, and $(M_{1},\dots,M_{T})$, denoting those of the machine. We shall assume the random variables $H_{t},M_{t}$ capture discrete human and machine actions at time $t$, and, without loss of generality, that for all $t$, $H_{t}\in\mathcal{H}$, where $\mathcal{H}$ denotes the human’s action space, and $M_{t}\in\mathcal{M}$, where $\mathcal{M}$ corresponds to the machine’s action space. (Both the human and the machine could choose to do nothing in their turn; this can be captured by including a null action in both $\mathcal{M}$ and $\mathcal{H}$.) Throughout this paper we assume that $|\mathcal{H}|$ and $|\mathcal{M}|$ are finite. To simplify notation, we let $H^{t}=(H_{1},\ldots,H_{t})$ for $t=1,\ldots,T$, and similarly define $M^{t}$. When $t=0$, $H^{t}$ and $M^{t}$ contain no elements. We assume that the human and the machine take turns, such that the machine’s action at time $t+1$, i.e., $M_{t+1}$, depends on $H^{t},M^{t}$, while that of the human at time $t+1$, i.e., $H_{t+1}$, depends on $H^{t},M^{t+1}$. The joint distribution of $(H^{T},M^{T})$ captures the interplay between the human and machine. Depending on the setting, the human refers to a particular individual or a population, where the behavior can be captured via a stable distributional model.

We shall assume that when a human and machine interact, the machine’s policy is known and captured by a collection of conditional distributions $Q$, for succinctness denoted by $Q(m_{t}|h^{t-1},m^{t-1}):=p_{M_{t}|H^{t-1},M^{t-1}}(m_{t}|h^{t-1},m^{t-1})\quad\text{for}\quad t=1,\ldots,T.$ Similarly, human behavior is denoted by conditional distributions $P$ given by $P(h_{t}|h^{t-1},m^{t}):=p_{H_{t}|H^{t-1},M^{t}}(h_{t}|h^{t-1},m^{t})\quad\text{for}\quad t=1,\ldots,T.$ It is easy to show that the joint distribution of $(H^{T},M^{T})$, denoted by $PQ$, resulting from a human model $P$ interacting with a machine policy $Q$, can be decomposed as $PQ(h^{T},m^{T})=P(h^{T}\|m^{T})Q(m^{T}\|h^{T}),$ where

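$$P(h^{T}\|m^{T}) := \prod_{t=1}^{T} P(h_{t}|h^{t-1},m^{t}) \quad\text{and}\quad Q(m^{T}\|h^{T}) := \prod_{t=1}^{T} Q(m_{t}|h^{t-1},m^{t-1}), \qquad (1)$$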

correspond to the causally conditioned distributions of the human and the machine, i.e., products of sequentially conditioned distributions. We will assume that data of human-machine interactions can be collected by fixing a machine policy, and keeping track of the realizations of such interactions.
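The factorization $PQ(h^{T},m^{T})=P(h^{T}\|m^{T})Q(m^{T}\|h^{T})$ can be checked numerically on a toy example. The following is a minimal sketch, not the authors' code: the callable interfaces `P(h_t, h_prev, m_upto_t)` and `Q(m_t, h_prev, m_prev)`, the binary action spaces, and the two example models are all assumptions made for illustration.

```python
from itertools import product

def joint_pmf(P, Q, H, M, T):
    """Return {(h_seq, m_seq): PQ(h^T, m^T)} by multiplying the per-step factors.

    P(h_t, h_prev, m_upto_t) and Q(m_t, h_prev, m_prev) are hypothetical
    interfaces for the sequentially conditioned distributions.
    """
    pmf = {}
    for h_seq in product(H, repeat=T):
        for m_seq in product(M, repeat=T):
            prob = 1.0
            for t in range(T):
                # The machine moves first and sees (h^{t-1}, m^{t-1}).
                prob *= Q(m_seq[t], h_seq[:t], m_seq[:t])
                # The human responds and sees (h^{t-1}, m^{t}).
                prob *= P(h_seq[t], h_seq[:t], m_seq[:t + 1])
            pmf[(h_seq, m_seq)] = prob
    return pmf

H = (0, 1)  # toy human action space (assumption)
M = (0, 1)  # toy machine action space (assumption)

def uniform_machine(m_t, h_prev, m_prev):
    return 0.5

def copycat_human(h_t, h_prev, m_upto_t):
    # Repeats the latest machine action with probability 0.8.
    return 0.8 if h_t == m_upto_t[-1] else 0.2

pmf = joint_pmf(copycat_human, uniform_machine, H, M, T=2)
total = sum(pmf.values())
```

Because each factor is a proper conditional distribution, the products sum to one over all $(h^{T},m^{T})$ pairs. Full enumeration is exponential in $T$, so this is only practical for very short interactions.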

We let $P^{*}(h^{T}\|m^{T})$ denote the true human behavior and $\hat{P}(h^{T}\|m^{T})$ an estimated model thereof. We let $PQ(A)$ denote the probability of an event $A$ measurable w.r.t. $(H^{T},M^{T})$, and we let $E_{PQ}[f(H^{T},M^{T})]$ denote the expectation of a function $f(h^{T},m^{T}):{\cal H}^{T}\times{\cal M}^{T}\rightarrow\mathbb{R}$ under the joint distribution $PQ$. When we collect interaction data of the human with a machine policy $Q(m^{T}\|h^{T})$, we denote the expected value under the associated empirical distribution by $\hat{E}_{P^{*}Q}$, where in the ideal case (no noise) we have $\hat{E}_{P^{*}Q}[{f}(H^{T},M^{T})]=E_{P^{*}Q}[{f}(H^{T},M^{T})]$. This notation is summarized in Table 1.

2.1 Data-driven human model estimation

A brute force approach to modeling human behaviour would be to directly estimate the conditional distributions $\{P^{*}(h_{t}|h^{t-1},m^{t}),t=1,\ldots,T\}$ from the collected data, which is clearly not scalable. Instead, in this paper we embrace the extension of constrained maximum entropy estimation to interactive processes developed in ZBD13; WZB13.

In this setting, one defines a set of feature functions, ideally known to capture relevant characteristics of human behavior, which become equality and inequality constraints in the estimation process. The choice of such features would be motivated by known frameworks for understanding human behavior in dynamic environments, e.g., effort-accuracy PBJ93, exploration-exploitation MaM10, soft constraints GrF04, and the specific character of the human-machine interaction. The equality constraints are based on matching the moments of a set of feature functions $\mathcal{F}$, denoted by $\mathbf{f}(h^{T},m^{T}):=\{f^{i}(h^{T},m^{T}),\ i\in\mathcal{F}\}$, to their moments under the empirical distribution obtained when interacting with a given machine policy $Q$, which are denoted by $\mathbf{c}_{f}:=\hat{E}_{P^{*}Q}[\mathbf{f}(H^{T},M^{T})]$. Below we will neglect sampling errors by assuming that $\hat{E}_{P^{*}Q}[\mathbf{f}(H^{T},M^{T})]=E_{P^{*}Q}[\mathbf{f}(H^{T},M^{T})].$ The set of inequality constraints is denoted by $\mathbf{g}(h^{T},m^{T}):=\{g^{i}(h^{T},m^{T}),\ i\in\mathcal{G}\}$, where $\mathcal{G}$ is another set of feature functions whose moments are constrained not to exceed pre-specified thresholds $\mathbf{c}_{g}=\{c_{g}^{i},\ i\in\mathcal{G}\}$.
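As an illustration of how the empirical moments $\mathbf{c}_{f}$ could be computed from logged interactions, here is a minimal sketch. The trajectory format (a list of `(h_seq, m_seq)` pairs) and both feature functions are hypothetical choices for this example, not features proposed in the paper.

```python
def empirical_moments(trajectories, features):
    """Average each feature over the observed (h_seq, m_seq) realizations."""
    n = len(trajectories)
    return [sum(f(h, m) for h, m in trajectories) / n for f in features]

# Hypothetical feature: number of steps where the human action matches
# the machine action.
def agree_count(h_seq, m_seq):
    return sum(1 for h, m in zip(h_seq, m_seq) if h == m)

# Hypothetical feature: indicator that the final human action is 1.
def ends_with_one(h_seq, m_seq):
    return 1.0 if h_seq[-1] == 1 else 0.0

# Three made-up interaction records of length T = 2.
data = [((1, 1), (1, 0)), ((0, 1), (0, 1)), ((0, 0), (1, 1))]
c_f = empirical_moments(data, [agree_count, ends_with_one])
```

The first feature decomposes as a sum over time steps, while the second depends only on the end of the path; both kinds can be averaged in the same way.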

Formally, for a given machine policy $Q(m^{T}\|h^{T})$ , we are interested in models for human behaviour $P(h^{T}\|m^{T})$ satisfying the following constraints

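$$\mathcal{P}_{\mathcal{F},\mathcal{G}}^{Q} = \left\{ P(h^{T}\|m^{T}) \,\middle|\, E_{PQ}\left[\mathbf{f}(H^{T},M^{T})\right] = \mathbf{c}_{f},\ E_{PQ}\left[\mathbf{g}(H^{T},M^{T})\right] \leq \mathbf{c}_{g} \right\}. \qquad (2)$$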

The maximum entropy estimation principle chooses the model for human behaviour in ${\mathcal{P}_{\mathcal{F},\mathcal{G}}^{Q}}$ with maximum entropy. In the case of interactive processes, since the machine policy $Q(m^{T}\|h^{T})$ is known, we shall maximize the entropy of the causally conditioned distribution of the human behavior model. In particular, the causally conditioned entropy of the human behaviour model ${P}(h^{T}\|m^{T})$, when the machine policy in use is $Q(m^{T}\|h^{T})$, is given by

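$$H_{PQ}(H^{T}\|M^{T}) := E_{PQ}\left[-\log P(H^{T}\|M^{T})\right] = \sum_{t=1}^{T} H_{PQ}(H_{t}|H^{t-1},M^{t}). \qquad (3)$$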
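The causally conditioned entropy $E_{PQ}[-\log P(H^{T}\|M^{T})]$ can be evaluated by brute-force enumeration for small problems. The sketch below assumes the same hypothetical callable interfaces `P(h_t, h_prev, m_upto_t)` and `Q(m_t, h_prev, m_prev)` for the per-step conditionals and reports the entropy in nats; the three example models are illustrative only.

```python
import math
from itertools import product

def causal_entropy(P, Q, H, M, T):
    """Compute E_{PQ}[-log P(H^T || M^T)] in nats by enumerating all paths."""
    total = 0.0
    for h_seq in product(H, repeat=T):
        for m_seq in product(M, repeat=T):
            p_causal = 1.0  # P(h^T || m^T)
            q_causal = 1.0  # Q(m^T || h^T)
            for t in range(T):
                q_causal *= Q(m_seq[t], h_seq[:t], m_seq[:t])
                p_causal *= P(h_seq[t], h_seq[:t], m_seq[:t + 1])
            # Zero-probability paths contribute nothing to the expectation.
            if p_causal > 0.0 and q_causal > 0.0:
                total -= p_causal * q_causal * math.log(p_causal)
    return total

def uniform_machine(m_t, h_prev, m_prev):
    return 0.5

def uniform_human(h_t, h_prev, m_upto_t):
    return 0.5

def mirroring_human(h_t, h_prev, m_upto_t):
    # Deterministically copies the latest machine action.
    return 1.0 if h_t == m_upto_t[-1] else 0.0

ent_uniform = causal_entropy(uniform_human, uniform_machine, (0, 1), (0, 1), 3)
ent_mirror = causal_entropy(mirroring_human, uniform_machine, (0, 1), (0, 1), 3)
```

A human who picks uniformly among two actions at each of three steps has causal entropy $3\log 2$, whereas a human who deterministically mirrors the machine has causal entropy zero even though the joint sequence is still random.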

In the sequel we consider optimizing functionals of the causally conditioned distributions for the human (and the machine). Doing so means optimizing over a set of conditional distributions $\{P(h_{t}|h^{t-1},m^{t})\,|\,t=1,\ldots,T\}$, which for simplicity we also denote by $P(h^{T}\|m^{T})$. It can be shown that these collections of distributions belong to a convex polytope denoted by ${\cal C}_{H}$ (resp. ${\cal C}_{M}$). Indeed, according to ZBD11, $P(h^{T}\|m^{T})\in{\cal C}_{H}$ is equivalent to the requirement that $P(h^{T}\|m^{T})$ can be factorized into a product of conditional distributions as in (1). A similar result holds for $Q(m^{T}\|h^{T})$. This generalizes the notion of optimizing over a set of distributions with a given support, e.g., over the simplex. In the sequel, for the sake of simplicity, we omit the constraints $P(h^{T}\|m^{T})\in{\cal C}_{H}$ and $Q(m^{T}\|h^{T})\in{\cal C}_{M}$ when they appear in optimization problems; it is understood that one is optimizing over causally conditioned distributions that must be properly normalized. The overall human estimation problem can thus be expressed as follows.

Definition 1

(Human estimation problem) Given a known machine policy $Q(m^{T}\|h^{T})$ and a set of moments $\mathbf{c}_{f}$ associated with human-machine interaction for the equality constraints, the constrained maximum entropy estimate of the model for human behavior, say $\hat{P}(h^{T}\|m^{T})$, is the solution to the following problem:

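$$\max_{P(h^{T}\|m^{T})} \left\{ H_{PQ}(H^{T}\|M^{T}) \,\middle|\, P(h^{T}\|m^{T}) \in \mathcal{P}_{\mathcal{F},\mathcal{G}}^{Q} \right\}. \qquad (4)$$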

Note that since this problem is convex, the solution $\hat{P}(h^{T}\|m^{T})$ is unique. However, it depends on the underlying machine policy $Q$ both through the cost function and through the constraints.

2.2 Machine optimization

We assume one has defined a reward function $r(h^{T},m^{T})$ over human-machine interactions. This function might reflect both desirable human outcomes/decisions as well as machine costs for taking certain sequences of actions. Given an estimated model for the human behaviour, $\hat{P}(h^{T}\|m^{T})$ , one can in turn consider choosing a reward maximizing machine policy, i.e.,

$$\max_{Q(m^{T}\|h^{T})} E_{\hat{P}Q}\left[r(H^{T},M^{T})\right].$$

A direct optimization of the reward as above would result in machine policies that take deterministic actions associated with the ‘best’ choices. Such policies are likely to be vulnerable to errors in the estimated human behaviour model due to sampling noise. This has also been observed in the context of reinforcement learning, see, e.g., AS97; S1996. Such machine policies may also be limited in the degree to which they ‘explore’ interaction with the human, so that the interaction data subsequently obtained may lead to poor estimates of human behavior and sub-optimal results. Further, we also posit that deterministic machine policies have poor characteristics from a human interaction perspective, e.g., they might be boring/too predictable, leading to poor engagement EpN15, and/or in certain settings may be unfair. For example, in an advertising setting, one might want to incorporate randomness in placing advertisements to ensure fairness and/or encourage competition.

To address these concerns we propose adding a ‘regularizing’ entropy term to the reward function. Thus given an estimated model for human behavior $\hat{P}$ , the machine’s policy is obtained as the solution to the following problem.

Definition 2

(Machine policy optimization problem) Given an estimated model for human behavior $\hat{P}(h^{T}\|m^{T})$ , the reward maximizing machine policy is given by the solution to

[TABLE]

where $\gamma>0$ controls the degree to which one weighs entropy versus reward in the machine policies. As we shall see, this formulation is in fact similar to the human estimation problem introduced earlier.
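To build intuition for the entropy-regularized problem, consider the single-round case $T=1$. Assuming the objective takes the form of expected reward plus $1/\gamma$ times the entropy of $Q$ (an assumption for this sketch; the exact weighting is not reproduced here), the maximizer is a softmax policy $Q(m)\propto\exp\big(\gamma\sum_{h}\hat{P}(h|m)\,r(h,m)\big)$, which agrees with the terminal case of the $Y_{\gamma}$ recursion. The estimated model `P_hat`, the reward function, and the action-space sizes below are made up for illustration.

```python
import math

def softmax_policy(P_hat, reward, n_h, n_m, gamma):
    """Q(m) proportional to exp(gamma * sum_h P_hat[m][h] * reward(h, m))."""
    scores = [gamma * sum(P_hat[m][h] * reward(h, m) for h in range(n_h))
              for m in range(n_m)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def objective(Q, P_hat, reward, n_h, n_m, gamma):
    """Expected reward plus (1/gamma) times the entropy of Q (assumed form)."""
    exp_reward = sum(Q[m] * P_hat[m][h] * reward(h, m)
                     for m in range(n_m) for h in range(n_h))
    entropy = -sum(q * math.log(q) for q in Q if q > 0.0)
    return exp_reward + entropy / gamma

def reward(h, m):
    # Hypothetical reward: 1 if the human accepts (h == 1), minus a cost
    # that grows with the machine action index.
    return (1.0 if h == 1 else 0.0) - 0.1 * m

# Hypothetical estimated human model: P_hat[m][h] = P(h | m).
P_hat = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
Q = softmax_policy(P_hat, reward, n_h=2, n_m=3, gamma=5.0)
best = objective(Q, P_hat, reward, 2, 3, 5.0)
```

Larger $\gamma$ concentrates $Q$ on the highest-reward action, while smaller $\gamma$ pushes $Q$ towards the uniform policy, which is the exploration/robustness trade-off described above.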

2.3 Closing the loop: Alternating Reward-Entropy Ascent (AREA) Algorithm

Note that the optimized machine policy obtained via (5) depends on an estimated model for human behavior, which in turn was estimated by solving (4) based on data obtained from human-machine interactions using the previously selected machine policy. The two machine policies need not be the same, possibly making the estimation and optimization steps inconsistent. To resolve this, we propose the Alternating Reward-Entropy Ascent (AREA) algorithm exhibited in Figure 1. We begin with a default machine policy (for example, the machine might choose actions at random), denoted by $\hat{Q}^{(0)}(m^{T}\|h^{T})$. Under this machine policy we collect data/realizations of human-machine interactions. From the data, we can estimate the feature moments, which, in turn, enable estimation of a model for human behavior $\hat{P}^{(0)}$ through our inference phase, i.e., (4). Based on the estimated model of human behavior we generate a new machine policy through the machine optimization phase, where the optimization is based on $\hat{P}^{(0)}$, obtaining the next machine policy $\hat{Q}^{(1)}$. This alternating process generates a sequence of causally conditioned distributions given by $\hat{Q}^{(0)}\rightarrow\hat{P}^{(0)}\rightarrow\hat{Q}^{(1)}\rightarrow\hat{P}^{(1)}\rightarrow\dots$, which we refer to as AREA iterations.
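The alternation $\hat{Q}^{(0)}\rightarrow\hat{P}^{(0)}\rightarrow\hat{Q}^{(1)}\rightarrow\dots$ can be sketched for a single-round interaction ($T=1$). This is a simplified illustration under stated assumptions, not the authors' implementation: the estimation step uses empirical conditional frequencies, which is what moment matching reduces to when the features are indicators of every $(h,m)$ pair; the optimization step uses the softmax form that maximizes expected reward plus scaled entropy; and `P_true`, the reward, and all parameter values are hypothetical.

```python
import math
import random

def sample(weights, rng):
    """Draw an index with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def area_single_round(P_true, reward, gamma, n_samples, n_iters, seed=0):
    rng = random.Random(seed)
    n_m, n_h = len(P_true), len(P_true[0])
    Q = [1.0 / n_m] * n_m  # default policy Q^(0): uniform
    P_hat = [[1.0 / n_h] * n_h for _ in range(n_m)]
    for _iteration in range(n_iters):
        # 1) Collect interaction data under the current machine policy.
        counts = [[0] * n_h for _ in range(n_m)]
        for _draw in range(n_samples):
            m = sample(Q, rng)
            h = sample(P_true[m], rng)
            counts[m][h] += 1
        # 2) Estimate the human model; machine actions that were never
        #    sampled keep their previous estimate (a choice made here).
        for m in range(n_m):
            seen = sum(counts[m])
            if seen > 0:
                P_hat[m] = [c / seen for c in counts[m]]
        # 3) Entropy-regularized policy update (softmax of expected reward).
        scores = [gamma * sum(P_hat[m][h] * reward(h, m) for h in range(n_h))
                  for m in range(n_m)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        Q = [w / z for w in weights]
    return Q, P_hat

def accept_reward(h, m):
    return 1.0 if h == 1 else 0.0

# Hypothetical true human behavior: P_true[m][h] = P*(h | m).
P_true = [[0.8, 0.2], [0.3, 0.7]]
Q_final, P_final = area_single_round(P_true, accept_reward, gamma=4.0,
                                     n_samples=2000, n_iters=5)
```

Because the softmax policy keeps every machine action at positive probability, each action continues to be sampled across iterations, which illustrates how entropy regularization supports ongoing estimation of the human model.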

3 Related Work

Markov decision processes and reinforcement learning: The optimization of human-machine interactions can in principle be modelled as a Markov decision process (MDP), where the human behavior can be viewed as driven by a transition kernel among a set of states, and the machine behavior corresponds to a sequence of actions taken in response to the human’s behavior. The underlying assumptions are that there exist a state space for the human and an action space such that the distribution of future states depends only on the current state (say of the human) and the chosen action (say of the machine). In such a setting, one can define a reward function and consider optimizing the associated machine policy, see e.g., Put94; Ber00; BaR11. When the transition kernel is unknown, but assumed Markov, the resulting problem is known as reinforcement learning, see e.g., the surveys KLM96; SB98. Both model-based and model-free reinforcement learning approaches (and methods that combine both) have been studied in the literature. Model-based methods combine estimation of the environment and optimization of machine actions, while model-free methods aim to directly optimize the machine without first estimating a model of the environment. For example, Q-learning aims to directly estimate the value of a state-action pair, denoted by $Q(s,a)$, where $s$ is the current state and $a$ is the candidate action. The $Q$-function can be used to select the optimal sequence of machine actions KLM96; SB98. The traditional framework of reinforcement learning relies heavily on the assumption that the underlying environment is Markov. However, as deep learning technologies emerge, deep Q-learning VKD15 has been devised to approximately solve this problem. Indeed, to overcome the difficulties brought by non-Markov environments, one option is to first enlarge the state space of the underlying environment significantly, for example, to include all possible histories of the system, and then to use a deep neural network to encode the $Q$-function and fit the network to the observed data. This approach also has its challenges in terms of demanding data requirements and might not be applicable to some use cases.
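For readers unfamiliar with Q-learning, the standard tabular update is sketched below. Note that $Q(s,a)$ here is the state-action value function, not the machine policy $Q(m^{T}\|h^{T})$ used elsewhere in this paper. The state names, action names, learning rate `alpha`, and discount factor are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, discount=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + discount * max_a' Q(s',a') - Q(s,a)).
    """
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + discount * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

actions = ["show_ad", "wait"]      # hypothetical machine actions
q_table = defaultdict(float)       # unseen state-action pairs start at 0
first = q_learning_update(q_table, "s0", "show_ad", 1.0, "s1", actions)
second = q_learning_update(q_table, "s0", "show_ad", 1.0, "s1", actions)
```

The update relies on the next state summarizing everything relevant about the past, which is exactly the Markov assumption questioned above for human behavior.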

In our framework, when the reward function is decomposable over time, i.e., $r(h^{T},m^{T})=\sum_{t=1}^{T}r(h_{t},m_{t})$, and the estimated human model $\hat{P}(h^{T}\|m^{T})$ is one-step Markov, the machine optimization program reduces to a traditional MDP setting with a possibly time-inhomogeneous transition kernel, in which the reward function is regularized by an entropy term to promote exploring different actions. Some recent literature suggests that model-based methods may be preferred to model-free methods in terms of sample complexity AS97; S1996. In the special case of a Markov model, our approach may be considered a variation of model-based reinforcement learning, where the model is learned by maximizing causal entropy subject to moment constraints, and the machine behavior is regularized using the causal entropy of the machine process. As discussed, this analogy no longer holds in the general case.

We are aware of only a few cases where (relative) entropy regularization has been combined with Markov decision processes and related models. GRW14 consider a generalization of the Markov decision process where, instead of impacting the process through some actions, the agent can directly manipulate the transition matrix of the system state. However, such manipulation incurs a cost that is proportional to the relative entropy between the transition probability after manipulation and the transition matrix of a ‘passive’ process which models the ‘natural’ system evolution. In MeR05 , the authors propose an entropy-regularized cost function to approximately solve a partially observable Markov decision process (POMDP) model efficiently. Because the exact system state is unknown (i.e., only partially observed), the agent must estimate it through the reward it receives and a noisy observation of its current state. Therefore, there is a trade-off between gaining more profit based on the current belief, which requires focusing on the most profitable action, and improving the quality of estimation, which requires exploring different actions. The authors of MeR05 use the expected entropy of the agent’s belief state as a proxy for how well it has explored different actions. The main challenge associated with MDPs is that the transition kernels of the human’s behavior may have long-term dependencies, so an extremely large state space may be required to remain in the Markov setting.

*Bandit problem:* State-of-the-art approaches to problems with such a sequential and interactive context also include the multi-armed bandit problem and its variants BuC12 , which are widely discussed and used in applications including computational advertising. In that context, after each user query the search engine uses user features such as gender, age and search history as the context to pick an ad, which is modeled as the arm, such that the user has a good chance of clicking through the ad. The most representative method is the ILOVETOCONBANDITS algorithm proposed in AHK14 , where it is assumed that the reward received for each attempt depends on some observable random ‘context’. However, the approach depends heavily on an i.i.d. assumption on the environment in order to improve the quality of the estimation by accumulating samples. Therefore, when the user does not make independent decisions or has long-term memory, the performance of such contextual bandit based solutions will not be acceptable.

The most general way to model such problems in a multi-armed bandit framework is the continuum-armed bandit, for example, TyG13 , where the arm to pick can be a vector of real numbers instead of a discrete index. We can directly model the machine’s policy $Q(m^{T}\|h^{T})$ as the arm. However, when the support of the arm is large, the convergence of the algorithm is slow. It also requires prior knowledge of the number of iterations needed, and thus cannot be implemented in a fully online manner.

4 Solution to AREA’s Optimization Problems

The Lagrangians for the optimization problems (4) and (5) have similar forms. We shall begin our discussion of the solution approach, based on ZBD13 , for the human estimation problem and subsequently that of the machine optimization, pointing out some key results and notation that will be critical for our development in the sequel.

4.1 Solution to human estimation problem

It has been shown in ZBD13 that the human estimation problem is concave in $P(h^{T}\|m^{T})$ given $Q(m^{T}\|h^{T})$ , and that the solution can be found via its dual.

Theorem 1

ZBD13 * The dual form of the human estimation problem (4) is given by:*

[TABLE]

where

[TABLE]

and

[TABLE]

The associated human model for dual variables $\boldsymbol{\lambda}$ is given by ${P}_{\bm{\lambda}}(h_{t}|h^{t-1},m^{t})=\frac{Z_{\bm{\lambda}}(h_{t}|h^{t-1},m^{t})}{Z_{\bm{\lambda}}(h^{t-1},m^{t})}.$

The optimal dual $\bm{\lambda}^{*}$ can be found by a subgradient-based method, see ZBD13 or Appendix A. In the sequel it will be useful to denote the solution to the human estimation problem by $h^{*}(Q,\mathbf{c}_{f},\mathbf{c}_{g})$ to make clear its dependence on the machine policy $Q$ , the feature moments $\mathbf{c}_{f}$ estimated from human-machine interactions, and the constants $\mathbf{c}_{g}.$

The solution given in Theorem 1 has several interpretations, two of which are given in the following two theorems.

Theorem 2

ZBD13 * Using statistics from the true distribution without sampling error, maximizing the causal entropy subject to feature constraints in human estimation problem is equivalent to maximizing the log causal likelihood of the true distribution over the family of causal Gibbs distributions.*

[TABLE]

Theorem 3

ZBD13 * The human estimation problem is equivalent to minimizing the worst case causal log-loss when the true human behavior is chosen adversarially.*

[TABLE]

4.2 Solution to machine optimization problem

It should be clear at this point that the objective function in (5) is similar to the Lagrangian of Problem (4) with a fixed ‘dual variable’ $\gamma$ . Thus the following result is fairly straightforward.

Theorem 4

For a given model of human behavior $\hat{P}(h^{T}\|m^{T})$ the solution to the machine optimization problem (5), $\hat{Q}(m^{T}\|h^{T})$ is given as follows. Let

[TABLE]

where $Y_{\gamma}(h^{t},m^{t})=\sum_{m_{t+1}}Y_{\gamma}(m_{t+1}|h^{t},m^{t}),~{}~{}Y_{\gamma}=\sum_{m_{1}}Y_{\gamma}(m_{1})$ . Then the optimal machine policy is $\hat{Q}(m_{t}|h^{t-1},m^{t-1})=\frac{Y_{\gamma}(m_{t}|h^{t-1},m^{t-1})}{Y_{\gamma}(h^{t-1},m^{t-1})}$ and $\hat{Q}(m_{1})=\frac{Y_{\gamma}(m_{1})}{Y_{\gamma}}.$

Please see Appendix B for detailed proof.

In the sequel it will be useful to represent the result stated in Theorem 4 as follows. In particular the auxiliary function $\mathbf{Y}_{\gamma}:=\{Y_{\gamma}(m_{t}|h^{t-1},m^{t-1}),\forall 1\leq t\leq T\}$ generated by the procedure given in Theorem 4 depends on the human model and so is denoted by $\mathbf{Y}_{\gamma}=m(\hat{P}).$ The associated optimal machine policy $\hat{Q}$ is in turn a function of $\mathbf{Y}_{\gamma}$ denoted by $\hat{Q}=m^{*}(\mathbf{Y}_{\gamma}).$

5 Complexity of AREA Algorithm

As can be seen, the dual problem of the human estimation problem is over a vector $\boldsymbol{\lambda}$ of dimension $|\mathcal{F}|+|\mathcal{G}|$ . The authors of ZBD13 show that the optimal dual variables can be found by a recursion that only involves computing the expectation of the feature functions with respect to the joint distribution $P_{\boldsymbol{\lambda}}Q$ , where $P_{\boldsymbol{\lambda}}$ is the human distributional model associated with $\boldsymbol{\lambda}$ . However, when updating the dual variables, computing those expectations is intractable in the most general setting. Specifically, define the space complexity as the number of variables that need to be stored, and the time complexity as the number of basic math operations (e.g. addition, multiplication and exponential function evaluation) required to carry out the update. Because the number of conditioning sequences in (10) grows exponentially in $T$ , if we need to put all conditional PMFs into memory and then compute the joint PMF accordingly, both the space and time complexities required are exponential in $T$ . Fortunately, when the feature functions have specific forms, the complexity of computing such updates can be reduced. Specifically, we will discuss cases where one iteration of the AREA algorithm described in Section 2.3 has polynomial complexity in $T$ .
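As a rough illustration of this blow-up, the sketch below counts the conditioning sequences $(h^{t-1},m^{t})$ needed to store every conditional PMF of the human model for $t=1,\ldots,T$, and compares that count with a table of size $T|\mathcal{H}||\mathcal{M}|$. The function names are hypothetical and the counting convention is an assumption made for illustration.

```python
def num_conditioning_sequences(T, n_h, n_m):
    """Count pairs (h^{t-1}, m^t) over t = 1..T (illustrative convention)."""
    return sum(n_h ** (t - 1) * n_m ** t for t in range(1, T + 1))

def polynomial_table_size(T, n_h, n_m):
    """Size of a per-step table indexed by (t, h_t, m_t)."""
    return T * n_h * n_m

if __name__ == "__main__":
    for T in (2, 5, 10, 30):
        print(T, num_conditioning_sequences(T, 6, 6), polynomial_table_size(T, 6, 6))
```

With $|\mathcal{H}|=|\mathcal{M}|=6$ the first count grows by a factor of 36 with each additional step, whereas the second grows linearly in $T$.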

Definition 3

A feature function $f(h^{T},m^{T})$ is said to be decomposable if it can be written as $f(h^{T},m^{T})=\sum_{t=1}^{T}f_{t}(h_{t},m_{t}).$

Definition 4

*A function $f(h^{T},m^{T})$ is said to be path-based if it is proportional to the indicator function of a specific realization of the human-machine interaction, say $(\bar{h}^{T},\bar{m}^{T})$ , i.e., $f(h^{T},m^{T})=c\mathbf{1}_{\{(h^{T},m^{T})=(\bar{h}^{T},\bar{m}^{T})\}}.$ *
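The two definitions can be illustrated with a short sketch. The helper names `decomposable` and `path_based` are hypothetical, sequences are represented as Python tuples or lists, and the per-step function receives the 1-based time index so that time-dependent terms $f_{t}$ can be expressed.

```python
from typing import Callable, Sequence

Feature = Callable[[Sequence[int], Sequence[int]], float]

def decomposable(f_t: Callable[[int, int, int], float]) -> Feature:
    """Build f(h^T, m^T) = sum_t f_t(h_t, m_t); f_t is called as f_t(t, h_t, m_t)."""
    def f(h, m):
        return sum(f_t(t, h_t, m_t)
                   for t, (h_t, m_t) in enumerate(zip(h, m), start=1))
    return f

def path_based(h_bar: Sequence[int], m_bar: Sequence[int], c: float = 1.0) -> Feature:
    """Build f(h^T, m^T) = c * 1{(h^T, m^T) == (h_bar, m_bar)}."""
    h_bar, m_bar = tuple(h_bar), tuple(m_bar)
    def f(h, m):
        return c if (tuple(h), tuple(m)) == (h_bar, m_bar) else 0.0
    return f
```

For example, `decomposable(lambda t, h, m: float(h == m))` counts the steps at which the human action equals the machine action, while `path_based` is nonzero on exactly one realization of the interaction.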

Note that it is always desirable to include the reward function in the equality feature set ${\cal F}$ to ensure that the estimated human model matches the true human behavior in terms of the associated mean rewards. Then we have the following result.

Theorem 5

*Suppose the reward function $r(h^{T},m^{T})$ can be written as a sum of a decomposable function and a set $\mathcal{R}_{p}$ of path-based functions, and the remaining feature functions are either decomposable or path-based, i.e., $\mathcal{F}=\mathcal{F}_{p}\cup\mathcal{F}_{d}\cup\{r(h^{T},m^{T})\}$ and $\mathcal{G}=\mathcal{G}_{p}\cup\mathcal{G}_{d}$ , where $\mathcal{F}_{p}$ and $\mathcal{G}_{p}$ denote path-based equality/inequality features, and $\mathcal{F}_{d}$ and $\mathcal{G}_{d}$ decomposable equality/inequality features, respectively. Suppose further that the initial machine’s policy $\hat{Q}^{(0)}$ is uniformly random. Then the space complexity of each dual update of the human estimation problem is $O\left((|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)T|\mathcal{H}||\mathcal{M}|\right)$ , and the time complexity of each dual update is $O(T(|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)\max(T,|\mathcal{H}||\mathcal{M}|))$ . The time and space complexity of the machine optimization problem are both $O((|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)T|\mathcal{H}||\mathcal{M}|)$ . *

Please see Appendix C for detailed proof.

Remark: We envisage that the inclusion of path-based and decomposable feature and reward functions might allow a fairly rich framework to capture relevant interaction characteristics. In particular, path-based features are capable of modeling detailed long-term memory in human-machine interactions, while decomposable features can model short-term dependencies. As shown in Theorem 5, for such settings the solutions to (4) and (5) require steps with only polynomial space and time complexity.

6 AREA Convergence

As discussed in previous sections, the AREA Algorithm is aimed at achieving high rewards through consistency in the estimated human model and optimized machine policy. In this section, we characterize AREA’s convergence properties.

The convergence of the algorithm can be guaranteed in two extremal cases. Clearly, if the sets of feature functions $\mathcal{F}$ and $\mathcal{G}$ are rich enough that the true human behaviour is recovered as the solution to (4), then AREA converges; see Appendix D for details. Alternatively, if the feature set is sufficient to guarantee that the actual human behavior along the ‘paths’ that affect the reward is perfectly captured by the estimated human model, then AREA also converges in one iteration.

Theorem 6

Suppose the reward function is path-based, i.e., $r(h^{T},m^{T})=\mathbf{1}_{\{(h^{T},m^{T})=(\bar{h}^{r,T},\bar{m}^{r,T})\}}$ , is included in the feature set $\mathcal{F}$ , and the initial machine’s policy $\hat{Q}^{(0)}$ has full support. Consider a modified version of the human estimation problem which includes the following additional features. For each path-based feature, i.e., $i\in\mathcal{F}_{p}$ , we include $T-1$ auxiliary features $\mathcal{F}_{p}^{i}$ as follows:

[TABLE]

ensuring that both the full-length paths and their prefixes are matched for the path-based features. For the modified set of equality features $\mathcal{F}=\mathcal{F}_{d}\cup(\bigcup\limits_{i\in\mathcal{F}_{p}}\mathcal{F}_{p}^{i})$ and an arbitrary set of inequality features $\mathcal{G}$ , AREA converges in one iteration.

Please see Appendix E for proof. Note that by following a similar argument as in the proof of Theorem 5, one can show that features included in $\mathcal{F}_{p}^{i}$ will not undermine the polynomial complexity. One just needs to keep track of the prefix overlap between the current conditioning sequence and the support of each path-based feature, in order to compute the associated $Z_{\bm{\lambda}}$ .

For more general cases, the convergence of the AREA algorithm is subtle. Note that the human estimation problem (4) depends on the machine policy $Q(m^{T}\|h^{T})$ used. Thus, given $Q(m^{T}\|h^{T})$ at the current iteration, one can determine the associated model for human behavior $h^{*}(Q,\mathbf{c}_{f},\mathbf{c}_{g})$ , which may in turn change the optimal machine policy. This makes the analysis of convergence difficult. To facilitate convergence, we propose introducing an additional inequality constraint to the human estimation problem (4).

During the $n$ th iteration, given the previously obtained $\hat{P}^{(n-1)}$ and $\hat{Q}^{(n)}$ we shall include the following step-dependent inequality constraint in $\mathcal{G}$ . Let $g^{0,(n)}(h^{T},m^{T})=-\log\hat{Q}^{(n)}(m^{T}\|h^{T})+\gamma r(h^{T},m^{T}),$ and let $c_{g}^{0,(n)}=E_{\hat{P}^{(n-1)}\hat{Q}^{(n)}}[g^{0,(n)}(H^{T},M^{T})]$ , then on AREA iteration $n$ we require that $E_{P\hat{Q}^{(n)}}[g^{0,(n)}(H^{T},M^{T})]\geq c_{g}^{0,(n)}.$

Let us define a sequence $\{L^{(n)}\}$ of entropy regularized expected rewards across iterations, i.e., $L^{(n)}:=E_{\hat{P}^{(n)}\hat{Q}^{(n)}}[g^{0,(n)}(H^{T},M^{T})].$ Then we have the following result.

Theorem 7

Consider the AREA algorithm optimizing a human-machine interactive process with fixed sets of equality/inequality constraints ${\cal F}$ and ${\cal G}$ . Suppose ${\cal G}$ is modified to ${\cal G}^{(n)}$ by adding the additional step-dependent inequality constraint $E_{P\hat{Q}^{(n)}}[g^{0,(n)}(H^{T},M^{T})]\geq c_{g}^{0,(n)}$ . Then the modified AREA iterations generate a bounded nondecreasing sequence $\{L^{(n)}\}$ , which must converge.

The proof is included in Appendix F.

Remark: Note that when the conditions in Theorem 5 hold, $\hat{Q}^{(n)}$ takes independent actions once the path deviates from the support of all path-based feature functions. Thus the introduced step-dependent feature function can be written as: $g^{0,(n)}(h^{T},m^{T})=-\sum_{t=1}^{T}\log\hat{Q}_{t}^{(n)}(m_{t})+\sum_{i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}}\left(\sum_{t=1}^{T}\log\hat{Q}_{t}^{(n)}(\bar{m}^{i}_{t})-\log\hat{Q}^{(n)}(\bar{m}^{i,T}\|\bar{h}^{i,T})\right)\mathbf{1}_{\{h^{T}=\bar{h}^{i,T},m^{T}=\bar{m}^{i,T}\}},$ which is still a weighted sum of path-based functions and decomposable functions. This in turn means that, with the added constraint, the iteration steps still have the polynomial complexity shown in Theorem 5.

$L^{(n)}$ can be regarded as a measure of the performance of the associated machine policy $\hat{Q}^{(n)}$ . Indeed, if $\hat{Q}^{(n)}$ were a fixed point of the AREA recursion, then the optimal objective function of (5) would have converged to $L^{(n)}$ . Also note that by further assuming that the feature and reward functions are decomposable, we can characterize the performance for the converging sequence $\{L^{(n)}\}$ ; see Appendix G.
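For very small problems, the quantity $E_{PQ}[-\log Q(M^{T}\|H^{T})+\gamma r(H^{T},M^{T})]$ can be checked by brute-force enumeration. The sketch below is intended only as a sanity check, since its cost is exponential in $T$. The callables `P` and `Q`, their argument order, and the function name are assumptions made for illustration.

```python
import itertools
import math

def regularized_reward(P, Q, r, gamma, T, H, M):
    """Enumerate all paths and return E[-log Q(m^T || h^T) + gamma * r].

    P(h_t, h_past, m_upto_t): human conditional probability (assumed callable).
    Q(m_t, h_past, m_past):   machine conditional probability (assumed callable).
    r(h, m): reward of a complete interaction.
    """
    total = 0.0
    for h in itertools.product(H, repeat=T):
        for m in itertools.product(M, repeat=T):
            prob, log_q = 1.0, 0.0
            for t in range(T):
                q = Q(m[t], h[:t], m[:t])
                p = P(h[t], h[:t], m[:t + 1])
                if q == 0.0 or p == 0.0:
                    prob = 0.0          # zero-probability path contributes nothing
                    break
                prob *= q * p
                log_q += math.log(q)
            if prob > 0.0:
                total += prob * (-log_q + gamma * r(h, m))
    return total
```

As a quick consistency check, a uniformly random machine policy over two actions with zero reward should give $T\log 2$, the causally conditioned entropy of that policy.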

7 Evaluation

In this section, we conduct a preliminary numerical evaluation of AREA using synthetic human-machine interaction data based on the Leaky Competing Accumulator (LCA) model, see UsM01 . This non-linear noisy model is known to capture common human decision-making processes driven by external stimuli.

7.1 Robustness against sampling noise

Throughout the paper we have assumed no sampling noise when estimating the moments of features. In practice the available data may be limited or costly, and thus noisy estimates are inevitable. The robustness of maximum entropy inference against such noise is mathematically characterized in Theorem 6 of ZBD13 . In this section, we explore the robustness of the AREA algorithm to noise when the number of samples per iteration is limited. The detailed set-up of the LCA model for human-machine interactions is included in Appendix H.

We consider a setting where $T=30$ , $\mathcal{H}=\{1,\ldots,6\}$ , $\mathcal{M}=\{1,\ldots,6\}$ and $\gamma=2$ . The reward function is $r(h^{T},m^{T})=\sum_{t=1}^{T}r_{t}(h_{t},m_{t})$ , where $r_{t}(h_{t},m_{t})=\mathbf{1}_{\{t~\textrm{mod}~5=0\}}\mathbf{1}_{\{h_{t}=1\}}+\mathbf{1}_{\{t~\textrm{mod}~5\neq 0\}}\mathbf{1}_{\{h_{t}\neq 1\}}$ , i.e., we are looking to funnel the human behavior to choosing 1 only at $t=5,10,\cdots,30.$ The features include the reward function itself, together with the number of times the human follows the machine, $f^{1}(h^{T},m^{T})=\sum_{t=1}^{T}\mathbf{1}_{\{h_{t}=m_{t}\}}$ , and a ‘weighted’ count of such following events that emphasizes the times $t=5,10,\cdots,30$ , namely $f^{2}(h^{T},m^{T})=\sum_{t=1}^{T}f^{2}_{t}(h_{t},m_{t})$ , where $f^{2}_{t}(h_{t},m_{t})=(\mathbf{1}_{\{t~\textrm{mod}~5=0\}}+0.25\mathbf{1}_{\{t~\textrm{mod}~5\neq 0\}})\mathbf{1}_{\{h_{t}=m_{t}\}}$ . The challenge here is for the machine to learn to periodically drive the human (a nonlinear model) away from 1 and back to 1.
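The reward and the two features of this experiment translate directly into code. The sketch below is a straightforward transcription of the formulas above; the function names are hypothetical, and sequences are indexed so that the first element corresponds to $t=1$.

```python
def reward(h, m):
    """r(h^T, m^T): +1 when h_t == 1 at multiples of 5, +1 when h_t != 1 otherwise."""
    total = 0
    for t, h_t in enumerate(h, start=1):
        if t % 5 == 0:
            total += int(h_t == 1)
        else:
            total += int(h_t != 1)
    return total

def f1(h, m):
    """Number of steps at which the human action equals the machine action."""
    return sum(int(h_t == m_t) for h_t, m_t in zip(h, m))

def f2(h, m):
    """Weighted following count: weight 1 at multiples of 5, weight 0.25 otherwise."""
    return sum((1.0 if t % 5 == 0 else 0.25) * int(h_t == m_t)
               for t, (h_t, m_t) in enumerate(zip(h, m), start=1))
```

For instance, with $T=30$, a human who chooses 1 exactly at $t=5,10,\ldots,30$ and any other action elsewhere collects the maximum reward of 30.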

The results in Fig. 2(a) exhibit the convergence of the regularized reward function $L$ vs the number of AREA steps, when different numbers of samples are used to estimate the moments in AREA’s inference step. Clearly, AREA converges almost immediately although it exhibits variations when small samples ( $\leq 100$ ) are used.

7.2 Performance in average reward and causally conditioned entropy

Next we compare the performance of AREA to a simple Q-learning algorithm KLM96 with finite memory. We shall compare the attained reward and empirical causally conditioned entropy of the optimized machine policies. In this setting the humans’ actions are viewed as the environment. Thus, instead of ‘scoring’ each action based on the most recent humans’ response, Q-learning scores each action based on the most recent $\tau$ interactions, together with $t$ to accommodate the transient nature of the process, i.e., it keeps track of $Q((h_{t-\tau+1},\ldots,h_{t},m_{t-\tau+1},\ldots,m_{t}),t+1,m_{t+1})$ .

At time $t$ , the machine chooses an action using a softmax of the $Q$ function given the latest interaction history $(h_{t-\tau+1},\ldots,h_{t},m_{t-\tau+1},\ldots,m_{t})$ and $t+1$ , and then updates the $Q$ function accordingly. We shrink the state space to $|\mathcal{H}|=|\mathcal{M}|=3$ and $T=20$ so that the $Q$ function fits in memory, and also change $\gamma$ to 4 to put more emphasis on reward. We consider the same rewards and features as those used for AREA in Section 7.1. We let both algorithms complete 100 ‘interactions’ with our synthetic human model. For AREA, we collect 10 human-machine interaction samples per AREA iteration, and run 10 iterations in total. For Q-learning we also allow a total of 100 interactions. We set $\tau=1$ since further experiments show that a greater $\tau$ impairs the performance of Q-learning, as it requires more samples to learn. The detailed setup for Q-learning can be found in Appendix H. For both algorithms, we keep track of the average reward obtained and the estimated causally conditioned entropy of the machine policy after integrating the first $n$ samples. We repeat the simulation for 5 rounds to obtain the averages, and the results, together with the 90% confidence intervals, are shown in Figure 2(b). These representative results suggest that the AREA algorithm is typically very efficient, delivering higher rewards than Q-learning while at the same time realizing (as desired) higher machine policy entropy with a very limited number of samples.
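A finite-memory Q-learning baseline of this kind could be sketched as follows. This is not the setup of Appendix H: the learning rate `alpha`, the undiscounted one-step target, the unit softmax temperature, and the `human` callable (any stand-in for a human simulator) are all assumptions made for illustration, and the sketch assumes $\tau\geq 1$.

```python
import math
import random
from collections import defaultdict

def softmax_probs(q_values):
    """Softmax with max-subtraction for numerical stability."""
    mx = max(q_values)
    weights = [math.exp(q - mx) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def run_episode(Q, human, reward_t, T, n_m, tau, alpha, rng):
    """One interaction of length T; Q maps (state, action) -> value and is updated in place."""
    h_hist, m_hist, total = [], [], 0.0
    for t in range(1, T + 1):
        # State: last tau human/machine actions plus the time index.
        state = (tuple(h_hist[-tau:]), tuple(m_hist[-tau:]), t)
        probs = softmax_probs([Q[(state, a)] for a in range(n_m)])
        m_t = rng.choices(range(n_m), weights=probs)[0]
        h_t = human(h_hist, m_hist + [m_t], rng)   # hypothetical human simulator
        rew = reward_t(t, h_t, m_t)
        h_hist.append(h_t)
        m_hist.append(m_t)
        next_state = (tuple(h_hist[-tau:]), tuple(m_hist[-tau:]), t + 1)
        future = 0.0 if t == T else max(Q[(next_state, a)] for a in range(n_m))
        Q[(state, m_t)] += alpha * (rew + future - Q[(state, m_t)])
        total += rew
    return total
```

A table created with `defaultdict(float)` can be passed as `Q` and reused across episodes, so that the estimates accumulate over successive interactions.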

8 Conclusions

The paper proposes a general data-driven framework to optimize possibly complex human-machine interaction processes. At the core is the AREA algorithm which jointly solves the problem of estimating a model for human behaviour and optimizing the machine policy based on a constrained maximum entropy estimation. An underlying goal is to enable the integration of domain-specific knowledge regarding relevant interaction characteristics or known human biases by matching the observed moments of feature functions. The paper details the formal optimization problems and solutions underlying the AREA algorithm and explores a modification to significantly reduce the complexity when the feature and reward functions are path-based and/or decomposable. The setting considered is fairly general, allowing one to incorporate human-machine interactions with long memory. The characterization of AREA is provided in terms of ( $i$ ) its space and time complexity, and ( $ii$ ) its convergence in various settings. A simple numerical evaluation is used to demonstrate the robustness of AREA to noise when sample sizes are limited, along with a performance comparison to Q-learning. The analysis and simple validation suggest that AREA may achieve most of its gains in one iteration particularly if sufficient domain specific features/rewards are properly integrated.

Appendix A Solution to dual of Problem (4)

Theorem 4 in [14] shows the strong duality of Problem (4). Therefore, the optimal dual induces the optimal primal solution. If we find a $\bm{\lambda}^{*}$ minimizing (6), the estimated human model $\hat{P}$ is given by $P_{\bm{\lambda}^{*}}$ . The dual problem can be solved via a subgradient-based algorithm. In particular, if we use an adaptive learning rate $\eta^{(n)}\in\mathbb{R}^{+}$ , the dual variable should be updated by

[TABLE]

where $\mathbf{c}_{f}=E_{P^{*}Q}[\mathbf{f}(H^{T},M^{T})]$ are the moments of the feature functions associated with the equality constraints obtained from the human-machine interaction data in the inference step, and the gradients are computed using the recursive form defined in Theorem 1. Then the sequence $\{\boldsymbol{\lambda}^{(n)}\}$ converges to the optimal dual $\boldsymbol{\lambda}^{*}$ .
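The flavor of such a dual update can be illustrated on a single-step maximum entropy problem with equality constraints only. The sketch below is a simplified stand-in, not the causal recursion of Theorem 1: it fits a Gibbs distribution $p_{\boldsymbol{\lambda}}(x)\propto e^{\boldsymbol{\lambda}^{T}\mathbf{f}(x)}$ by moving $\boldsymbol{\lambda}$ along the gap between target and model moments. The fixed step size, the sign convention, and all names are assumptions.

```python
import numpy as np

def gibbs(features, lam):
    """p_lambda(x) proportional to exp(lambda . f(x)); features has shape (n_outcomes, n_features)."""
    logits = features @ lam
    p = np.exp(logits - logits.max())
    return p / p.sum()

def fit_gibbs(features, c_f, eta=0.5, iters=2000):
    """Moment matching by gradient steps on the dual variable (equality constraints only)."""
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        model_moments = gibbs(features, lam) @ features
        lam += eta * (c_f - model_moments)   # step along the moment mismatch
    return lam, gibbs(features, lam)
```

In this simplified setting the update stops changing $\boldsymbol{\lambda}$ exactly when the model moments match the observed moments, which mirrors the role played by $\mathbf{c}_{f}$ above.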

Appendix B Proof of Theorem 4

The machine optimization problem (5) can be shown to be concave in $Q$ thus one can directly solve it via first-order optimality conditions. Considering the variables to be $\{Q(m_{t}|h^{t-1},m^{t-1}),t=1,2,\ldots T,h^{t-1}\in\mathcal{H}^{t-1},m^{t}\in\mathcal{M}^{t}\}$ , the Lagrangian associated with Problem (5) can be written as:

[TABLE]

where $\beta(h^{t-1},m^{t-1})$ for $t=1,\ldots,T$ denote dual variables associated with the respective normalization constraints $\sum_{m_{t}}Q(m_{t}|h^{t-1},m^{t-1})=1$ . By differentiating the Lagrangian we have

[TABLE]

where $\mathbb{H}_{\hat{P}Q}(M^{T}\|H^{T}|h^{t-1},m^{t})$ is the further conditioned, causally conditioned entropy, defined as:

[TABLE]

After plugging in $Y_{\gamma}$ as defined recursively in Theorem 4, and setting $\beta(h^{t-1},m^{t-1})=\hat{P}Q(h^{t-1},m^{t-1})+\log Y_{\gamma}(h^{t-1},m^{t-1})\hat{P}Q(h^{t-1},m^{t-1})$ , we can show that $\nabla_{Q(m_{t}|h^{t-1},m^{t-1})}\Lambda(Q,\beta)=0.$ Thus the optimal solution is achieved.

Appendix C Proof of Theorem 5

Before proving Theorem 5, let us first consider a simpler case where only decomposable features are included in Problem (4). The following lemma, a corollary to Theorem 1, characterizes a case where the complexity of the solution is polynomial in $T$ .

Lemma 1

Suppose the machine’s policy is given by a (possibly time-inhomogeneous) one-step Markov process, i.e., $Q(m_{t}|h^{t-1},m^{t-1})=Q(m_{t}|h_{t-1},m_{t-1}),~{}\forall t,m^{t-1},h^{t-1}$ , and all feature functions are decomposable, i.e., $\mathbf{f}(h^{T},m^{T})=\sum_{t=1}^{T}\mathbf{f}_{t}(h_{t},m_{t})$ , and $\mathbf{g}(h^{T},m^{T})=\sum_{t=1}^{T}\mathbf{g}_{t}(h_{t},m_{t})$ . Then the solution to the human estimation problem is given by the following procedure for a given dual $\boldsymbol{\lambda}=(\boldsymbol{\lambda}_{f},\boldsymbol{\lambda}_{g})$ :

[TABLE]

where

[TABLE]

Moreover, both the space and time complexity of establishing the distributional model are $O(T|\mathcal{H}||\mathcal{M}|)$ . The complexity of carrying out each dual update is $O(T(|\mathcal{F}|+|\mathcal{G}|)|\mathcal{H}|^{2}|\mathcal{M}|^{2})$ .

Proof:

We prove that, under these assumptions, $Z_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t})$ in Theorem 1 is given by:

[TABLE]

where $Z_{\boldsymbol{\lambda}}(m_{t+1})$ is given as in Lemma 1.

The above equation implies that:

[TABLE]

and the Markov property follows.

For $t=T$ , the identity holds true trivially. Now suppose it is true for $t+1$ . Then according to Theorem 1, for $t<T$

[TABLE]

Then when we compute the ratio $\frac{Z_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t})}{Z_{\boldsymbol{\lambda}}(h^{t-1},m^{t})}$ the term $e^{(\boldsymbol{\lambda}_{f})^{T}\sum_{\tau=1}^{t-1}\mathbf{f}_{\tau}(h_{\tau},m_{\tau})+(\boldsymbol{\lambda}_{g})^{T}\sum_{\tau=1}^{t-1}\mathbf{g}_{\tau}(h_{\tau},m_{\tau})}$ cancels out.

For the complexity, it is easy to see that in total we need to compute $T|\mathcal{H}||\mathcal{M}|$ probabilities. Thus the space complexity is $O(T|\mathcal{H}||\mathcal{M}|)$ . If a vector inner product is viewed as a basic operation, then computing each $Z_{\bm{\lambda}}(h_{t}|m_{t})$ involves the sum of at most three vector inner products and the evaluation of an exponential. Therefore, the time complexity involved in establishing the distributional model is also $O(T|\mathcal{H}||\mathcal{M}|)$ .

When computing the expectation of the feature functions, note that since all feature functions are decomposable, for all $i$ :

[TABLE]

And

[TABLE]

Suppose we already obtained $P_{\bm{\lambda}}Q(m_{t-1})$ , then

[TABLE]

Note that the marginal distribution of $m_{1}$ is given by $P_{\bm{\lambda}}Q(m_{1})=Q(m_{1})$ , which is already available. Thus we can compute $E_{P_{\bm{\lambda}}Q}[f^{i}_{t}(H_{t},M_{t})]$ from $t=1$ to $t=T$ and store $P_{\bm{\lambda}}Q(m_{t}),\forall 1<t\leq T,m_{t}\in\mathcal{M}$ . It then follows that computing $E_{P_{\bm{\lambda}}Q}[f^{i}_{t}(H_{t},M_{t})]$ involves $|\mathcal{H}|^{2}|\mathcal{M}|^{2}$ operations, and computing $E_{P_{\bm{\lambda}}Q}[f^{i}(H^{T},M^{T})]$ is of complexity $O(T|\mathcal{H}|^{2}|\mathcal{M}|^{2})$ . Each dual update involves evaluation of $E_{P_{\bm{\lambda}}Q}[f^{i}(H^{T},M^{T})],~{}\forall~{}i\in\mathcal{F}$ , and $E_{P_{\bm{\lambda}}Q}[g^{i}(H^{T},M^{T})],~{}\forall~{}i\in\mathcal{G}$ , and thus has complexity $O(T(|\mathcal{F}|+|\mathcal{G}|)|\mathcal{H}|^{2}|\mathcal{M}|^{2})$ . $\square$
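The forward computation described in this proof can be sketched as follows, assuming that the human model takes the form $P_{\bm{\lambda}}(h_{t}|m_{t})$ suggested by the notation $Z_{\bm{\lambda}}(h_{t}|m_{t})$ and that the machine policy is one-step Markov. The array layouts and the function name are hypothetical, and the operation count of this particular sketch is not claimed to match the bound in Lemma 1.

```python
import numpy as np

def decomposable_moment(P, Q1, Q, f):
    """E[sum_t f_t(H_t, M_t)] by propagating the marginal of m_t forward in time.

    P[t, m, h]:            assumed human model P(h_t = h | m_t = m)
    Q1[m]:                 initial machine distribution Q(m_1)
    Q[t, h_prev, m_prev, m]: machine policy for the step after index t
    f[t, h, m]:            per-step feature values
    """
    T = P.shape[0]
    marg_m = Q1.copy()
    total = 0.0
    for t in range(T):
        joint = marg_m[:, None] * P[t]          # joint[m, h] at the current step
        total += (joint * f[t].T).sum()          # f[t].T is indexed [m, h]
        if t + 1 < T:
            # Marginal of the next machine action given the current joint.
            marg_m = np.einsum("mh,hmn->n", joint, Q[t])
    return total
```

Only one marginal over $\mathcal{M}$ is kept in memory at a time, which is the reason the moments of decomposable features avoid enumeration over complete interaction paths.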

Now let us assume that the equality and inequality constraint sets can be each partitioned into two subsets: $\mathcal{F}=\mathcal{F}_{p}\cup\mathcal{F}_{d}$ , and $\mathcal{G}=\mathcal{G}_{p}\cup\mathcal{G}_{d}$ , where $\mathcal{F}_{d}$ and $\mathcal{G}_{d}$ correspond to the decomposable features and $\mathcal{F}_{p}$ and $\mathcal{G}_{p}$ correspond to the path-based features. Moreover, the path-based features are:

[TABLE]

while decomposable features are:

[TABLE]

Also, the reward function is given by its path-based part $r^{i,p}(h^{T},m^{T})=c_{i}\mathbf{1}_{\{(h^{T},m^{T})=(\bar{h}^{i,T},\bar{m}^{i,T})\}},~{}~{}i\in\mathcal{R}_{p}$ , together with a decomposable part $r^{d}(h^{T},m^{T})=\sum_{t=1}^{T}r^{d}_{t}(h_{t},m_{t})$ , giving

[TABLE]

First let us consider the human estimation problem. Note that if the conditioning sequence is not a prefix of any path-based feature function (including functions in $\mathcal{R}_{p}$ ), the backward recursion in Theorem 1 is equivalent to the case where we only have decomposable feature functions.

Without loss of generality, consider decomposable feature functions given by:

[TABLE]

and

[TABLE]

We shall let $\boldsymbol{\lambda}_{f}^{d}$ be the dual variable corresponding to the decomposable equality constraints, $\boldsymbol{\lambda}_{g}^{d}$ the one corresponding to the decomposable inequality constraints, and $\lambda_{r}$ the one corresponding to the reward function. Note that when we establish the distributional model, functions in $\mathcal{R}_{p}$ together with $r^{d}(h^{T},m^{T})$ can be regarded as individual ‘feature’ functions, which share the same dual variable $\lambda_{r}$ . It follows from Lemma 1 that if $(h^{t},m^{t})\neq(\bar{h}^{i,t},\bar{m}^{i,t}),~{}\forall i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}\cup\mathcal{R}_{p}$ , we have

[TABLE]

where $Z_{\boldsymbol{\lambda}}(h_{t}|m_{t})$ is given by the recursion specified in Lemma 1, with $r^{d}(h^{T},m^{T})$ as a feature function. Let us denote the set of machine actions at time $t$ that stay on the support of at least one path-based feature function by $\mathcal{M}_{t}^{p}(h^{t-1},m^{t-1})=\{m_{t}|\exists i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}\cup\mathcal{R}_{p}~{}s.t.~{}h^{t-1}=\bar{h}^{i,t-1},m^{t}=\bar{m}^{i,t}\}$ and the analogous set of human actions by $\mathcal{H}_{t}^{p}(h^{t-1},m^{t})=\{h_{t}|\exists i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}\cup\mathcal{R}_{p}~{}s.t.~{}h^{t}=\bar{h}^{i,t},m^{t}=\bar{m}^{i,t}\}$ . For an on-support prefix $(\bar{h}^{i,t},\bar{m}^{i,t})$ , the backward recursion in Theorem 1 becomes the following:

[TABLE]

From the result of Lemma 1,

[TABLE]

Note that $B$ solely depends on the result of the case where there are only decomposable features. The additional complexity introduced is in the computation of $A$ , which is determined by the number of nonzero path-based features after current step. The key insight is that we only need to track $A$ for a prefix where there is at least one nonzero path-based feature function, and the set of possible choices of such prefixes forms a tree where the number of leaf nodes is at most $|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|$ . Then at each $t$ , we need to compute $A$ for at most $|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|$ conditioning prefixes. Thus the complexity of obtaining the whole distributional model is $O((|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)T|\mathcal{H}||\mathcal{M}|)$ .
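The prefix-tracking idea can be illustrated with a short sketch that collects, for each $t$, the distinct prefixes $(h^{t},m^{t})$ lying on the support of at least one path-based function. The function name and the representation of paths as pairs of tuples are assumptions.

```python
def on_support_prefixes(paths, T):
    """For each t = 1..T, the set of distinct prefixes (h^t, m^t) shared with some path.

    paths: list of (h_bar, m_bar) pairs, one per path-based function.
    """
    return [{(tuple(h_bar[:t]), tuple(m_bar[:t])) for h_bar, m_bar in paths}
            for t in range(1, T + 1)]

if __name__ == "__main__":
    paths = [((1, 1, 1), (2, 2, 2)), ((1, 0, 0), (2, 1, 1))]
    for t, prefixes in enumerate(on_support_prefixes(paths, 3), start=1):
        print(t, len(prefixes))
```

At every $t$ the number of such prefixes is at most the number of path-based functions, since each path contributes exactly one prefix of each length and shared prefixes are merged.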

When computing the mean of a feature function $E_{P_{\boldsymbol{\lambda}}Q}[f^{i}(H^{T},M^{T})]$ , we have two different cases:

1. If $f^{i}(h^{T},m^{T})$ is a path-based feature with support $(\bar{h}^{i,T},\bar{m}^{i,T})$ , then $E_{P_{\boldsymbol{\lambda}}Q}[f^{i}(H^{T},M^{T})]=c_{i}P_{\boldsymbol{\lambda}}Q(\bar{h}^{i,T},\bar{m}^{i,T})=c_{i}\prod_{t=1}^{T}Q(\bar{m}^{i}_{t}|\bar{h}^{i,t-1},\bar{m}^{i,t-1})P_{\boldsymbol{\lambda}}(\bar{h}^{i}_{t}|\bar{h}^{i,t-1},\bar{m}^{i,t})$ . This requires $O(T)$ multiplications.

2. If $f^{i}(h^{T},m^{T})$ is a decomposable feature, then the associated moment can be written as

[TABLE]

Let us define a stopping time $T_{D}$ w.r.t. $(H^{T},M^{T})$ such that $T_{D}:=\min~{}\{t~{}|~{}1\leq t\leq T,(H^{t},M^{t})\neq(\bar{h}^{i,t},\bar{m}^{i,t}),~{}\forall i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}\cup\mathcal{R}_{p}\}$ . That is, $T_{D}$ is the first time when the realization of interaction deviates from supports of all path-based feature functions, including the path-based component of the reward. Then based on the value of $T_{D}$ , we can partition $E_{P_{\boldsymbol{\lambda}}Q}[f^{i}_{t}(H_{t},M_{t})]$ as follows:

[TABLE]

After deviating from all supports, i.e., when $T_{D}\leq t$ , the distribution is the same as in the case where only decomposable feature functions have been included. Thus $E_{P_{\boldsymbol{\lambda}}Q}[f^{i}_{t}(H_{t},M_{t})|T_{D}\leq t]$ can be easily obtained within $O(T|\mathcal{H}||\mathcal{M}|)$ computations, by taking advantage of the one-step Markov property. Also, the distribution of the stopping time $T_{D}$ can be computed as:

[TABLE]

This requires at most $T(|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)$ computations. The same computational complexity is expected when computing $P_{\boldsymbol{\lambda}}Q(T_{D}>t)$ . For $E_{P_{\boldsymbol{\lambda}}Q}[f^{i}_{t}(H_{t},M_{t})|T_{D}>t]$ , we have:

[TABLE]

It is easy to see that this also requires $O(T(|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|))$ computations. Hence the computational complexity of computing the sum over $t$ is $O(T^{2}(|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|))$ .
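The stopping time $T_{D}$ used in this argument can be computed for a given realization as sketched below. The function name is hypothetical, and returning $T+1$ when the realization never leaves the union of supports is a convention assumed here, since the definition takes a minimum over a set that is empty in that case.

```python
def deviation_time(h, m, paths):
    """First t at which (h^t, m^t) differs from the length-t prefix of every path.

    paths: list of (h_bar, m_bar) pairs, one per path-based function.
    Returns len(h) + 1 if the realization coincides with some path throughout
    (an assumed convention for the undefined case).
    """
    T = len(h)
    for t in range(1, T + 1):
        on_support = any(
            tuple(h[:t]) == tuple(h_bar[:t]) and tuple(m[:t]) == tuple(m_bar[:t])
            for h_bar, m_bar in paths
        )
        if not on_support:
            return t
    return T + 1
```

Once $T_{D}\leq t$, the realization can never return to any support, which is why the remaining steps can be handled with the decomposable-only recursion.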

Exactly the same complexity is obtained when computing the moments of functions in $\mathcal{G}$ , $\mathcal{R}_{p}$ and $r^{d}(h^{T},m^{T})$ . The time complexity of one dual update is then given by the maximum of the two cases and the time to establish the distributional model, and is thus $O(T(|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)\max(T,|\mathcal{H}||\mathcal{M}|))$ .

For the machine optimization problem, we do not need to carry out the dual update, since $\gamma$ is fixed throughout the iterations. Therefore we only need to establish the distributional model $\hat{Q}$. By viewing the path-based part of the reward function as the path-based ‘feature’ in the machine optimization problem, we conclude that both the space and the time complexity of obtaining the machine’s policy $\hat{Q}$ are $O((|\mathcal{F}_{p}|+|\mathcal{G}_{p}|+|\mathcal{R}_{p}|)T|\mathcal{H}||\mathcal{M}|)$. Moreover, suppose the initial machine policy is one-step Markov after deviating from the union of the supports of all path-based feature functions, i.e., $\hat{Q}^{(0)}(m_{t}|m^{t-1},h^{t-1})=\hat{Q}^{(0)}(m_{t}|m_{t-1},h_{t-1})$ when $(m^{t-1},h^{t-1})\neq(\bar{m}_{i}^{t-1},\bar{h}_{i}^{t-1}),\ \forall i\in\mathcal{F}_{p}\cup\mathcal{G}_{p}\cup\mathcal{R}_{p}$. Then all the assumptions introduced in Theorem 5 are satisfied throughout the AREA iterations. A uniformly random $\hat{Q}^{(0)}$ is a special case satisfying this condition.

Therefore the complexity of the AREA algorithm is polynomial in $T$ as long as the number of dual updates in the human estimation problem is limited.

Appendix D Convergence of AREA under sufficient statistics

An implication of Theorem 1 is that, under our maximum entropy framework, estimated human models will be of the form given in the theorem for a given value of $\boldsymbol{\lambda}$ . We refer to such distributions as causally conditioned Gibbs distributions formally defined as follows:

Definition 5

Given the set of constraints $\mathcal{F}$ , $\mathcal{G}$ and underlying machine policy $Q(m^{T}\|h^{T})$ , we define the associated causally conditioned Gibbs distributions as

[TABLE]

where $P_{\boldsymbol{\lambda}}(h_{t}|h^{t-1},m^{t})$ is as given in Theorem 1. That is, each element in $\mathcal{P}_{g}(Q,\mathcal{F},\mathcal{G})$ is a causally conditioned distribution of the form given in Theorem 1 for a $\boldsymbol{\lambda}=(\boldsymbol{\lambda}_{f},\boldsymbol{\lambda}_{g})$ .

Remark: According to the Hammersley-Clifford theorem in [24], if the human behavior $P^{*}(h^{T}\|m^{T})$ has full support, i.e., there is no $(h^{T},m^{T})\in\mathcal{H}^{T}\times\mathcal{M}^{T}$ such that $P^{*}(h^{T}\|m^{T})=0$, and the machine’s policy $Q(m^{T}\|h^{T})$ also has full support, then there exists a pair of finite sets of constraints $\mathcal{F}^{*}$ and $\mathcal{G}^{*}$ such that the true human behavior is in the associated set of causally conditioned Gibbs distributions, i.e., $P^{*}(h^{T}\|m^{T})\in\mathcal{P}_{g}(Q,\mathcal{F}^{*},\mathcal{G}^{*})$.

Then if the features in human estimation problem are rich enough, the following theorem captures the convergence of AREA.

Theorem 8

If the feature sets $\mathcal{F}$ and $\mathcal{G}$ and initial machine policy $\hat{Q}^{(0)}$ are such that,

[TABLE]

then the AREA algorithm converges after the first iteration.

Proof:

When $P^{*}(h^{T}\|m^{T})\in\mathcal{P}_{g}(\hat{Q}^{(0)},\mathcal{F},\mathcal{G})$, $P^{*}(h^{T}\|m^{T})$ can be parameterized by some $\boldsymbol{\lambda}^{*}:=(\boldsymbol{\lambda}_{f}^{*},\boldsymbol{\lambda}_{g}^{*})$ and is the solution to the human estimation problem based on the data produced under machine policy $\hat{Q}^{(0)}$. Now, given $\hat{P}^{(0)}=P^{*}$, the machine optimization problem generates $\hat{Q}^{(1)}=m^{*}(m(P^{*}))$. However, since $P^{*}(h^{T}\|m^{T})\in\mathcal{P}_{g}(m^{*}(m(P^{*})),\mathcal{F},\mathcal{G})$, we can again ensure that $\hat{P}^{(1)}=P^{*}$. By induction, $\hat{P}^{(n)}=P^{*}$ for all $n\geq 0$ and $\hat{Q}^{(n)}=m^{*}(m(P^{*}))$ for all $n\geq 1$. Thus the AREA iterations converge after the first iteration. $\square$

Appendix E Proof of Theorem 6

At the $n$th iteration, when matching the moments of path-based features, we have:

[TABLE]

After cancelling out $Q^{(n-1)}(\bar{m}^{i,T}\|\bar{h}^{i,T})$ on both sides we have

[TABLE]

If the feature set in Problem (4) also includes $\mathcal{F}_{p}^{i}$ , for all $i\in\mathcal{F}_{p}$ , by a similar argument we have that for all $i\in\mathcal{F}_{p}$ and $t=1,2,\ldots,T$ ,

[TABLE]

Thus $\hat{P}^{(n)}(\bar{h}_{i,t}|\bar{h}^{i,t-1},\bar{m}^{i,t})=P^{*}(\bar{h}_{i,t}|\bar{h}^{i,t-1},\bar{m}^{i,t})$ for all $i\in\mathcal{F}_{p}$ and $1\leq t\leq T$ .

When the reward function is a path-based function, a straightforward observation from Theorem 1 is that, the resulting machine policy $\hat{Q}(m_{t}|h^{t-1},m^{t-1})$ is uniformly random if $(h^{t-1},m^{t-1})\neq(\bar{h}^{r,t-1},\bar{m}^{r,t-1})$ . The machine policy along the support of the reward function is induced by:

[TABLE]

where $Y_{\gamma}^{t}:=Y_{\gamma}(h^{t},m^{t})$ for $(h^{t},m^{t})\neq(\bar{h}^{r,t},\bar{m}^{r,t})$. The sequence of interactions can be suppressed because, by Theorem 1, after leaving the ‘profitable’ path all $Y_{\gamma}$ are the same, independent of the corresponding $P(h^{T}\|m^{T})$. From Eq. (20) we can prove by induction that $Y_{\gamma}(\bar{m}_{r,t}|\bar{h}^{r,t-1},\bar{m}^{r,t-1})$ does not change after the first iteration. Thus the resulting sequence of machine policies $\{\hat{Q}^{(n)}\}$ converges after the first iteration.

Appendix F Proof of Theorem 7

The solution to the $n$th human estimation step can be written as $\hat{P}^{(n)}=h^{*}(\hat{Q}^{(n)},\mathbf{c}_{f}(\hat{Q}^{(n)}),\mathbf{c}_{g}(\hat{Q}^{(n)},\hat{P}^{(n-1)}))$. Indeed, $\mathbf{c}_{f}(\hat{Q}^{(n)})=E_{P^{*}\hat{Q}^{(n)}}[\mathbf{f}(H^{T},M^{T})]$ depends on the true human behavior $P^{*}$, the feature set $\mathcal{F}$, and the machine policy in use $\hat{Q}^{(n)}$. However, throughout the AREA iterations, $P^{*}$ and $\mathcal{F}$ are fixed, so for simplicity we write $\mathbf{c}_{f}$ as a function of $\hat{Q}^{(n)}$ only. Similarly, we write $\mathbf{c}_{g}$ as a function of $\hat{Q}^{(n)}$ and $\hat{P}^{(n-1)}$, where the only dependency on $\hat{P}^{(n-1)}$ is through the step-dependent feature $c_{g}^{0,(n)}$ we have introduced. Moreover, a direct consequence of Lemma 2 in [14] shows that $\mathbf{c}_{g}(\hat{Q}^{(n)},\hat{P}^{(n-1)})$ is actually a function of $\mathbf{Y}_{\gamma}^{(n)}$, which is the $\mathbf{Y}_{\gamma}$ associated with $\hat{Q}^{(n)}$ as defined in Theorem 4.

Lemma 2

During the $n$ th iteration of AREA, let us denote the $\mathbf{Y}_{\gamma}$ in the machine optimization problem by $\mathbf{Y}_{\gamma}^{(n)}$ . Then

[TABLE]

Proof:

This is a special case of Lemma 2 in [14]. Plugging in the recursive form defined in Theorem 4 yields the claim. $\square$

Therefore $\hat{P}^{(n)}$ is actually a function of $\mathbf{Y}_{\gamma}^{(n)}$ , because $\hat{Q}^{(n)}$ is naturally a function of $\mathbf{Y}_{\gamma}^{(n)}$ by Theorem 4, and $\mathbf{c}_{g}$ is independent of $\hat{P}^{(n-1)}$ given $\mathbf{Y}_{\gamma}^{(n)}$ by Lemma 2:

[TABLE]

In order to show convergence, it will be easier to study it in terms of the underlying variables $\mathbf{Y}_{\gamma}^{(n)}$ . In the sequel when there is no ambiguity we will denote it by $\hat{P}^{(n)}=h^{*}(\mathbf{Y}_{\gamma}^{(n)})$ .

Let us define the following function of $Y_{\gamma}$ :

[TABLE]

and $L^{(n)}=L(\mathbf{Y}_{\gamma}^{(n)})$ . Now we are ready to prove Theorem 7.

Proof:

In order to show the convergence of $\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ , we define the following functions of $\mathbf{Y}_{\gamma}$ :

1. $c(\mathbf{Y}_{\gamma}|\mathbf{Y}_{\gamma}^{\prime})$ is the objective function of the machine’s optimization problem, where $\mathbf{Y}_{\gamma}$ and $\mathbf{Y}_{\gamma}^{\prime}$ are as defined in Theorem 4 and are associated with $Q(m^{T}\|h^{T})$ and the previous machine policy $Q^{\prime}(m^{T}\|h^{T})$,

[TABLE]

2. $L(\mathbf{Y}_{\gamma})$ can be written as

[TABLE]

During the AREA algorithm there are two possible cases: (1) $\hat{Q}^{(n+1)}=\hat{Q}^{(n)}$, and (2) $\hat{Q}^{(n+1)}\neq\hat{Q}^{(n)}$. In case (1), $\hat{Q}^{(m)}$ is clearly the same as $\hat{Q}^{(n)}$ for all $m\geq n$. In case (2), we can show convergence by proving the strict monotonicity of $\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ as follows.

[TABLE]

Here Eq. (25) follows from the optimality of the solution to the machine’s optimization (5), and Eq. (26) follows by the definition of $L(\mathbf{Y}_{\gamma}^{(n)})$ . Thus we only need to show Eq. (24). Based on the definitions of the associated quantities, we have for all feasible $\mathbf{Y}_{\gamma}^{(n+1)}$ :

[TABLE]

The inequality holds true because in the human estimation problem, we introduced the constraint $E_{P\hat{Q}^{(n)}}[g^{0,(n)}(H^{T},M^{T})]\geq c_{g}^{0,(n)}$ . Also, due to the boundedness of both $\mathbb{H}_{h^{*}(\mathbf{Y}_{\gamma})m^{*}(\mathbf{Y}_{\gamma})}(M^{T}\|H^{T})$ and the expected reward function, $L(\mathbf{Y}_{\gamma})$ is also upper bounded. Therefore, the sequence generated by AREA recursion $\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ converges monotonically. $\square$

An interesting observation is that $\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ converges to a value associated with a fixed point of the AREA iterations.

Theorem 9

$\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ converges to $L^{\infty}$, and there exists a $\mathbf{Y}_{\gamma}^{\infty}$ such that $L(\mathbf{Y}_{\gamma}^{\infty})=L^{\infty}$, and $\mathbf{Y}_{\gamma}^{\infty}$ is a fixed point of AREA iterations, i.e., $m^{*}(m(h^{*}(\mathbf{Y}_{\gamma}^{\infty})))=m^{*}(\mathbf{Y}_{\gamma}^{\infty})$.

Proof:

Suppose the AREA algorithm stops as soon as we observe $\hat{Q}^{(n+1)}=\hat{Q}^{(n)}$ and otherwise proceeds to the next iteration. Then throughout the iterations of AREA (except for the last step, when we stop), the machine optimization problem is strongly concave and thus attains a unique maximum at the $(n+1)$st step, namely $\hat{Q}^{(n+1)}\neq\hat{Q}^{(n)}$. Therefore Eq. (25) holds strictly, and $L(\mathbf{Y}_{\gamma}^{(n+1)})>L(\mathbf{Y}_{\gamma}^{(n)})$ in case (2). Following [25], we define the solution set as the set of $\mathbf{Y}_{\gamma}$ such that $m^{*}(m(h^{*}(\mathbf{Y}_{\gamma})))=m^{*}(\mathbf{Y}_{\gamma})$, i.e., the set of fixed points of the AREA iterations. Corollary 1-1 in [25] then shows that one of the following statements is true:

1. The iteration stops in finitely many steps. This corresponds to the case where $\hat{Q}^{(n+1)}=\hat{Q}^{(n)}$ for some $n$. Thus $\hat{Q}^{(m)}=\hat{Q}^{(n)}$ for all $m>n$, implying that $\{\hat{Q}^{(n)}\}$ converges.

2. The iteration does not stop. Then according to Corollary 1-1 in [25], any convergent subsequence of $\{\mathbf{Y}_{\gamma}^{(n)}\}$, say $\{\hat{\mathbf{Y}}_{\gamma}^{(k)}:k\in\mathcal{K}_{j}\subseteq\mathbb{Z}^{+}\}$, converges to an accumulation point $\hat{\mathbf{Y}}_{\gamma}^{(\infty),j}$ as $k\rightarrow\infty$, such that $\hat{\mathbf{Y}}_{\gamma}^{(\infty),j}$ is within the solution set.

Therefore, due to the convergence of $\{L(\mathbf{Y}_{\gamma}^{(n)})\}$ , all the accumulation points of $\{\mathbf{Y}_{\gamma}^{(n)}\}$ have the same value of $L(\mathbf{Y}_{\gamma})$ function, and are fixed points of AREA iterations. $\square$

Appendix G One important special case: Decomposable Features

In this section we discuss AREA under a special family of features. Specifically, we will derive performance guarantees for the case where the solution has a special structure.

From now on we shall make the following assumption.

Assumption 1

Reward function $r(h^{T},m^{T})$ is also used as a feature function in the estimation phase. Also, $\forall i\in\mathcal{F}$ , $f^{i}(h^{T},m^{T})$ is decomposable, including the reward function $r(h^{T},m^{T})$ , and $\forall i\in\mathcal{G}$ , $g^{i}(h^{T},m^{T})$ is also decomposable.

Lemma 3

Under Assumption 1, the solution to the machine’s optimization phase has no dependency across time $t$ . That is, at the $n$ th iteration:

[TABLE]

Moreover,

[TABLE]

Note that under such assumptions, $E_{\hat{P}^{(n-1)}\hat{Q}^{(n)}}[r(H_{t},m_{t})]$ only depends on $\hat{P}^{(n-1)}$ .

Proof:

This can be proved in a manner similar to Lemma 1. Specifically, we can show that when Assumption 1 is true,

[TABLE]

where $Y_{\gamma,t}:=\sum_{m_{t}}e^{\gamma\sum_{h_{t}}\hat{P}^{(n-1)}(h_{t}|m_{t})r_{t}(h_{t},m_{t})}$ .

The Markov property of $\hat{P}^{(n-1)}$ follows from Lemma 1. The identity holds trivially when $t=T$ and can be proved by induction for the other cases. $\square$
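To make the normalizer $Y_{\gamma,t}$ from the proof above concrete, the sketch below evaluates it for a single time step and also returns the weights $e^{\gamma\sum_{h_{t}}\hat{P}^{(n-1)}(h_{t}|m_{t})r_{t}(h_{t},m_{t})}/Y_{\gamma,t}$. Reading those normalized weights as the per-step machine policy is an assumption of this example, suggested by the form of $Y_{\gamma,t}$ but not restated from the displayed equation; the function name and the nested-list layout `p_hat[m][h]`, `reward[m][h]` are likewise hypothetical.

```python
import math
from typing import List, Tuple


def gibbs_policy_step(p_hat: List[List[float]],
                      reward: List[List[float]],
                      gamma: float) -> Tuple[float, List[float]]:
    """Compute Y_{gamma,t} = sum_m exp(gamma * sum_h p_hat[m][h] * reward[m][h])
    and the normalized weights exp(gamma * mean_reward[m]) / Y_{gamma,t}.

    p_hat[m][h]  : estimated P(h | m) at this step (assumed layout)
    reward[m][h] : r_t(h, m) at this step (assumed layout)
    """
    mean_reward = [sum(p * r for p, r in zip(p_row, r_row))
                   for p_row, r_row in zip(p_hat, reward)]
    # Subtract the largest exponent before exponentiating for numerical stability.
    shift = max(gamma * x for x in mean_reward)
    weights = [math.exp(gamma * x - shift) for x in mean_reward]
    z = sum(weights)
    y_gamma = z * math.exp(shift)
    policy = [w / z for w in weights]
    return y_gamma, policy


# Two machine actions, two human responses (illustrative numbers only).
y, q = gibbs_policy_step([[1.0, 0.0], [1.0, 0.0]],
                         [[0.0, 0.0], [math.log(3.0), 0.0]],
                         gamma=1.0)
print(y, q)
```

A larger $\gamma$ concentrates the weights on the action with the highest estimated mean reward, while $\gamma=0$ gives uniform weights.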

Suppose our task is to find a machine’s policy associated with a $\mathbf{Y}_{\gamma}$ to maximize $L(\mathbf{Y}_{\gamma})$ defined in Eq. (21). In general, such an objective function is not well-defined in $Q$ because $Y_{\gamma}$ is not a function of $Q$ . However, when Assumption 1 takes effect, the causally conditional entropy is not dependent on $\hat{P}^{(n-1)}$ :

[TABLE]

where $E_{\hat{P}^{(n-1)}Q}[-\log Q_{t}(M_{t})]$ actually does not depend on $\hat{P}^{(n-1)}$ , and we always have

$E_{h^{*}(\mathbf{Y}_{\gamma})Q}[r(H^{T},M^{T})]=E_{P^{*}Q}[r(H^{T},M^{T})]$ . In the sequel when Assumption 1 is true we will use the notation $\mathbb{H}_{Q}(M^{T}\|H^{T})$ where $P$ is suppressed. Then $L(\mathbf{Y}_{\gamma})$ is actually a function of $Q$ , where $Q=m^{*}(\mathbf{Y}_{\gamma})$ as it can be written as

[TABLE]

We can still show the strict monotonicity of $\{L(\hat{Q}^{(n)})\}$.

Moreover, we can show that this objective function is indeed concave.

Theorem 10

When Assumption 1 is true, $L(Q)$ is strongly concave with parameter $|\mathcal{M}|^{T}$ in $Q(m^{T}\|h^{T})$ .

Proof:

It is easy to observe that $E_{P^{*}Q}[r(H^{T},M^{T})]$ is affine in $Q$. We already know that the causally conditional entropy term is strongly concave in $Q$ when $\hat{P}^{(n-1)}$ is fixed. When Assumption 1 is true, the causally conditional entropy term is independent of $\hat{P}^{(n-1)}$, and is therefore a strongly concave function of $Q$. Adding an affine term preserves strong concavity, so $L(Q)$ is strongly concave. $\square$

Theorem 11

When Assumption 1 is true, $\{L^{(n)}\}:=\{L(\hat{Q}^{(n)})\}$ converges to some limit $L^{\infty}$ . If $Q^{*}$ is the global maximizer of

[TABLE]

then

[TABLE]

Proof:

First, according to Theorem 9, $L^{\infty}$ must be $L(Y^{\infty}_{\gamma})$ where $Y^{\infty}_{\gamma}$ is a fixed point.

The only difference between Eq. (5) and Eq. (28) is that in Eq. (28) the mean reward is induced by $h(Q)$, which is a function of $Q$, whereas in Eq. (5) it is induced by $\hat{P}$, which is fixed. The gradient of $L(Q)$ is given by:

[TABLE]

Here we suppress the human model $\hat{P}$ in the entropy term because the entropy is independent of the human model.

For Eq. (5) at a fixed point, we have:

[TABLE]

Thus at the fixed point, i.e. when $Q=Q^{\infty}$ ,

[TABLE]

Also, from the moment-matching constraint and Assumption 1 we know $E_{h^{*}(Q)Q}[r(H^{T},M^{T})]=E_{P^{*}Q}[r(H^{T},M^{T})]$ . Thus we have

[TABLE]

which is also the gradient of $L(Q)$ at the fixed point.

Then according to the strong concavity, we have

[TABLE]

$\square$

Appendix H Numerical evaluation set-up

H.1 Leaky, Competing Accumulator

In the simulation set-up we use a discrete-time version of the original continuous-time model devised in [23]. The Leaky, Competing Accumulator model consists of a set of accumulators $X_{t}(h)$ for $h\in\mathcal{H}$ at time $t$, representing the tendency to pick $h$. The evolution of $X_{t}(h)$ is driven by the following parameters: (1) a self-decay coefficient $\alpha$, capturing the forgetting effect of human memory; (2) an inhibitory coefficient $\beta$, capturing the negative impact of the belief in one option on the others; (3) the intensity/strength of the external stimuli, $\rho$, modeling the increment an external stimulus brings to the associated accumulator; and (4) the noise power $\sigma^{2}$, modeling the randomness in human decisions. At each time $t$, the recursion of the accumulators is given by

[TABLE]

where $S_{t}$ stands for the external stimulus at time $t$ , and $N_{t,h}$ is an i.i.d. Gaussian noise. Then the human will pick the action with the highest value of $X_{t}(h)$ at $t$ .

In our setting we use $\alpha=0.1,\beta=0.2,\rho=0.4,\sigma^{2}=0.09$, and the accumulators are all initialized at 0, so at the very beginning the human picks responses uniformly at random.
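The following is a minimal simulation sketch of a discrete-time leaky, competing accumulator using the parameter values above. The exact recursion is not restated here, so the update in the code is an assumption: each accumulator decays by a factor $(1-\alpha)$, is inhibited by $\beta$ times the sum of the other accumulators, receives $\rho$ when it matches the stimulus, and is perturbed by Gaussian noise of variance $\sigma^{2}$. Treating the stimulus $S_{t}$ as an index into $\mathcal{H}$ and applying the update before the choice at time $t$ are also assumptions, as is the function name.

```python
import random
from typing import List, Sequence


def simulate_lca(stimuli: Sequence[int], n_actions: int,
                 alpha: float = 0.1, beta: float = 0.2,
                 rho: float = 0.4, sigma2: float = 0.09,
                 seed: int = 0) -> List[int]:
    """Simulate human choices under an ASSUMED discrete-time LCA update:

        X_t(h) = (1 - alpha) * X_{t-1}(h)
                 - beta * sum_{h' != h} X_{t-1}(h')
                 + rho * 1{S_t == h} + N_{t,h},   N_{t,h} ~ Normal(0, sigma2)

    The human picks argmax_h X_t(h); ties are broken uniformly at random.
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    x = [0.0] * n_actions  # accumulators initialized at 0, as in the text
    choices = []
    for s in stimuli:
        total = sum(x)
        x = [
            (1.0 - alpha) * x[h]
            - beta * (total - x[h])
            + (rho if h == s else 0.0)
            + rng.gauss(0.0, sigma)
            for h in range(n_actions)
        ]
        best = max(x)
        choices.append(rng.choice([h for h in range(n_actions) if x[h] == best]))
    return choices


print(simulate_lca([0, 0, 1, 1, 2, 2], n_actions=3))
```

Setting `sigma2=0.0` removes the noise, which makes it easy to check that a repeated stimulus drives its accumulator above the others.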

H.2 Q-learning

The detailed update rule for the $Q$ function is as follows. After picking $m_{t}$ and observing the human’s response $h_{t}$ at time $t$, we apply

[TABLE]

The $\alpha$ in Q-learning is the learning rate and $\delta$ is the discount factor that balances the weight between current and future reward. In our evaluation, we set $\alpha=0.1$ and $\delta=0.8$, which are commonly used values.

In our simulations, Q-learning picks its action according to a softmax of the associated $Q$ function, which means that when it observes the latest interactions $(h_{t-\tau+1},\ldots,h_{t},m_{t-\tau+1},\ldots,m_{t})$ and $t+1$, it picks a response $m\in\mathcal{M}$ with probability

[TABLE]

In our simulation we pick $c=10$ so that Q-learning achieves an average reward comparable to that of AREA after the first 100 samples.
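A minimal sketch of the baseline described in this subsection is given below, using the stated values $\alpha=0.1$, $\delta=0.8$ and $c=10$. It assumes the textbook tabular Q-learning update and a softmax in which $c$ multiplies the $Q$ values; the displayed equations are not restated here, so both forms are assumptions, as are the function names and the use of an arbitrary hashable `state` to stand for the recent interaction window.

```python
import math
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_update(q: QTable, state: Hashable, action: int, reward: float,
             next_state: Hashable, actions: Sequence[int],
             alpha: float = 0.1, delta: float = 0.8) -> None:
    """Assumed standard update:
    Q(s,a) += alpha * (r + delta * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + delta * best_next - q[(state, action)])


def softmax_probs(q: QTable, state: Hashable, actions: Sequence[int],
                  c: float = 10.0) -> List[float]:
    """Assumed selection rule: P(a | s) proportional to exp(c * Q(s, a))."""
    scores = [c * q[(state, a)] for a in actions]
    top = max(scores)  # shift exponents for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def pick_action(q: QTable, state: Hashable, actions: Sequence[int],
                c: float = 10.0, rng: random.Random = random.Random(0)) -> int:
    """Sample one action from the softmax distribution."""
    return rng.choices(list(actions), weights=softmax_probs(q, state, actions, c), k=1)[0]


q: QTable = defaultdict(float)
actions = [0, 1]
q_update(q, "s", 0, 1.0, "s2", actions)
print(softmax_probs(q, "s", actions), pick_action(q, "s", actions))
```

With this convention a larger $c$ makes the selection greedier, while $c=0$ gives uniform exploration.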

Bibliography

[1] Sissie Ling-Ie Hsiao, Chao Cai, Eric W Ewald, Cameron M Tangney, A Walker II Robert, Japjit Tulsi, Ming Lei, and Zhimin He. Conversion path performance measures and reports, January 26 2016. US Patent 9,245,279.
[2] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
[3] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2000.
[4] Nicole Bäuerle and Ulrich Rieder. Markov decision processes with applications to finance. Springer Science & Business Media, 2011.
[5] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
[6] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
[7] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
[8] Bennet B Murdock Jr. The serial position effect of free recall. Journal of experimental psychology, 64(5):482, 1962.