Estimating Rationally Inattentive Utility Functions with Deep Clustering for Framing - Applications in YouTube Engagement Dynamics

William Hoiles; Vikram Krishnamurthy

arXiv:1812.09640·cs.LG·December 27, 2018

TL;DR

This paper introduces a deep learning framework to estimate utility functions and information costs of rationally inattentive agents, applying it to analyze YouTube user commenting behavior and decision-making processes.

Contribution

It develops a novel inverse reinforcement learning method incorporating Rényi divergence to estimate attention strategies and utility functions from behavioral data.

Findings

01. Successfully applied to YouTube data to characterize user commenting behavior.

02. Provides a constructive way to estimate utility and information costs from observed decisions.

03. Demonstrates the importance of framing and attention strategies in behavioral modeling.

Abstract

We consider a framework involving behavioral economics and machine learning. Rationally inattentive Bayesian agents make decisions based on their posterior distribution, utility function and information acquisition cost (Rényi divergence, which generalizes Shannon mutual information). By observing these decisions, how can an observer estimate the utility function and information acquisition cost? Using deep learning, we estimate framing information (essential extrinsic features) that determines the agent's attention strategy. Then we present a preference based inverse reinforcement learning algorithm to test for rational inattention: is the agent a utility maximizer, attention maximizer, and does an information cost function exist that rationalizes the data? The test imposes a Rényi mutual information constraint which impacts how the agent can select attention strategies to maximize…

Equations

$$p(x|s)=\frac{\mu(x)\alpha(s|x)}{\sum_{y\in\mathcal{X}}\mu(y)\alpha(s|y)}$$

$$a^{*}\in\operatorname*{argmax}_{a\in\mathcal{A}}\mathbb{E}\{u(x,a)|s\}=\operatorname*{argmax}_{a\in\mathcal{A}}\Big\{\sum_{x\in\mathcal{X}}p(x|s)u(x,a)\Big\}\quad\forall p(x|s)\in\mathcal{S}(\alpha)$$

$$\alpha^{*}(s|x)\in\operatorname*{argmax}_{\alpha}\Big\{\mathbb{E}_{s\in\mathcal{S}(\alpha)}\Big\{\max_{a\in\mathcal{A}}\Big[\sum_{x\in\mathcal{X}}p(x|s)u(x,a)\Big]\Big\}-C(\mu,\alpha)\Big\}$$

$$\mathcal{D}=\{(x_{t},f_{t},a_{t})\}_{t=1}^{T}$$

$$\hat{\pi}(a|x,f)=\frac{\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x,a_{t}=a,f_{t}=f\}}{\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x,f_{t}=f\}},\qquad\hat{\mu}(x)=\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x\}$$

$$V(\boldsymbol{\pi}(a|x,f))=\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}_{k}}\pi_{k}(a|x,f)\mu(x)u(x,a,f)$$

$$L=\|s-g(f(w(s)+\varepsilon))\|_{2}^{2}+\operatorname{KL}(P||Q)$$

$$q_{tn}=\frac{(1+\|z_{t}-\Psi_{n}\|^{2})^{-1}}{\sum_{n'=1}^{N}(1+\|z_{t}-\Psi_{n'}\|^{2})^{-1}}\quad\forall n\in\{1,\dots,N\}$$

$$p_{tn}=\frac{q_{tn}^{2}/F_{n}}{\sum_{n'=1}^{N}(q_{tn'}^{2}/F_{n'})},\qquad F_{n}=\sum_{t=1}^{T}q_{tn}$$

$$\pi_{k}(a|x,f)=\sum_{s\in\mathcal{S}(\alpha_{k})}\alpha_{k}(s|x,f)\eta_{k}(a|s),\qquad\mathcal{S}(\alpha_{k})=\{p_{k}(x|a,f):a\in\mathcal{A}_{k}\}$$

$$\sum_{x\in\mathcal{X}}p_{k}(x|a,f)[u(x,a,f)-u(x,b,f)]\geq 0\quad\forall a,b\in\mathcal{A}_{k},\ \forall f\in\{1,\dots,N\}$$

$$p_{k}(x|a,f)=\frac{\mu(x)\pi_{k}(a|x,f)}{\sum_{y\in\mathcal{X}}\mu(y)\pi_{k}(a|y,f)}$$

$$\sum_{k=1}^{K}(G_{k,k}-G_{k+1,k})\geq 0$$

$$G_{k,w}=\sum_{s\in\mathcal{S}(\alpha_{k})}\sum_{x\in\mathcal{X}}\mu(x)\alpha_{k}(s|x,f)\max_{b\in\mathcal{A}_{w}}\Big\{\sum_{y\in\mathcal{X}}s(y)u(y,b,f)\Big\}$$

$$\alpha_{k}(s|x,f)=\sum_{a\in\mathcal{A}_{k}}\pi_{k}(a|x,f)\mathbf{1}\{p_{k}(x|a,f)=s\},\quad\text{with }\mathcal{A}_{K+1}=\mathcal{A}_{1}$$

$$\mathcal{L}(u(x,a,f))\ \text{for}\ f\in\{1,2,\dots,N\}$$

$$G_{k,k}-C(\mu,\alpha_{k})\geq G_{w,k}-C(\mu,\alpha_{w})\quad\forall k,w\in\{1,\dots,K\}$$

$$I_{\beta}(\mu,\alpha_{k})=\begin{cases}\dfrac{1}{\beta-1}\ln\Big(\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\dfrac{p^{\beta}(x,a)}{\mu^{\beta-1}(x)p^{\beta-1}(a)}\Big)&\beta\in(0,1)\cup(1,\infty)\\[4pt] I(\mu,\alpha_{k})&\beta=1\\[4pt] -\ln\Big(\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\mu(x)p(a)\mathbf{1}\{p(x,a)>0\}\Big)&\beta=0\end{cases}$$

$$p_{k}^{*}(x,a)\in\operatorname*{argmax}_{p(x,a)}\Big\{\sum_{a\in\mathcal{A}_{k}}\sum_{x\in\mathcal{X}}p(x,a)u(x,a)\Big\}$$

$$\text{s.t.}\quad\mu(x)=\sum_{a\in\mathcal{A}_{k}}p(x,a)\quad\forall x\in\mathcal{X}$$

$$\text{s.t.}\quad I_{\beta}(\mu,\alpha_{k})\leq\kappa_{\text{max}},\quad p(x,a)\geq 0\quad\forall x\in\mathcal{X},a\in\mathcal{A}_{k}$$

$$u(x,a)=\frac{\lambda_{1}}{\beta-1}\eta^{\beta-1}(x,a)\,\mathbb{E}[\eta^{\beta-1}(x,a)]-\lambda_{2}$$

$$\frac{1}{\beta-1}\ln\big(\mathbb{E}[\eta^{\beta-1}(x,a)]\big)=\kappa_{\text{max}},\qquad\eta(x,a)=\frac{p(x|a)}{p(x)}$$

$$\mathcal{F}_{\Pi}=\{f_{\pi,k}:\mathcal{X}\times\mathcal{A}_{k}\times\mathcal{N}\to[0,1]\},\qquad f_{\pi,k}=M\frac{\pi_{k}(a|x,f)}{\hat{\pi}_{k}(a|x,f)}u(x,a,f)=M\bar{u}(\pi_{k}(a|x,f))$$

$$V(\pi_{k})\leq\hat{V}(\pi_{k})+\lambda\frac{\operatorname{Var}[\bar{u}(\pi_{k})]}{T_{k}}+\frac{15\lambda^{2}}{18M(T_{k}-1)}$$

$$\boldsymbol{\pi}(a|x,f)\in\operatorname*{argmax}_{\pi_{k}\in\Pi}\Big\{\sum_{k=1}^{K}V(\pi_{k}(a|x,f))-\bar{\lambda}_{k}\frac{\operatorname{Var}[\bar{u}(\pi_{k}(a|x,f))]}{T_{k}}\Big\}$$

$$\text{s.t.}\quad\sum_{a\in\mathcal{A}_{k}}\pi_{k}(a|x,f)=1,\quad\pi_{k}(a|x,f)\geq 0$$

$$\mathcal{L}(u(x,a,f),\pi_{k}(a|x,f))\quad\forall x\in\mathcal{X},\ \forall a\in\mathcal{A}_{k},\ \forall k\in\{1,\dots,K\},\ \forall f\in\{1,\dots,N\}$$

$$\sum_{x\in\mathcal{X}}p_{k}(x|a,f)[u(x,a,f)-u(x,b,f)]\geq 0$$

$$\sum_{k=1}^{K}\Big(\sum_{a\in\mathcal{A}_{k}}p_{k}(a,f)m_{k}(a,f)-\sum_{a\in\mathcal{A}_{k+1}}p_{k+1}(a,f)n_{k+1}(a,f)\Big)\geq 0$$

$$m_{k}(a,b,f)=\sum_{x\in\mathcal{X}}p_{k}(a|x,f)u(x,b,f)$$

$$m_{k}(a,f)\geq m_{k}(a,b,f)\quad\forall a,b\in\mathcal{A}_{k}$$

$$m_{k}(a,f)\leq m_{k}(a,b,f)+M(1-\delta_{b,f}),\qquad\sum_{b\in\mathcal{A}_{k}}\delta_{b,f}=1$$

$$n_{k+1}(a,f)\geq m_{k+1}(a,b,f)\quad\forall a\in\mathcal{A}_{k+1},\ \forall b\in\mathcal{A}_{k}$$

$$n_{k+1}(a,f)\leq m_{k+1}(a,b,f)+M(1-\zeta_{b,f}),\qquad\sum_{b\in\mathcal{A}_{k}}\zeta_{b,f}=1$$

$$u(x,a)\in[0,1],\qquad\delta_{b,k},\zeta_{b,k}\in\{0,1\}$$

$$\forall a,b\in\mathcal{A}_{k},\ c\in\mathcal{A}_{k+1},\ \forall k\in\{1,2,\dots,K\}$$

$$G_{k,k}-G_{w,k}\geq C(\mu,\alpha_{k})-C(\mu,\alpha_{w})$$

Taxonomy

Topics: Opinion Dynamics and Social Influence · Misinformation and Its Impacts · Advanced Bandit Algorithms Research

Full text

Estimating Rationally Inattentive Utility Functions with Deep Clustering for Framing - Applications in YouTube Engagement Dynamics

William Hoiles

Vikram Krishnamurthy, Cornell University

Abstract

We consider a framework involving behavioral economics and machine learning. Rationally inattentive Bayesian agents make decisions based on their posterior distribution, utility function and information acquisition cost (Rényi divergence which generalizes Shannon mutual information). By observing these decisions, how can an observer estimate the utility function and information acquisition cost? Using deep learning, we estimate framing information (essential extrinsic features) that determines the agent’s attention strategy. Then we present a preference based inverse reinforcement learning algorithm to test for rational inattention: is the agent an utility maximizer, attention maximizer, and does an information cost function exist that rationalizes the data? The test imposes a Rényi mutual information constraint which impacts how the agent can select attention strategies to maximize their expected utility. The test provides constructive estimates of the utility function and information acquisition cost of the agent. We illustrate these methods on a massive YouTube dataset for characterizing the commenting behavior of users.

1 Introduction

Suppose a Bayesian agent chooses an action at each time instant to maximize an expected utility function based on the noisy measurement of an underlying state. Assume that obtaining this noisy measurement is expensive – this information acquisition cost affects the action chosen by the agent. An observer records the dataset of actions of the Bayesian agent and knows the underlying state. How can the observer estimate the utility function and information acquisition cost of the agent given this dataset? Our aim is to construct preference based inverse reinforcement learning algorithms to obtain set valued estimates of the utility and information acquisition cost that are consistent with the dataset.

Our methodology stems from behavioural economics and machine learning: non-parametric estimation of utility functions and feature extraction using deep clustering to construct behavioural-economics based models for Bayesian agents. Let us briefly explain these two aspects. Estimating utility functions given a finite length time series of decisions is well studied in the area of revealed preferences in economics [36, 41] and more recently in machine learning. Also, costly information acquisition by Bayesian agents has been studied by economists and psychologists under the area of “rational inattention” pioneered by Sims [30, 31]. Rational inattention is a form of bounded rationality: the key idea is that human attention spans for information acquisition are limited and can be modelled in information theoretic terms as a Shannon capacity limited communication channel. However, modelling the information acquisition process is complicated in our case by framing. In behavioural economics, Kahneman uses “frames” to describe information an agent has when making a decision. For example, when selecting which product to purchase on a website, the positioning of the products and surrounding content on the website impacts how humans select a product. Given external information (image/text/numeric) in which the decision problem is embedded, how can one construct a tractable feature set? We develop deep embedded clustering methods to construct the frames used to test for rationally inattentive agents. The deep embedded clustering is based on [42, 12]; however, we design the input, encoder, and decoder to account for the visual perception of the frame of the decision problem, which includes image, text, and numeric information.

Context: (i) Rational Inattention & Inverse Reinforcement Learning. Sims’ rational inattention model is studied extensively in behavioral economics [24]. Woodford [41] considered an upper bound on the Shannon capacity for testing rational inattention with visual perception cues. Typically, the information acquisition costs faced by a decision maker are not known to the observer. A general test for rational inattention is proposed in [6, 5] with minimal restrictions on the information acquisition cost. The two significant extensions considered in this paper are the effects of framing (determined using deep embedded clustering) and the use of Rényi mutual information cost constraints for testing rational inattention. Our rational inattention test is equivalent to solving the temporal credit assignment problem in preference-based inverse reinforcement learning [39]. Such inverse reinforcement learning is used with non-numeric feedback [40], e.g. in socially adaptive path planning [19, 13] for robots.

Context: (ii) YouTube Application. We will use rational inattention and framing (with deep learning) on a massive YouTube data set to analyse the commenting behaviour of users in YouTube. Extensive studies [18, 22, 1] show that comments posted by users are influenced by the thumbnail, title, category, and perceived popularity of each video. In our formulation, frames are associated with the video’s thumbnail and title; the decision-problem with the category; and the perceived popularity with the underlying state. The commenting behavior (agent’s actions) is related to the number and sentiment of the comments that result from the framing information, state, and decision-problem faced by the agent. Based on extensive data analysis, our main take-home message (from a behavioral economics point of view) is that YouTube users are rationally inattentive in their commenting behavior; moreover users prefer to comment on videos that are perceived to be popular; see Sec.7 for additional conclusions.

Organization. Sec.2 introduces the problem formulation. Sec.3 discusses a deep embedded clustering algorithm for associating the observed agent’s action to specific frames. In Sec.4 and Sec.5, the decision tests for rational inattention with Rényi mutual information acquisition cost are provided. The tests are constructive: they provide estimates of the utility function, information acquisition cost, and attention strategy. Sec.6 provides Bernstein based finite sample performance bounds. Sec.7 applies the methods to a massive YouTube dataset to characterize the commenting behavior of users. The appendix summarizes the implementation details of the deep classifier.

2 Problem Formulation and Rational Inattention

We describe the problem formulation first from the point of view of the rationally inattentive agent, and then from the point of view of the observer that views the dataset generated by the agent. Despite our abstract formulation, the reader should keep in mind the YouTube context outlined above.

Viewpoint 1. Rationally Inattentive Bayesian Agent

Assume the agent knows the finite state space $\mathcal{X}$ and finite action space $\mathcal{A}$ . The agent’s prior beliefs of the possible states are given by the prior probability distribution $\mu(x)$ , $x\in\mathcal{X}$ . The attention function $\alpha(s|x)$ of the agent provides a distribution over the signals $s\in\mathcal{S}(\alpha)$ when the state is $x$ . The set of possible signals $\mathcal{S}(\alpha)$ for a given attention strategy $\alpha$ is finite. The attention function encodes all the information (signals, private information, and measurement mechanism) available to the agent to compute the posterior state distribution. Given the prior $\mu(x)$ , and attention function $\alpha(s|x)$ , the Bayesian agent computes the posterior distribution as

$$p(x|s)=\frac{\mu(x)\alpha(s|x)}{\sum_{y\in\mathcal{X}}\mu(y)\alpha(s|y)}. \tag{1}$$

The agent has utility function $u(x,a)$ over the states $x\in\mathcal{X}$ and actions $a\in\mathcal{A}$ .
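As an illustration of the posterior computation, the following minimal Python sketch applies Bayes' rule to a prior and an attention function stored as dictionaries. The function name `posterior`, the state and signal labels, and the numerical values are assumptions chosen for illustration; they do not come from the paper.

```python
from typing import Dict, Hashable

def posterior(mu: Dict[Hashable, float],
              alpha: Dict[Hashable, Dict[Hashable, float]],
              s: Hashable) -> Dict[Hashable, float]:
    """Return p(x|s) from the prior mu[x] and attention function alpha[x][s]."""
    # Unnormalized posterior weights mu(x) * alpha(s|x).
    weights = {x: mu[x] * alpha[x].get(s, 0.0) for x in mu}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("signal has zero probability under the prior")
    return {x: w / total for x, w in weights.items()}

# Hypothetical two-state example: perceived popularity is "low" or "high".
mu = {"low": 0.5, "high": 0.5}
alpha = {"low": {"s0": 0.8, "s1": 0.2},
         "high": {"s0": 0.2, "s1": 0.8}}
post_s1 = posterior(mu, alpha, "s1")
```

With these illustrative numbers, observing signal `s1` shifts the belief toward the `high` state.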

Definition 1.

An agent satisfies attention rationality if it selects actions $a\in\mathcal{A}$ and attention functions $\alpha(s|x)$ that satisfy the following conditions (where $\mathbb{E}$ denotes the expectation operator):

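i) Expected Utility Maximization:

$$a^{*}\in\operatorname*{argmax}_{a\in\mathcal{A}}\mathbb{E}\{u(x,a)|s\}=\operatorname*{argmax}_{a\in\mathcal{A}}\Big\{\sum_{x\in\mathcal{X}}p(x|s)u(x,a)\Big\}\quad\forall p(x|s)\in\mathcal{S}(\alpha) \tag{2}$$

ii) Attention Selection Rationality:

$$\alpha^{*}(s|x)\in\operatorname*{argmax}_{\alpha}\Big\{\mathbb{E}_{s\in\mathcal{S}(\alpha)}\Big\{\max_{a\in\mathcal{A}}\Big[\sum_{x\in\mathcal{X}}p(x|s)u(x,a)\Big]\Big\}-C(\mu,\alpha)\Big\} \tag{3}$$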

where $C(\mu,\alpha)$ is the cost (or disutility) of attention function $\alpha$ when the prior distribution is $\mu$ .

Eq.(2) states that the agent selects actions that are consistent with Bayesian utility maximization, and condition ii) states that the agent selects the attention strategy that maximizes the expected utility net of the attention cost $C(\mu,\alpha)$ .
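To make the two conditions of Definition 1 concrete, the sketch below evaluates the expected utility of an attention function and subtracts an attention cost. Definition 1 leaves $C(\mu,\alpha)$ general; here Shannon mutual information scaled by a hypothetical `cost_scale` is assumed purely for illustration, and all names and numbers are likewise assumptions.

```python
import math

def signal_probs(mu, alpha):
    """Marginal probability of each signal: sum_x mu(x) * alpha(s|x)."""
    out = {}
    for x, px in mu.items():
        for s, p_s in alpha[x].items():
            out[s] = out.get(s, 0.0) + px * p_s
    return out

def gross_value(mu, alpha, u, actions):
    """E_s[max_a sum_x p(x|s) u(x,a)], written as sum_s max_a sum_x mu(x) alpha(s|x) u(x,a)."""
    total = 0.0
    for s in signal_probs(mu, alpha):
        total += max(sum(mu[x] * alpha[x].get(s, 0.0) * u[(x, a)] for x in mu)
                     for a in actions)
    return total

def shannon_cost(mu, alpha):
    """Shannon mutual information between state and signal (assumed cost, in nats)."""
    ps = signal_probs(mu, alpha)
    return sum(px * p_s * math.log(p_s / ps[s])
               for x, px in mu.items()
               for s, p_s in alpha[x].items() if p_s > 0.0)

def net_value(mu, alpha, u, actions, cost_scale):
    """Objective of the attention-selection condition under the assumed cost."""
    return gross_value(mu, alpha, u, actions) - cost_scale * shannon_cost(mu, alpha)

# Hypothetical example: the agent is rewarded for matching its action to the state.
mu = {"low": 0.5, "high": 0.5}
actions = ["skip", "comment"]
u = {("low", "skip"): 1.0, ("low", "comment"): 0.0,
     ("high", "skip"): 0.0, ("high", "comment"): 1.0}
informative = {"low": {"s0": 0.8, "s1": 0.2}, "high": {"s0": 0.2, "s1": 0.8}}
uninformative = {"low": {"s0": 0.5, "s1": 0.5}, "high": {"s0": 0.5, "s1": 0.5}}
```

Under these assumed values the informative attention function is preferred at `cost_scale = 1.0`; a sufficiently large `cost_scale` reverses the ranking, which is the trade-off the attention-selection condition captures.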

Viewpoint 2. Observer’s Model and Deep Clustering of Frames

By observing the actions of the agent, the observer aims to determine if the agent is rationally inattentive, and if so, estimate the agent’s utility function and information acquisition cost. The observer has access to the dataset of states $x_{t}$ and actions $a_{t}$ chosen by the agent for time $t=1,\ldots,T$ :

$$\mathcal{D}=\{(x_{t},f_{t},a_{t})\}_{t=1}^{T}. \tag{4}$$

Here the parameter $f_{t}$ represents all the framing information immediately apparent to the agent. Typically, framing information $f_{t}$ includes images, video, text, and data. In our YouTube example, $f_{t}$ maps the title and thumbnail of a video to an integer representing a unique frame. Qualitatively, different values of $f_{t}$ determine different action policies by the agent for a given title and thumbnail. A major challenge when applying rational inattention theory is accounting for the agent’s framing effects that impact the agent’s behaviour. To account for framing effects, we assume there are $\{0,1,\dots,N\}$ possible frames. In Sec.3 a deep embedded clustering method is used to construct $f_{t}$ given the title and thumbnail of the YouTube video observed at time $t$ .

Given the set of frames, rational inattention theory aims to determine if the dataset $\mathcal{D}$ is consistent with a rational agent (Definition 1). To test for rational inattention we require estimates of the (possibly randomized) action selection policy $\pi(a|x,f)$ and prior beliefs $\mu(x)$ of the agent. Using $\mathcal{D}$ ,

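$$\hat{\pi}(a|x,f)=\frac{\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x,a_{t}=a,f_{t}=f\}}{\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x,f_{t}=f\}},\qquad\hat{\mu}(x)=\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{x_{t}=x\} \tag{5}$$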

are maximum likelihood estimates of these, where $\mathbf{1}\{\cdot\}$ is the indicator function. Given the maximum likelihood estimates (5), Sec.4 provides a decision test for rational inattention. For agents that satisfy the rational inattention test, methods to recover their utility function $u(x,a,f)$ , attention strategy $\alpha(s|x)$ , posterior distribution $s(x)$ , and information cost $C(\mu,\alpha)$ are provided.
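The maximum likelihood estimates are simple frequency counts over the dataset. A minimal sketch, assuming the dataset is a list of `(x, f, a)` tuples with an integer frame label; the function name and the toy records are hypothetical:

```python
from collections import Counter

def estimate_policy_and_prior(data):
    """Return (pi_hat, mu_hat) from a list of (state, frame, action) records.

    pi_hat[(a, x, f)] estimates pi(a|x,f); mu_hat[x] estimates mu(x).
    """
    T = len(data)
    joint = Counter((x, f, a) for x, f, a in data)   # counts of (x, f, a)
    context = Counter((x, f) for x, f, _ in data)    # counts of (x, f)
    state = Counter(x for x, _, _ in data)           # counts of x
    pi_hat = {(a, x, f): n / context[(x, f)] for (x, f, a), n in joint.items()}
    mu_hat = {x: n / T for x, n in state.items()}
    return pi_hat, mu_hat

# Hypothetical toy dataset with a single frame label 1.
data = [("high", 1, "comment"), ("high", 1, "comment"),
        ("high", 1, "skip"), ("low", 1, "skip")]
pi_hat, mu_hat = estimate_policy_and_prior(data)
```

Triples `(a, x, f)` that never occur are simply absent from `pi_hat`, which corresponds to an estimated probability of zero.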

For a rationally inattentive agent, it is desirable to have a risk-aware method to optimize the expected utility of the agent by adjusting their action selection policy $\pi(a|x,f)$ while keeping the attention strategy $\alpha(s|x)$ (measurement device) unchanged. The expected utility of a rationally inattentive agent for action-selection policies $\boldsymbol{\pi}(a|x,f)=\{\pi_{k}(a|x,f)\}_{k=1}^{K}$ over $K$ decision problems is

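$$V(\boldsymbol{\pi}(a|x,f))=\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}_{k}}\pi_{k}(a|x,f)\mu(x)u(x,a,f). \tag{6}$$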

In Sec.6 a penalized variance optimization method is presented for constructing action selection policies that maximize (6). The construction uses finite sample bounds on the total expected utility.

3 Constructing Preference and Policy Invariant Frames via Deep Learning

Here a deep embedding method is provided that learns the policy invariant frames of the agent. Specifically, a mapping of $f_{t}$ to $n_{t}\in\{1,\dots,N\}$ is constructed where for each $n\in\{1,\dots,N\}$ the behavior of the agent is invariant. In the YouTube social network the framing information available to the agent is comprised of the title and thumbnail of each video. Given that agents are ordinal preference invariant to minor variations in the title and thumbnail, it is possible to map the features $f_{t}$ to one of $\{1,\dots,N\}$ discrete frames learned using deep embedding.

The deep embedding method uses an autoencoder to construct the latent representation $z_{t}$ of $f_{t}$ , and includes a clustering layer to simultaneously learn how to associate each $f_{t}$ to one of $\{1,\dots,N\}$ discrete frames. A schematic of the clustering method is illustrated in Fig. 1.

The autoencoder comprises two deep neural networks: the first is the encoder that maps the input $f_{t}$ to the latent space representation $z_{t}$ , and the second is the decoder that maps the latent space representation $z_{t}$ back to a reconstruction $\hat{f}_{t}$ of the input, where $\hat{f}_{t}\approx f_{t}$ . To force the encoder to learn robust latent representations, the autoencoder is trained using corrupted versions of the input. Such an autoencoder is known as a denoising autoencoder [37, 4]. The denoising autoencoder encodes the input into the latent space representation, and attempts to remove the effect of the corruption process stochastically applied to the input of the autoencoder. Removing effects of the corruption process is performed by learning the statistical dependencies between the inputs. A detailed description of the denoising autoencoder architecture is in the Appendix with focus on the title and thumbnail of YouTube videos.

Though the latent space representation of the input has been used extensively for clustering, such methods are not guaranteed to preserve any intrinsic local structure of the framing data $f_{t}$ . To ensure the autoencoder both minimizes the reconstruction error and preserves the intrinsic local structure of the data, a clustering loss is added to the reconstruction loss. The loss of the deep embedded clustering method (Fig. 1) is:

$$L=\|s-g(f(w(s)+\varepsilon))\|_{2}^{2}+\operatorname{KL}(P||Q) \tag{7}$$

where $\operatorname{KL}(P||Q)$ is the Kullback-Leibler (KL) divergence between the discrete probability distributions $P$ and $Q$ . Here $Q$ is the prior probability distribution of cluster association between the latent variables $z_{t}$ and the associated frames $n_{t}$ . If we assume each cluster is generated from a Gaussian distribution with mean $\Psi_{n}$ , then the probability $q_{tn}$ of associating $z_{t}$ with cluster $n$ is given by the Student-t distribution:

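$$q_{tn}=\frac{(1+\|z_{t}-\Psi_{n}\|^{2})^{-1}}{\sum_{n'=1}^{N}(1+\|z_{t}-\Psi_{n'}\|^{2})^{-1}}\quad\forall n\in\{1,\dots,N\}. \tag{8}$$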

Given $Q$ , the distribution $P$ is designed to avoid degenerate clustering solutions which allocate most of the frames to a few clusters or assign a cluster to a sample outlier.

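$$p_{tn}=\frac{q_{tn}^{2}/F_{n}}{\sum_{n'=1}^{N}(q_{tn'}^{2}/F_{n'})},\qquad F_{n}=\sum_{t=1}^{T}q_{tn}. \tag{9}$$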

Here $P(z_{t}=n)=F_{n}/T$ is the probability that $z_{t}$ belongs to cluster $n$ , and $F_{n}$ is the clustering frequency.

From (8) and (9), if all the data-points are associated with a specific cluster this will increase the loss (7). Additionally, if the cluster is associated with several data points with low-confidence, this will also increase the loss (7). Minimizing the loss (7) can be interpreted as a form of self-training as $P$ depends on $Q$ . Specifically, in self-training we take an initial classifier and an unlabeled dataset, then label the dataset with the classifier in order to train on its own high confidence predictions. This ensures that the latent clusters are constructed to avoid outliers.
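The soft assignment $Q$ , the sharpened target $P$ , and the clustering term of the loss can be sketched in a few lines of plain Python. This is only an illustration of the formulas for $q_{tn}$ and $p_{tn}$ on fixed latent points; the autoencoder, its training loop, and the toy coordinates below are not from the paper and are assumptions.

```python
import math

def soft_assignments(z, centers):
    """q[t][n]: Student-t kernel between latent point z[t] and center n, row-normalized."""
    q = []
    for zt in z:
        kernel = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zt, c)))
                  for c in centers]
        total = sum(kernel)
        q.append([v / total for v in kernel])
    return q

def target_distribution(q):
    """p[t][n] proportional to q[t][n]^2 / F_n, with F_n the soft cluster frequency."""
    num_clusters = len(q[0])
    F = [sum(row[n] for row in q) for n in range(num_clusters)]
    p = []
    for row in q:
        w = [row[n] ** 2 / F[n] for n in range(num_clusters)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def kl_clustering_loss(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pt * math.log(pt / qt)
               for p_row, q_row in zip(p, q)
               for pt, qt in zip(p_row, q_row) if pt > 0.0)

# Hypothetical 2-D latent points forming two well-separated groups.
z = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0)]
centers = [(0.0, 0.0), (2.0, 2.0)]
q = soft_assignments(z, centers)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

Because $P$ squares and renormalizes $Q$ , confident assignments become more confident in the target, which is the self-training effect described above.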

The deep embedding method that maps $f_{t}$ to $n_{t}\in\{1,\dots,N\}$ is formalized in Algorithm 2. The pretraining step is used to initialize the encoder and decoder parameters prior to performing any clustering. This is a critical step as the initial latent space representation of $\{f_{t}\}_{t=1}^{T}$ is used to select the approximate locations of the $N$ latent space cluster centers $\Psi^{o}$ . Given the pretrained denoising autoencoder weights, we use the Lloyd heuristic algorithm to select the locations of the $N$ latent space cluster centers $\Psi^{o}$ . Given the cluster centers, the deep clustering method is applied to minimize the loss (7) by simultaneously adjusting the cluster associations and autoencoder weights. Note that in Algorithm 2, since the distribution $P$ (9) depends on the weights of the encoder, we update $P$ after $\zeta$ iterations. This reduces the probability of instability associated with cycling between adjusting weights and cluster associations. Algorithm 2 terminates when the change in cluster associations is below a threshold $\delta$ . To ensure that only frames $f_{t}$ that can be confidently associated with one invariant frame are retained, all frames that fail to satisfy $\operatorname{max}_{n}\{q_{tn}\}\geq\delta_{c}$ are discarded.

Given the preference and policy invariant frames $\{n_{t}\}_{t=1}^{T}$ , we substitute $n_{t}\rightarrow f_{t}$ in $\mathcal{D}$ (4). Using $\mathcal{D}$ with the invariant frames, Sec.4 and Sec.5 illustrate how to detect if the agent is rationally inattentive for different information cost constraints, and how to recover the utility functions.

4 Decision Test for Rational Inattention; Estimating Utility/ Attention Costs

Here we construct a decision test for rational inattention (Definition 1). The resulting preference-based inverse reinforcement learning algorithm uses the observed stochastic choice dataset $\mathcal{D}$ (4) and invariant frames $\{n_{t}\}_{t=1}^{T}$ . Theorem 1 is our main result and generalizes [5, 6]:

Theorem 1.

Dataset $\mathcal{D}$ (4) satisfies rational inattention (Definition 1) iff the action policy satisfies

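$$\pi_{k}(a|x,f)=\sum_{s\in\mathcal{S}(\alpha_{k})}\alpha_{k}(s|x,f)\eta_{k}(a|s),\qquad\mathcal{S}(\alpha_{k})=\{p_{k}(x|a,f):a\in\mathcal{A}_{k}\}$$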

where the choice function $\eta_{k}(a|s)$ is the probability of selecting action $a$ given the posterior associated with signal $s\in\mathcal{S}(\alpha_{k})$ . Additionally, one of the following two conditions must be satisfied.

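i) The utility $u(x,a,f)$ satisfies the following inequalities for decision problems $k=1,\ldots,K$ :

$$\sum_{x\in\mathcal{X}}p_{k}(x|a,f)[u(x,a,f)-u(x,b,f)]\geq 0\quad\forall a,b\in\mathcal{A}_{k},\ \forall f\in\{1,\dots,N\},\qquad p_{k}(x|a,f)=\frac{\mu(x)\pi_{k}(a|x,f)}{\sum_{y\in\mathcal{X}}\mu(y)\pi_{k}(a|y,f)}$$

Also, the attention function $\alpha_{k}(s|x,f)$ for each decision problem $k=1,\dots,K$ satisfies

$$\sum_{k=1}^{K}(G_{k,k}-G_{k+1,k})\geq 0,$$

$$G_{k,w}=\sum_{s\in\mathcal{S}(\alpha_{k})}\sum_{x\in\mathcal{X}}\mu(x)\alpha_{k}(s|x,f)\max_{b\in\mathcal{A}_{w}}\Big\{\sum_{y\in\mathcal{X}}s(y)u(y,b,f)\Big\},$$

$$\alpha_{k}(s|x,f)=\sum_{a\in\mathcal{A}_{k}}\pi_{k}(a|x,f)\mathbf{1}\{p_{k}(x|a,f)=s\},\quad\text{with }\mathcal{A}_{K+1}=\mathcal{A}_{1}.$$

ii) A utility function $u(x,a,f)$ exists that satisfies the constraints

$$\mathcal{L}(u(x,a,f))\ \text{for}\ f\in\{1,2,\dots,N\}$$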

where the mixed integer linear constraint set $\mathcal{L}$ is defined in the Supporting Material.

In Theorem 1, the decomposition of $\pi_{k}(a|x,f)$ ensures that the attention function $\alpha_{k}(s|x,f)$ and the choice function $\eta_{k}(a|s)$ are consistent with the observed action-selection policy $\pi_{k}(a|x,f)$ (5). The inequalities (10) in condition i) ensure that the agent satisfies Bayesian expected utility maximization. Intuitively, if the expected utility of taking action $a$ is higher than that of action $b$ , then $u(x,a,f)\geq u(x,b,f)$ . Additionally, the utility function must satisfy “cyclical consistency” in which ordinal relation cycles such as $u(x,a,f)\geq u(x,b,f)>u(x,c,f)>u(x,a,f)$ are not present. For readers familiar with revealed preference theory, this is analogous to the GARP conditions in Afriat’s theorem [36, 7] for testing utility maximization behavior. The constraints on $G_{k,w}$ in condition i) ensure the optimal attention function is selected by the agent for each decision problem. Qualitatively, $G_{k,w}$ gives the expected utility of using attention strategy $\alpha_{k}(s|x,f)$ in decision problem $w$ . The constraints (20) in Theorem 1 provide a method to simultaneously test if the agent is rationally inattentive, and to recover the ordinal utility $u(x,a,f)$ of their associated preferences. The evaluation involves determining if a feasible solution exists for a set of mixed-integer linear constraints.
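The expected-utility inequalities of condition i) can be checked directly once a candidate utility is fixed. The sketch below computes the revealed posteriors $p_{k}(x|a,f)$ from an observed policy and verifies the inequalities for a single decision problem and frame. It checks a given utility only; the paper's test instead searches for a feasible utility through mixed-integer linear constraints, which is not reproduced here. Function names and numbers are assumptions.

```python
def revealed_posteriors(mu, pi):
    """Return post[a][x] = p(x|a) from the prior mu[x] and policy pi[(a, x)] = P(a|x)."""
    actions = {a for a, _ in pi}
    post = {}
    for a in actions:
        w = {x: mu[x] * pi.get((a, x), 0.0) for x in mu}
        total = sum(w.values())
        if total > 0.0:  # actions that are never chosen reveal no posterior
            post[a] = {x: v / total for x, v in w.items()}
    return post

def satisfies_expected_utility_inequalities(mu, pi, u, tol=1e-12):
    """Check sum_x p(x|a) [u(x,a) - u(x,b)] >= 0 for every chosen a and every b."""
    post = revealed_posteriors(mu, pi)
    actions = {a for a, _ in pi}
    for a, p in post.items():
        for b in actions:
            gain = sum(p[x] * (u[(x, a)] - u[(x, b)]) for x in mu)
            if gain < -tol:
                return False
    return True

# Hypothetical policy: users mostly comment when perceived popularity is high.
mu = {"low": 0.5, "high": 0.5}
pi = {("comment", "high"): 0.8, ("skip", "high"): 0.2,
      ("comment", "low"): 0.2, ("skip", "low"): 0.8}
u_match = {("high", "comment"): 1.0, ("high", "skip"): 0.0,
           ("low", "comment"): 0.0, ("low", "skip"): 1.0}
u_mismatch = {("high", "comment"): 0.0, ("high", "skip"): 1.0,
              ("low", "comment"): 1.0, ("low", "skip"): 0.0}
```

With these illustrative values, `u_match` rationalizes the observed policy while `u_mismatch` does not, since under `u_mismatch` the agent would gain by swapping its actions.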

Notice that Theorem 1 places no restrictions on the information cost $C(\mu,\alpha)$ of using attention function $\alpha$ when the prior is $\mu$ . That is, if the constraints (20) are satisfied then the constraints

$$G_{k,k}-C(\mu,\alpha_{k})\geq G_{w,k}-C(\mu,\alpha_{w})\quad\forall k,w\in\{1,\dots,K\} \tag{13}$$

are guaranteed to be feasible. The constraints (13) ensure that the selected attention function $\alpha_{k}$ is optimal for the associated decision problem $(\mathcal{X},\mu,\pi_{k}(a|x,f),\mathcal{A}_{k})$ . The constraints (13) can be used to recover set valued estimates of the cost structure of the attention functions via a set of linear constraints; refer to the Supporting Material.

5 Rényi Entropy Information Acquisition Cost for Rational Inattention

In this section we impose a specific structure on the information acquisition cost which defines the attention strategy of a rationally inattentive agent. Sims’ pioneering work [31] uses Shannon mutual information; here the more general Rényi mutual information is considered. The Rényi mutual information between the prior $\mu(x)$ of the state and the selected attention strategy $\alpha_{k}(s|x)$ is

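$$I_{\beta}(\mu,\alpha_{k})=\begin{cases}\dfrac{1}{\beta-1}\ln\Big(\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\dfrac{p^{\beta}(x,a)}{\mu^{\beta-1}(x)p^{\beta-1}(a)}\Big)&\beta\in(0,1)\cup(1,\infty)\\[4pt] I(\mu,\alpha_{k})&\beta=1\\[4pt] -\ln\Big(\sum_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\mu(x)p(a)\mathbf{1}\{p(x,a)>0\}\Big)&\beta=0\end{cases} \tag{14}$$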

where $\beta\in[0,\infty)$ is the Rényi order. An important feature of (14) is that for $\beta\in[0,1]$ the information constraint is convex in the arguments $p(x,a)$ and $\mu(x)p(a)$ , and for $\beta>1$ the information constraint is convex in $\mu(x)p(a)$ and quasi-convex in $p(x,a)$ [34, 14, 43].
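The three cases of the Rényi mutual information (14) can be evaluated numerically from a joint distribution $p(x,a)$ . The following sketch assumes the joint is given as a dictionary keyed by `(x, a)` and returns the value in nats; the function name and the two toy distributions are illustrative assumptions.

```python
import math

def renyi_mutual_information(joint, beta):
    """Renyi mutual information of order beta between x and a, given joint[(x, a)] = p(x, a)."""
    px, pa = {}, {}
    for (x, a), v in joint.items():
        px[x] = px.get(x, 0.0) + v   # marginal over states, mu(x)
        pa[a] = pa.get(a, 0.0) + v   # marginal over actions, p(a)
    support = [(x, a, v) for (x, a), v in joint.items() if v > 0.0]
    if beta == 1.0:
        # Shannon mutual information.
        return sum(v * math.log(v / (px[x] * pa[a])) for x, a, v in support)
    if beta == 0.0:
        # Mass of the product of marginals on the support of the joint.
        return -math.log(sum(px[x] * pa[a] for x, a, _ in support))
    total = sum(v ** beta * (px[x] * pa[a]) ** (1.0 - beta) for x, a, v in support)
    return math.log(total) / (beta - 1.0)

# Hypothetical joints: independent state and action, and perfectly correlated ones.
independent = {(x, a): 0.25 for x in (0, 1) for a in (0, 1)}
correlated = {(0, 0): 0.5, (1, 1): 0.5}
```

For the independent joint the value is zero for every order, and for the perfectly correlated binary joint it equals $\ln 2$ for every order, so the order $\beta$ only matters for intermediate dependence structures.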

The Rényi entropy is useful for measuring the information acquisition cost since the parameter $\beta$ allows one to adjust the sensitivity of the cost to the shape of $\mu(x)$ and $\alpha_{k}(s|x)$ . Indeed, Rényi entropy of order $\beta$ includes the Hartley entropy, Shannon entropy, collision entropy and min entropy as special cases. In terms of (2), the Rényi information cost constrained decision problem is

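$$\begin{aligned}p_{k}^{*}(x,a)\in\operatorname*{argmax}_{p(x,a)}\ &\Big\{\sum_{a\in\mathcal{A}_{k}}\sum_{x\in\mathcal{X}}p(x,a)u(x,a)\Big\}\\ \text{s.t.}\quad&\mu(x)=\sum_{a\in\mathcal{A}_{k}}p(x,a)\quad\forall x\in\mathcal{X},\\ &I_{\beta}(\mu,\alpha_{k})\leq\kappa_{\text{max}},\quad p(x,a)\geq 0\quad\forall x\in\mathcal{X},a\in\mathcal{A}_{k}.\end{aligned} \tag{15}$$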

In (15), $\kappa_{\text{max}}$ represents the maximum “effort” the agent is willing to invest to estimate the state $x\in\mathcal{X}$ prior to taking the action $a\in\mathcal{A}_{k}$ in decision problem $k\in\{1,\dots,K\}$ .

Given that the objective function is linear and the constraint set is convex in (15) for $\beta\in[0,1]$ , necessary and sufficient conditions for the agent to satisfy rational inattention with the Rényi information cost constraint can be constructed using the Karush-Kuhn-Tucker (KKT) conditions. Formally:

Theorem 2.

A rationally inattentive agent with utility function $u(x,a)$ , observed joint-distribution $p(x,a)$ , and $\beta\in(0,1)$ satisfies the Rényi mutual information cost constraint (14) if and only if there exist constants $\lambda_{1}>0$ and $\lambda_{2}$ that satisfy the linear constraints

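$$\begin{aligned}u(x,a)&=\frac{\lambda_{1}}{\beta-1}\eta^{\beta-1}(x,a)\,\mathbb{E}[\eta^{\beta-1}(x,a)]-\lambda_{2},\\ \frac{1}{\beta-1}\ln\big(\mathbb{E}[\eta^{\beta-1}(x,a)]\big)&=\kappa_{\text{max}},\qquad\eta(x,a)=\frac{p(x|a)}{p(x)}\end{aligned} \tag{16}$$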

for all $x\in\mathcal{X}$ , $a\in\mathcal{A}$ where $\mathbb{E}[\cdot]$ is the expected value taken over the joint-distribution $p(x,a)$ . ∎

In Theorem 2, $\lambda_{1},\lambda_{2}$ are KKT multipliers of the Rényi cost information constraint and equality constraint in (15). Combining the linear equality constraints in Theorem 2 with the mixed integer linear program (20), yields a test for the Rényi information cost constraint and provides estimates of the associated utility function of the agent. Thus we have constructed a preference based inverse reinforcement learning algorithm for the utility and information acquisition cost of a Bayesian agent.

6 Finite Sample Performance Analysis of the Agent’s Action-Selection Policy

Thus far we have constructed estimates for an agent’s utility function and information acquisition cost by observing the agent’s behavior. Indeed, the maximum likelihood estimate of the agent’s action-selection policy is $\hat{\pi}(a|x,f)$ (5). An important question related to performance analysis of these estimators is: How far is the net utility obtained using this estimated policy (based on a finite dataset) from the actual net utility $V(\boldsymbol{\pi}(a|x,f))$ (6) which uses the true policy $\boldsymbol{\pi}(a|x,f)$ ?

Using an extension of the empirical Bernstein inequality to the space of continuous function classes

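$$\mathcal{F}_{\Pi}=\{f_{\pi,k}:\mathcal{X}\times\mathcal{A}_{k}\times\mathcal{N}\to[0,1]\},\qquad f_{\pi,k}=M\frac{\pi_{k}(a|x,f)}{\hat{\pi}_{k}(a|x,f)}u(x,a,f)=M\bar{u}(\pi_{k}(a|x,f)) \tag{17}$$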

we can construct a finite sample bound between the observed net utility $V(\hat{\boldsymbol{\pi}}(a|x,f))$ and an estimate of the net utility $V(\boldsymbol{\pi}(a|x,f))$ for the unobserved policy $\boldsymbol{\pi}(a|x,f)$ . In (17), $M$ is a normalization constant which ensures $f_{\pi,k}\in[0,1]$ , $\hat{\pi}_{k}(a|x,f)$ is the observed policy (5), and $\pi_{k}(a|x,f)$ is an unobserved policy. By bounding the function class (17) using the uniform covering number and employing the double-sampling method [2], Theorem 3 results.

Theorem 3.

Let $\bar{u}(\pi_{k})$ be a random variable with $T_{k}$ i.i.d. samples in $\mathcal{D}$. Then with probability $1-\gamma$ the random vector $(a_{t},x_{t})\sim\pi_{k}$, for a stochastic hypothesis class $\pi_{k}\in\Pi$, $T_{k}\geq 16$, and $\lambda=\sqrt{18\ln(10\mathcal{N}_{\infty}\{1/T_{k},\mathcal{F}_{\Pi},2T_{k}\}/\gamma)}$, satisfies

[TABLE]

where $\mathcal{N}_{\infty}\{1/T_{k},\mathcal{F}_{\Pi},2T_{k}\}$ is the uniform covering number. ∎

Theorem 3 provides a probabilistic bound between the estimated net utility $\hat{V}(\pi_{k})$ and the actual net utility $V(\pi_{k})$ that depends only on the dataset $\mathcal{D}$ and the coefficient $\lambda$. Therefore, to construct the true policy $\boldsymbol{\pi}(a|x,f)$, one would maximize the net utility $\hat{V}(\pi_{k})$ while minimizing the variance term with a coefficient $\bar{\lambda}\geq 0$. Note that in Theorem 3, $\lambda$ encodes the entropy of the function class $\mathcal{F}_{\Pi}$, which depends on the number of samples $T_{k}$, the uniform covering number $\mathcal{N}_{\infty}\{\cdot\}$, and $\gamma$, which measures the confidence of the estimate. For the function class (17), $\mathcal{N}_{\infty}\{\cdot\}$ is polynomial in the sample size $T_{k}$ [25, 35, 29]; this ensures that $\hat{V}(\pi_{k})\rightarrow V(\pi_{k})$ as the sample size increases.

Using the insights from Theorem 3, the mixed integer-linear program

[TABLE]

can be used to construct the optimal policy $\pi_{k}(a|x,f)$ that maximizes the net utility $V(\boldsymbol{\pi}(a|x,f))$ while ensuring the policy is consistent with rational inattention. The regularization term $\bar{\lambda}_{k}$ in (19) balances the maximization of the net utility $V(\boldsymbol{\pi}(a|x,f))$ against the finite-sample variance associated with estimating $V(\boldsymbol{\pi}(a|x,f))$ for policies $\boldsymbol{\pi}(a|x,f)$ that are different from $\hat{\boldsymbol{\pi}}(a|x,f)$. The lower the value of $\bar{\lambda}_{k}$, the more risk-seeking the generated optimal policy.
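The trade-off controlled by $\bar{\lambda}_{k}$ can be illustrated with a minimal sketch. It does not reproduce the mixed integer-linear program (19) or the exact bound of Theorem 3; it assumes a generic variance-penalized score (sample mean minus a multiple of the standard error) and hypothetical per-sample utilities for two candidate policies. The names `penalized_net_utility` and `select_policy` are illustrative only.

```python
import numpy as np


def penalized_net_utility(utilities, lam):
    """Sample mean minus lam times the standard error of the mean.

    ASSUMPTION: a generic variance-penalized score, not the exact
    objective of the program (19) in the text.
    """
    u = np.asarray(utilities, dtype=float)
    t = u.size
    mean = u.mean()
    var = u.var(ddof=1) if t > 1 else 0.0  # unbiased sample variance
    return mean - lam * np.sqrt(var / t)


def select_policy(samples_by_policy, lam):
    """Return the name of the candidate policy with the highest score."""
    scores = {name: penalized_net_utility(u, lam)
              for name, u in samples_by_policy.items()}
    return max(scores, key=scores.get)


# Hypothetical utility samples: "risky" has the higher mean but is noisy.
candidates = {
    "risky": [0.0, 2.0, 0.0, 2.0],
    "steady": [0.9, 0.9, 0.9, 0.9],
}
```

With `lam=0.0` the score reduces to the sample mean, so `select_policy(candidates, 0.0)` picks the noisy policy with the higher mean; with `lam=1.0` the variance penalty dominates and the low-variance policy is selected. This mirrors the statement that a lower regularization value produces a more risk-seeking policy.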

7 Rational Inattention & Utility Maximization in YouTube Social Network

Constructing utility-based preference models for how users interact and consume content in online social media platforms is important in social network analysis [18, 22]. YouTube is an interesting example of a social network since the interaction between users includes video content. Users interact on YouTube channels by posting comments and rating videos. Extensive empirical studies [18, 22, 1, 16, 15, 3] show that comments and ratings from users are influenced by the thumbnail, title, category, and perceived popularity of each video. Here we consider a massive YouTube dataset comprising 6 million videos across 25,000 channels and over a million users from April 2007 to May 2015. As is typical in behavioral economics [38], by user behavior we mean the average commenting behavior per YouTube channel, averaged over all the channels.

First, we constructed ordinal preference invariant frames using the deep embedded clustering of Algorithm 2. Recall that Algorithm 2 maps the high dimensional title and thumbnail space to one of $N$ unique frames. Here we chose $N=4$ and the embedding space to have dimension $200$. The shape of the resulting embedding space is displayed in Fig. 2. Selecting $N=4$ ensures each video is sufficiently isolated to a particular frame; fewer than 3% of videos are classified ambiguously in terms of frames.
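To make the notion of an ambiguously classified video concrete, the following sketch computes soft frame assignments with the Student's t kernel used in standard deep embedded clustering and flags points whose largest assignment probability falls below a threshold. The threshold of 0.5, the 2-dimensional toy embeddings, and the function names are assumptions for illustration; the text does not state how ambiguity was measured.

```python
import numpy as np


def soft_assignments(z, centroids, alpha=1.0):
    """Student's t soft assignment of embeddings z (n, d) to centroids (N, d)."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def ambiguous_fraction(q, threshold=0.5):
    """Fraction of points whose best frame has probability below threshold.

    ASSUMPTION: the threshold is a hypothetical ambiguity criterion.
    """
    return float((q.max(axis=1) < threshold).mean())


# Toy example with N = 4 frames in a 2-D embedding (the text uses 200-D).
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = np.array([[0.1, -0.2], [9.8, 0.3], [5.0, 5.0]])  # last point is equidistant
q = soft_assignments(z, centroids)
frames = q.argmax(axis=1)  # hard frame label per video
```

In this toy setup the first two points sit close to a centroid and receive a confident frame label, while the equidistant third point receives probability 0.25 for every frame and is counted as ambiguous.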

Next, for each of the preference invariant frames in Fig. 2, we apply the rational inattention test to determine if users are rationally inattentive. We find that the commenting behavior of users in YouTube is consistent with rational inattention for a general cost constraint. The ordinal utility of the users in each unique frame is provided in Fig. 3. As expected, the commenting behavior of the users differs between frames. Additionally, the users prefer to comment on videos that are expected to have a higher popularity compared with videos with lower popularity. If we impose the Rényi information cost constraint, we find that only the commenting behavior in frame $f=4$ is rationally inattentive. The associated utility, however, provides no clear preference ordering between the popularity of the video and the associated commenting behavior. This suggests that users are rationally inattentive with respect to a general information cost constraint.

Discussion. From a behavioral economics point of view, the above results yield useful insight into user behavior in online social multimedia. Based on extensive analysis of the YouTube dataset, our main conclusions are that users' commenting behavior (number of comments and comment sentiment) i) is consistent with rational inattention, ii) depends on the framing information available, iii) favors videos that are perceived to be popular, and iv) is influenced by the category of the video; see Supporting Material. It is remarkable that deep clustering adequately captures framing information, and that a preference-based utility with attention costs rationalizes the YouTube dataset. We speculate that this approach can be used to predict popularity of YouTube channels.

There is also considerable scope to generalize the utility function estimation described in this paper to stopping time problems involving partially observed Markov decision processes [20, 21].

Appendix A Denoising Autoencoder Architecture for YouTube Title and Thumbnail

A detailed description of the steps in the deep embedding method for constructing the preference invariant frames is provided in Algorithm 1, reproduced here in greater detail than in the main paper. The denoising autoencoder is comprised of a stacked long short-term memory (LSTM) network and a convolutional neural network (CNN), which are detailed in Sec. A.1 and Sec. A.2. To ensure the denoising autoencoder is robust to variations in the title and thumbnail input (i.e., has good generalization performance), we introduce noise into the input training data. Possible methods to introduce noise into the network include drop-out [32] and drop-path [17]. Here we apply Gaussian noise to the input images and the numeric representation of the words, and additionally include drop-out layers in the LSTM and CNN networks.
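The input corruption step can be sketched as follows. This is a minimal illustration, assuming zero-mean Gaussian noise with hypothetical standard deviations and pixel values scaled to $[0,1]$; the noise levels used by the authors are not given in the text. The array shapes follow the dimensions stated elsewhere in the paper ($40\times 80$ color thumbnails, 8 words per title, 25-dimensional word embeddings), and `corrupt_inputs` is an illustrative name.

```python
import numpy as np


def corrupt_inputs(thumbnails, title_vectors, sigma_img=0.1, sigma_txt=0.1, seed=0):
    """Add zero-mean Gaussian noise to thumbnails and word vectors.

    ASSUMPTION: the noise levels and the clipping of pixels to [0, 1]
    are illustrative choices, not values reported in the text.
    """
    rng = np.random.default_rng(seed)
    noisy_img = thumbnails + rng.normal(0.0, sigma_img, size=thumbnails.shape)
    noisy_txt = title_vectors + rng.normal(0.0, sigma_txt, size=title_vectors.shape)
    return np.clip(noisy_img, 0.0, 1.0), noisy_txt


# Two dummy videos: mid-grey thumbnails and all-zero title embeddings.
thumbnails = np.full((2, 40, 80, 3), 0.5)
titles = np.zeros((2, 8, 25))
noisy_thumbnails, noisy_titles = corrupt_inputs(thumbnails, titles)
```

A denoising autoencoder is then trained to reconstruct the clean `thumbnails` and `titles` from their noisy counterparts, so the corrupted arrays are used only as network inputs and never as reconstruction targets.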

A.1 Text Processing of the YouTube Title

The design of autoencoders for text data is challenging as a result of the power-law distribution of words and the long-range dependencies (grammars) between words. To address these challenges, we use previously constructed word embeddings to convert the words into numeric vectors. We then employ LSTM networks for the encoder and decoder blocks of the autoencoder that handle text processing. The combination of word embeddings and LSTMs allows the network to utilize prior knowledge of similar words while simultaneously learning how to cluster similar sentences into a unique frame.

Prior to transforming the words into their numeric embedding, we apply a lemmatization transformation. Lemmatization reduces the number of word variations that must be considered, as it groups all the inflected forms of a word into a single base representation. For example, the verb “to walk” may appear as “walk”, “walked”, “walks”, or “walking”, all of which are converted to “walk” via the lemmatization transformation. To perform the lemmatization transformation we use the WordNet lemmatizer (https://wordnet.princeton.edu/wordnet/). The WordNet lemmatizer is comprised of two resources: a set of rules which identify the inflectional endings that can be detached from individual words, and a list of exceptions for irregular word forms. WordNet first checks the exceptions, then removes any inflectional endings from the words.

Having performed the lemmatization operation, we now construct numeric vector representations of the words. A popular method to perform this task is to use distributed representations of words (i.e., word embeddings). The distributed representations of words in a vector space are designed such that words with similar semantic meaning have similar latent space representations. Equivalently, words with similar meaning will cluster together in the word embedding space. Two popular word embeddings are the Word2Vec [26] and GloVe [28] models. For the clustering algorithm we use the GloVe embedding that was constructed using over 2 billion tweets and is comprised of over 1.2 million words. The possible dimensions of the word embedding space are 25, 50, 100, or 200. Here we use a word embedding dimension of 25.
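The two-stage lemmatization logic described above (exceptions first, then detachment of inflectional endings) can be sketched with a toy example. This is not the WordNet implementation: the exception list, detachment rules, and lexicon below are tiny hypothetical stand-ins, and `lemmatize` is an illustrative name.

```python
# ASSUMPTION: toy exception list, detachment rules, and lexicon; the real
# WordNet resources are far larger.
EXCEPTIONS = {"ran": "run", "went": "go", "mice": "mouse"}
DETACHMENT_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]
LEXICON = {"walk", "run", "go", "mouse", "play", "story", "video"}


def lemmatize(word):
    """Check the exception list first, then try detaching inflectional endings."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:  # accept only known base forms
                return candidate
    return word  # unknown words are left unchanged


title = "Walking mice ran stories"
lemmas = [lemmatize(w) for w in title.split()]
```

Each lemma would then be looked up in the pre-trained embedding table to obtain its 25-dimensional vector. Requiring the candidate to appear in a lexicon prevents the rules from stripping endings off words that merely happen to end in "s" or "ed".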

Given the word embeddings of the sentence $w(f)$, we use an LSTM encoder-decoder framework to learn latent space representations of the titles [10, 33, 11, 9]. To construct the latent space representation of the sentences, we utilize a stacked LSTM architecture. Note that stacked LSTMs are able to capture grammatical information in the title at different scales. It was illustrated in [10, 33] that stacked LSTMs tend to have superior predictive performance compared to single-layer LSTMs for natural language processing tasks.

A.2 Image Processing of the YouTube Thumbnail

In the denoising autoencoder, image processing is performed using a VGG-based architecture. Given the latent space representation $z_{t}$ from the encoder, the image decoder is used to reconstruct the original input image. Performing this task requires the use of deconvolution and upsampling layers. However, deconvolution layers are not used in CNN autoencoders. Instead, a mixture of convolutional and upsampling layers is employed. In the most extreme case, a single upsampling layer can be used to directly reconstruct the images from the latent space, as illustrated in [23]. A commonly used method is to construct multiple transposed convolution (also known as fractionally strided convolution) layers in combination with upsampling layers. Using the transposed convolution layers instead of the standard convolution layers ensures that “checkerboard” artifacts are removed from the decoded image [27].

Appendix B Constraint Set $\mathcal{L}(u(x,a,f))$ for Rational Inattention (Theorem 1) and Recovery of Utility and Information Cost

Constructing the utility function $u(x,a,f)$ of the agent for the observed stochastic dataset $\mathcal{D}$ (4) requires that the utility satisfies the inequalities (10) for Bayesian utility maximization and (11) for attention function maximization. The utility function $u(x,a,f)$ of a rationally inattentive agent must satisfy the following mixed-integer linear constraints:

[TABLE]

with $\mathcal{A}_{K+1}=\mathcal{A}_{1}$ and $M$ a large constant. Whether a utility function $u(x,a,f)$ satisfying the constraint set exists can be determined using a variety of numerical methods, including branch-and-bound, cutting planes, branch-and-cut, and branch-and-price [8].

Given the utility function $u(x,a,f)$ from the solution of (12), and the inequality relation (13), an ordinal estimate of the associated cost of information $C(\mu,\alpha_{k})$ of each attention strategy $\alpha_{k}$ can be constructed. Specifically, the ordinal cost of information $C(\mu,\alpha_{k})$ can be computed by solving the following linear program:

[TABLE]

Recall that if a solution to (12) exists, then a solution to (21) is guaranteed to exist from Theorem 1 and (3). Notice that if the cost of a particular attention strategy is zero, then absolute bounds can be placed on the information cost of each attention strategy. For example, if $C(\mu,\alpha_{w})=0$, then the cost $C(\mu,\alpha_{k})\in[G_{k,w}-G_{w,w},G_{k,k}-G_{w,k}]$. The estimated cost function satisfies weak monotonicity in information; that is, if the attention function provides more information, then it will have a higher information cost. However, it may be the case that the actual cost of information used by the agent does not satisfy this condition. In fact, requiring only rational inattention with no further restrictions on information cost does not impose any testable conditions for information monotonicity.
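The interval $[G_{k,w}-G_{w,w},\,G_{k,k}-G_{w,k}]$ can be evaluated for all attention strategies at once. The sketch below is a direct transcription of that interval and assumes the matrix $G$ is already available as a $K\times K$ array indexed in the same order as in the text; the numerical values and the name `cost_bounds` are hypothetical.

```python
import numpy as np


def cost_bounds(G, w):
    """Bounds on C(mu, alpha_k) for every k when C(mu, alpha_w) = 0.

    G[k, j] follows the index order G_{k,j} of the text (0-based here).
    Returns an array of shape (K, 2) holding [lower, upper] per strategy.
    """
    G = np.asarray(G, dtype=float)
    lower = G[:, w] - G[w, w]      # G_{k,w} - G_{w,w}
    upper = np.diag(G) - G[w, :]   # G_{k,k} - G_{w,k}
    return np.stack([lower, upper], axis=1)


# Hypothetical 2x2 matrix; strategy 0 is taken as the zero-cost strategy.
G_example = np.array([[3.0, 1.0],
                      [2.0, 4.0]])
bounds = cost_bounds(G_example, w=0)
```

For the reference strategy $k=w$ both bounds collapse to zero, which is consistent with the normalization $C(\mu,\alpha_{w})=0$.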

Appendix C Estimating the Agent’s Attention Function and Choice Function

If the dataset $\mathcal{D}$ satisfies rational inattention, it is also possible to estimate the agent’s attention function $\alpha_{k}(s|x)$ and choice function $\eta_{k}(a|s)$.

Constructing the agent’s attention function $\alpha_{k}(s|x)$ and choice function $\eta_{k}(a|s)$ requires the posterior distribution $p_{k}(x|a)$. First, consider the signal set $\mathcal{S}(\alpha_{k})$ of all observed posterior state distributions of the agent for attention function $\alpha_{k}(s|x)$ using

[TABLE]

Each posterior distribution $p_{k}(x|a)$ is associated with a single signal $s\in\mathcal{S}(\alpha_{k})$ . The posterior distribution $p_{k}(x|a)$ in (22) is equal to the true posterior distribution $p_{k}(x|s)$ in (1) only if the choice function $\eta_{k}(a|s)$ produces a single action $a\in\mathcal{A}_{k}$ for each $s\in\mathcal{S}(\alpha_{k})$ with probability one. Otherwise the posterior distribution $p_{k}(x|a)$ is given by the weighted sum

[TABLE]

Note that without explicit knowledge of the choice and attention functions of the agent, the stochastic choice dataset cannot be used to determine if $p_{k}(x|a)=p_{k}(x|s)$. Having $p_{k}(x|a)=p_{k}(x|s)$ is not required to determine if the agent satisfies rational inattention.
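The observed posterior $p_{k}(x|a)$ can be computed from the prior and the observed action-selection policy. The sketch below assumes the standard Bayes' rule form $p(x|a)=\mu(x)\pi(a|x)/\sum_{x'}\mu(x')\pi(a|x')$; the displayed equation (22) is not reproduced here, so this correspondence is an assumption, as are the function name and the toy numbers.

```python
import numpy as np


def revealed_posteriors(prior, policy):
    """Posterior p(x|a) from a prior mu(x) and a policy pi(a|x).

    prior:  shape (X,)
    policy: shape (X, A), rows sum to one
    Returns (posterior, p_a) with posterior[x, a] = p(x|a).
    Actions that are never chosen get a NaN column.
    """
    prior = np.asarray(prior, dtype=float)
    policy = np.asarray(policy, dtype=float)
    joint = prior[:, None] * policy   # p(x, a)
    p_a = joint.sum(axis=0)           # marginal p(a)
    with np.errstate(divide="ignore", invalid="ignore"):
        posterior = joint / p_a
    return posterior, p_a


# Hypothetical example with two states and two actions.
mu = np.array([0.5, 0.5])
pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])
posterior, p_a = revealed_posteriors(mu, pi)
```

Each column of `posterior` is one observed posterior state distribution; distinct columns correspond to distinct elements of the signal set $\mathcal{S}(\alpha_{k})$.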

Given $p_{k}(x|a)$ , for each signal $s\in\mathcal{S}(\alpha_{k})$ , the associated attention function is

[TABLE]

where the second equality results from using the data matching condition in Theorem 1. Note that (24) is only equal to the agent’s attention function $\rho_{k}(r|x)$ if the observed and true posterior distributions are equal. If $\rho_{k}(r|x)$ is the true attention function then

[TABLE]

It must be the case that the observed attention strategy $\alpha_{k}(s|x)$ is weakly less informative than the true attention strategy $\rho_{k}(r|x)$. Equivalently, the observed attention strategy is a noisy version of the true attention strategy. Theorem 1, however, does not require that we know the true attention strategy $\rho_{k}(r|x)$ of the agent to test if the agent’s behavior satisfies rational inattention.

The observed choice function of the agent is given by

[TABLE]

which is merely the ratio of the number of times action $a\in\mathcal{A}_{k}$ was selected over all other possible actions $b\in\mathcal{A}_{k}$ for the posterior distribution $s\in\mathcal{S}(\alpha_{k})$. The observed choice function provides no information on the true choice function over the posterior distributions $r\in\Gamma(\rho_{k})$ that result from the true attention function unless the actual and observed posterior distributions are equal. Note, however, that the observed attention function $\alpha_{k}(s|x)$ (24) and choice function $\eta_{k}(a|s)$ (26) are consistent with the agent’s observed action-selection policy $\pi_{k}(a|x)$, as required by the data matching condition of Theorem 1.
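The construction of the observed attention and choice functions, and the data matching property, can be checked numerically. The sketch below is one reading of the description above and is an assumption, since equations (24) and (26) are not reproduced here: actions that reveal the same posterior are grouped into one signal, $\alpha(s|x)$ is the total probability of those actions in state $x$, and $\eta(a|s)$ is the relative frequency of $a$ among the actions in $s$. Function and variable names are illustrative.

```python
import numpy as np


def revealed_attention_and_choice(prior, policy, decimals=8):
    """Group actions by posterior into signals; return (signals, alpha, eta).

    alpha[x, s] is the observed attention function, eta[s, a] the observed
    choice function. ASSUMPTION: posteriors are matched after rounding.
    """
    prior = np.asarray(prior, dtype=float)
    policy = np.asarray(policy, dtype=float)
    joint = prior[:, None] * policy
    p_a = joint.sum(axis=0)
    groups = {}
    for a in range(policy.shape[1]):
        if p_a[a] > 0:  # unobserved actions reveal no posterior
            key = tuple(np.round(joint[:, a] / p_a[a], decimals))
            groups.setdefault(key, []).append(a)
    signals = list(groups.values())
    alpha = np.zeros((policy.shape[0], len(signals)))
    eta = np.zeros((len(signals), policy.shape[1]))
    for s, actions in enumerate(signals):
        alpha[:, s] = policy[:, actions].sum(axis=1)
        eta[s, actions] = p_a[actions] / p_a[actions].sum()
    return signals, alpha, eta


# Hypothetical policy in which actions 0 and 1 reveal the same posterior.
prior = np.array([0.5, 0.5])
policy = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.1, 0.7]])
signals, alpha, eta = revealed_attention_and_choice(prior, policy)
```

In this example `alpha @ eta` reproduces `policy`, i.e. $\sum_{s}\alpha(s|x)\,\eta(a|s)=\pi(a|x)$, which is the data matching property referred to in the text.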

Appendix D YouTube Dataset and Definition of the Frames, Context, Action, and Decision-Problem

To construct $\mathcal{D}$, we use the real-world YouTube dataset comprising 6 million videos across 25,000 channels from April 2007 to May 2015. The YouTube data contains the view counts, comment counts, likes, dislikes, thumbnail, title, and category of each video. The frame instance $f_{t}$ of each video is comprised of the video’s thumbnail and title. Specifically, we use a $40\times 80$ pixel color image to represent the thumbnail (which is a resized version of the native $246\times 138$ pixel thumbnails used in YouTube). For the title, we only include the first 8 words of the title in the framing instance $f_{t}$ (over 90% of the videos have a title of length 8 words or less). The top category of videos in the YouTube dataset is “Gaming”, which comprises 44% of all the videos. Two decision-problems are considered in the dataset. The first, $k=1$, is associated with all videos that have category “Gaming”, while decision-problem $k=2$ is associated with videos that are not in the “Gaming” category. The state $x_{t}$ of each video is associated with the view count of the video 14 days after the video was published. Specifically, state $x=1$ is a high view count, where the view count is above 10,000 views, while $x=2$ results otherwise. The associated action $a_{t}$ is related to the commenting behavior of the agents, which is computed using the comment count, like count, and dislike count 2 days after the video is published. The possible actions are: $a=1$, low comment count with negative sentiment; $a=2$, low comment count with neutral sentiment; $a=3$, low comment count with positive sentiment; $a=4$, high comment count with negative sentiment; $a=5$, high comment count with neutral sentiment; and $a=6$, high comment count with positive sentiment. Here the sentiment is negative if the difference between the like count and dislike count is below -25, neutral if the difference is between -25 and 25, and positive if the difference is above 25. The comment count is low if there are fewer than 100 comments, and high otherwise.
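The discretization above can be written as three small functions. This is a minimal sketch of the stated thresholds; the function and argument names are hypothetical, and the treatment of boundary values (a like/dislike difference of exactly -25 or 25) is an assumption because the text does not specify it.

```python
def decision_problem(category):
    """k = 1 for "Gaming" videos, k = 2 otherwise."""
    return 1 if category == "Gaming" else 2


def encode_state(view_count_day14):
    """x = 1 (high view count) above 10,000 views, x = 2 otherwise."""
    return 1 if view_count_day14 > 10_000 else 2


def encode_action(comments, likes, dislikes):
    """Map day-2 counts to the action index a in {1, ..., 6}.

    ASSUMPTION: a difference of exactly -25 or 25 is treated as neutral;
    the text does not specify the boundary cases.
    """
    diff = likes - dislikes
    if diff < -25:
        sentiment = 0  # negative
    elif diff > 25:
        sentiment = 2  # positive
    else:
        sentiment = 1  # neutral
    high = 0 if comments < 100 else 1  # fewer than 100 comments is "low"
    return 1 + sentiment + 3 * high
```

For instance, a video with 500 comments, 10 likes, and 100 dislikes two days after publication maps to $a=4$ (high comment count, negative sentiment).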

Bibliography


[1] S. Alhabash, J. Baek, C. Cunningham, and A. Hagerstrom. To comment or not to comment?: How virality, arousal level, and commenting behavior on YouTube videos affect civic behavioral intentions. Computers in Human Behavior, 51:520–531, 2015.
[2] M. Anthony and P. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
[3] A. Aprem and V. Krishnamurthy. Utility change point detection in online social media: A revealed preference framework. IEEE Transactions on Signal Processing, 65(7), April 2017.
[4] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
[5] A. Caplin and M. Dean. Revealed preference, rational inattention, and costly information acquisition. The American Economic Review, 105(7):2183–2203, 2015.
[6] A. Caplin and D. Martin. A testable theory of imperfect perception. The Economic Journal, 125(582):184–202, 2015.
[7] W. Diewert. Afriat’s theorem and some extensions to choice under uncertainty. The Economic Journal, 122(560):305–331, 2012.
[8] K. Genova and V. Guliashki. Linear integer programming methods and approaches–a survey. Journal of Cybernetics and Information Technologies, 11(1), 2011.
