Local non-Bayesian social learning with stubborn agents
Daniel Vial, Vijay Subramanian

TL;DR
This paper investigates how stubborn agents spreading false information can disrupt social learning, showing that beliefs learned correctly on short timescales are overwritten on longer timescales, and devises strategies for seeding stubborn agents that expose vulnerabilities in social learning.
Contribution
It analyzes a non-Bayesian social learning model with stubborn agents, shows how the learning horizon and the density of stubborn agents determine whether misinformation prevails, and devises strategies for seeding stubborn agents that outperform intuitive heuristics.
Findings
Agents learn the true state initially but forget it over time.
Seeding stubborn agents can effectively disrupt correct learning.
Proposed seeding strategies outperform intuitive heuristics at disrupting learning.
Abstract
We study a social learning model in which agents iteratively update their beliefs about the true state of the world using private signals and the beliefs of other agents in a non-Bayesian manner. Some agents are stubborn, meaning they attempt to convince others of an erroneous true state (modeling fake news). We show that while agents learn the true state on short timescales, they "forget" it and believe the erroneous state to be true on longer timescales. Using these results, we devise strategies for seeding stubborn agents so as to disrupt learning, which outperform intuitive heuristics and give novel insights regarding vulnerabilities in social learning.
| Name | Description | Nodes | Edges |
|---|---|---|---|
| Gnutella | Peer-to-peer network | 6,301 | 20,777 |
| Wiki-Vote | Wiki admin elections | 7,115 | 103,689 |
| Pokec | Slovakian social network | 1,632,803 | 30,622,564 |
| LiveJournal | Blogging platform | 4,847,571 | 68,993,773 |
V. Subramanian is a Senior Member, IEEE. We are grateful for financial support from the NSF (grants EPCN:1603861 and CIF:AF:2008130) and LGE Inc. via Mcity. D. Vial is with the University of Texas, Austin, TX (email: [email protected]). V. Subramanian is with the University of Michigan, Ann Arbor, MI (email: [email protected]).
1 Introduction
With the rise of social networks, people increasingly receive news through non-traditional sources. One recent study shows that two-thirds of American adults have gotten news through social media [1]. Such news sources are fundamentally different from traditional ones like print media and television, in the sense that social media users read and discuss news on the same platform. As a consequence, users turning to these platforms for news receive information not only from major publications but from other users as well; in the words of [2], a user “with no track record or reputation can in some cases reach as many readers as Fox News, CNN, or the New York Times.” This phenomenon famously reared its head during the 2016 United States presidential election when fake news stories were shared tens of millions of times [2], and it has remained a critical issue in 2020 [3].
In this paper, we study a mathematical model describing this situation. The model includes a set of agents attempting to learn the true state of the world (e.g. which of two candidates is better suited for office). Each agent iteratively updates its belief (i.e. its distribution over possible states) in a manner similar to the non-Bayesian social learning model of [4] using information from three sources. First, each agent receives noisy observations of the true state, modeling e.g. news stories. Second, each agent observes the beliefs of a subset of other agents, modeling e.g. discussions with other social media users. Third, each agent may observe the beliefs of stubborn agents or bots who aim to persuade others of an erroneous true state, modeling e.g. users spreading fake news. (Footnote 1: The term stubborn agents has been used in the literature to describe such agents; the term bots is used in reference to automated social media accounts spreading fake news while masquerading as real users [5].) This process continues iteratively until a finite learning horizon.
Under this model, two competing forces emerge as the learning horizon grows. On the one hand, agents receive more observations of the true state, which help them learn. On the other hand, the beliefs of the bots gradually propagate through the system, suggesting that agents become increasingly exposed to bots and thus less likely to learn. Hence, while the horizon clearly affects the learning outcome, the nature of this effect – namely, whether learning becomes more or less likely as the horizon grows – is less clear.
This effect of the learning horizon has often been ignored in works with models similar to ours. For example, our model is nearly identical to that in the empirical work [6], in which the authors show that polarized opinions can arise when there are two types of bots with diametrically opposed viewpoints. However, the experiments in [6] simply fix a large learning horizon and do not consider the effect of varying it. Models similar to ours have also been treated analytically in e.g. [4, 7, 8, 9], but these works consider infinite horizons and/or cooperative settings (i.e. no stubborn agents). See Section 5 for details on these (and other) works.
In the first part of the paper (see Section 3), we argue that the learning horizon plays a prominent role when stubborn agents are present and should not be ignored. In particular, we show that the learning outcome depends on the relationship between the horizon and a quantity that describes the “density” of bots in the system, where both quantities may vary with the number of agents . Mathematically, letting denote the true state and the mean of the belief (hereafter, the estimate) for a uniformly random agent at the horizon , we show (see Theorem 1) (Footnote 2: The theorem also addresses the case .)
[TABLE]
Here is smaller when more bots are present and 0 is the erroneous true state promoted by the bots. Hence, in words, (1) says the following: if there are sufficiently few bots, in the sense that , learns the true state; if there are sufficiently many bots, in the sense that , adopts the extreme estimate 0 promoted by the bots. Additionally, we show the belief of converges to a Dirac measure in a certain sense (see Corollary 1).
We note the result in (1) assumes a particular random graph model for the social network connecting agents and bots (a modification of the so-called directed configuration model). For such models, phase transitions – wherein small changes to model parameters lead to starkly different behaviors – are often observed. In this case, assuming for some , and also assuming , the learning outcome suddenly drops from to [math] as changes from e.g. to . Put differently, agents initially (at time ) learn the true state, then suddenly (at time ) “forget” the true state and adopt the extreme estimate [math]. Hence, we show the horizon can lead to drastically different outcomes. We also note proving (1) involves analyzing hitting probabilities for random walks on random graphs with absorbing states (bots in our setting), which may be of independent interest.
In the second part of the paper (see Section 4), we study a setting in which an adversary chooses how many bots to connect to each agent, subject to a budget constraint. The adversary would like to minimize (i.e. to convince agents of the erroneous state [math]), but this quantity depends on the graph topology, which is not publicly available for social networks like Twitter. Hence, motivated by (1), we formulate the adversary’s problem as minimizing , which only depends on the degrees in the graph – e.g. number of followers on Twitter, which is publicly available. We clarify that is monotone in only as for the random graph of Section 3 (see Theorem 1). Thus, we use as a tractable (albeit nonrigorous) surrogate for the true objective function , and we show empirically that these quantities are closely correlated for real social networks (see Figure 2). Alternatively, given a target , we can minimize the horizon when this target estimate is reached. However, we view as fixed and thus do not pursue this dual problem.
Minimizing amounts to solving an integer program, which can be done in polynomial time owing to the structure of . However, the computational complexity is , which is infeasible for social networks like Twitter. Thus, we propose a randomized approximation algorithm that runs in time and that produces a constant-fraction approximation of the optimal solution with high probability (see Theorem 2). Moreover, whereas the logic of the optimal solution is somewhat opaque, the form of our approximate solution offers the interpretation that successful adversaries carefully balance agents’ influence and susceptibility to influence. For a social network like Twitter, this means targeting users with many followers (i.e. influential users) who follow very few users themselves, so that fake news will occupy a greater portion of the targeted users’ feeds. While somewhat intuitive, the precise form of the randomized scheme is far from obvious. Furthermore, empirical results show that our scheme disrupts learning to a larger extent than schemes that more obviously balance influence and susceptibility. Thus, we believe our analysis provides new insights into vulnerabilities of news sharing platforms and non-Bayesian social learning models.
The paper is organized as follows. In Section 2, we define our learning model. Sections 3 and 4 follow the outline above. We discuss related work in Section 5.
Notational conventions: The following notation is used frequently. For , we let , and for we let . All vectors are treated as row vectors. We let denote the vector with 1 in the -th position and 0 elsewhere. We denote the set of nonnegative integers by . We use for the indicator function, i.e. if is true and 0 otherwise. All random variables are defined on a common probability space , with denoting expectation, denoting convergence in probability, and meaning almost surely.
2 Learning model
We begin by defining the model of social learning studied throughout the paper. The basic ingredients are (1) a true state of the world, (2) a social network connecting two sets of nodes, some who aim to learn the true state and some who wish to persuade others of an erroneous true state, and (3) a learning horizon. We discuss each in turn.
The true state of the world is a constant . For example, in an election between candidates representing two political parties (say, Party 1 and Party 2), and means the Party 1 and 2 candidates are superior, respectively. We emphasize that is a deterministic constant and depends neither on time, nor on the number of nodes in the system.
A directed graph connects disjoint sets of nodes and . We refer to elements of as regular agents, or simply agents, and elements of as stubborn agents or bots. While agents attempt to learn the true state , bots aim to disrupt this learning and convince agents that the true state is instead 0. In the election example, agents represent voters who study the two candidates to learn which is superior, while bots are loyal to Party 1 and aim to convince agents that the corresponding candidate is superior (despite possible evidence to the contrary). Edges in the graph represent connections in a social network over which nodes share beliefs in a manner that will be described shortly. An edge means that observes ’s belief. Let and ; we assume .
Agents and bots share beliefs until a learning horizon . We will allow the horizon to depend on the number of agents and will thus denote it by at times. In the election example, represents the duration of the election, i.e. the number of time units that agents can learn about the candidates and that bots can attempt to convince agents of the superiority of the Party 1 candidate.
Given these basic ingredients, we can define the learning process. At time , agent has a belief, where and for some that do not depend on . For each , receives the signal . In the absence of a network, the Bayesian approach dictates that update its parameters to and and its belief to , namely, for any (measurable) ,
[TABLE]
In our running example, and represent the number of news stories favorable to respective parties that has read during the election, plus some prior parameters and that account for ’s biases from before the election. As grows, the belief converges to a Dirac measure on its mean ; intuitively, becomes increasingly confident that the true state is the fraction of stories favorable to a certain party.
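The no-network update can be illustrated with a minimal sketch. It assumes a Beta belief with Bernoulli signals, which is consistent with the description of the two parameters as running counts of favorable stories plus prior parameters, but is not stated explicitly in this excerpt. The names `theta`, `alpha0`, `beta0`, `bayes_update`, and `run_agent` are illustrative, not the paper's notation.

```python
import random


def bayes_update(alpha, beta, signal):
    """One conjugate update: a signal of 1 increments alpha, a signal of 0 increments beta."""
    return alpha + signal, beta + (1 - signal)


def run_agent(theta, horizon, alpha0=1.0, beta0=1.0, seed=0):
    """Simulate one isolated agent; return its final (alpha, beta, estimate)."""
    rng = random.Random(seed)
    alpha, beta = alpha0, beta0
    for _ in range(horizon):
        signal = 1 if rng.random() < theta else 0  # assumed Bernoulli(theta) signal
        alpha, beta = bayes_update(alpha, beta, signal)
    return alpha, beta, alpha / (alpha + beta)


if __name__ == "__main__":
    for horizon in (10, 100, 10_000):
        a, b, estimate = run_agent(theta=0.7, horizon=horizon)
        variance = a * b / ((a + b) ** 2 * (a + b + 1))  # variance of Beta(a, b)
        print(horizon, round(estimate, 3), variance)
```

Under these assumptions the printed variance shrinks as the horizon grows, which is the sense in which the belief concentrates on its mean.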
In the presence of a network, we proceed in the same manner, except the parameters are updated as follows:
[TABLE]
where . Intuitively, reads the news and calculates its favorability of the parties as before, then discusses with its neighbors to update its favorability. Mathematically, performs a Bayesian parameter update and then averages parameters. [6] uses the same update, whereas agents in [4] do Bayesian belief updates and then average beliefs. Our update also resembles the deGroot model [10], where there are no signals and estimates are averaged across neighbors. See Section 5.
Finally, we specify bot behavior. For , we set , , , and , then iteratively define via (3). More explicitly, a simple inductive proof shows
[TABLE]
In our running example, , , and means ’s prior parameters and signals are maximally biased toward Party 1. Furthermore, we can interpret as bots being “echo chambers” who only listen to themselves. Finally, note that since all bots have the same behavior, we assume (without loss of generality) that the outgoing neighbor set of is for some , i.e. in addition to its self-loop, each bot has a single outgoing neighbor from the agent set.
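The following sketch illustrates the networked process with bots. Because the update (3) is not reproduced in this excerpt, the exact form used here is an assumption: each node does the Bayesian parameter update, then mixes the result with the average of its in-neighbors' previous parameters using a hypothetical weight `eta`. Beta parameters, Bernoulli signals, the agent prior (1, 1), and the bot prior (0, 1) are likewise assumptions made for illustration.

```python
import random


def simulate(in_neighbors, bots, theta, eta, horizon, seed=0):
    """Return each node's estimate alpha / (alpha + beta) at the horizon.

    in_neighbors[i] lists the nodes whose parameters node i observes
    (repeats allowed); each bot should list only itself (its self-loop).
    """
    rng = random.Random(seed)
    nodes = list(in_neighbors)
    # Assumed priors: agents start from (1, 1); bots from (0, 1), i.e. estimate 0.
    alpha = {i: 0.0 if i in bots else 1.0 for i in nodes}
    beta = {i: 1.0 for i in nodes}
    for _ in range(horizon):
        new_alpha, new_beta = {}, {}
        for i in nodes:
            # Bots always use the signal 0; agents draw an assumed Bernoulli(theta) signal.
            s = 0 if i in bots else (1 if rng.random() < theta else 0)
            nbrs = in_neighbors[i]
            avg_a = sum(alpha[j] for j in nbrs) / len(nbrs)
            avg_b = sum(beta[j] for j in nbrs) / len(nbrs)
            # Bayesian parameter update, then averaging with neighbors' parameters.
            new_alpha[i] = (1 - eta) * (alpha[i] + s) + eta * avg_a
            new_beta[i] = (1 - eta) * (beta[i] + 1 - s) + eta * avg_b
        alpha, beta = new_alpha, new_beta
    return {i: alpha[i] / (alpha[i] + beta[i]) for i in nodes}


if __name__ == "__main__":
    # Agents 0, 1, 2 form a directed cycle; node 3 is a bot observed by agent 2.
    graph = {0: [1], 1: [2], 2: [0, 3], 3: [3]}
    for horizon in (5, 50, 5000):
        print(horizon, simulate(graph, {3}, theta=0.7, eta=0.5, horizon=horizon))
```

In this toy graph every agent is pathwise connected to the bot, so under the assumed update the agents' estimates drift toward the bot's estimate of 0 as the horizon grows.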
3 Learning outcome
To begin our analysis of the learning outcome, we show that when all agents are (pathwise) connected to bots, their beliefs converge to those of the bots. Formally, for , let
[TABLE]
denote the -Wasserstein distance for probability measures and , where means and have respective marginals and . For , let denote the Dirac measure for measurable . We then have the following (see Appendix 10 for a proof).
Proposition 1
Suppose that for any , there exists and such that , , and . Then for any and ,
Hence, for a large enough horizon, estimates and beliefs become arbitrarily close to zero. A natural follow-up question is how such a horizon scales – and in which graph parameters – for a sequence of graphs . In this section, we address this question for a particular random graph model, succinctly described as the directed configuration model (DCM) plus bots. The DCM constructs a graph with prespecified degrees, which, conditioned on being simple (i.e. having no self-loops or multi-edges), is uniformly distributed among (simple) graphs of those degrees [11, Proposition 7.15]. This is an appealing property for deGroot-like learning models such as ours, because in the deGroot model for undirected graphs, asymptotic estimates depend only on the degrees and the initial beliefs. Thus, loosely speaking, our analysis is “average-case” over relevant graphs. Furthermore, we will show the graph parameters that dictate learning for the DCM are tractable, which we exploit in Section 4 for general graphs.
Having motivated our study of the DCM, we define it in Section 3.1, present our main result for the DCM in Section 3.2, and discuss our assumptions in Section 3.3.
3.1 Graph model
To begin, we realize a sequence called the degree sequence from some distribution; here we let . In the construction described next, will have outgoing neighbors ( will be observed by other agents), incoming neighbors from the ( will observe agents), and incoming neighbors from ( will observe bots). Here the total in-degree of is (as used in (5)). We assume
[TABLE]
In words, the first condition says is observed by and observes at least one agent, and may observe one or more bots. The second condition says sum out-degree must equal sum in-degree in the agent sub-graph; this will be necessary to construct a graph with the given degrees. Finally, it will be convenient to define the degree vector of as
[TABLE]
After realizing the degree sequence, we begin the graph construction. (Footnote 3: This construction is presented more formally in Appendix 7.1.) First, we attach outgoing half-edges, incoming half-edges labeled $A$, and incoming half-edges labeled $B$, to each $i \in A$; we will refer to these half-edges as outstubs, $A$-instubs, and $B$-instubs, respectively. Let denote the set of all agents’ outstubs. We then pair each outstub in with an $A$-instub to form edges between agents in a breadth-first-search fashion that proceeds as follows:
- Sample $i^{*}$ from $A$ uniformly. For each of the $A$-instubs attached to $i^{*}$, sample an outstub uniformly from the set of all agents’ outstubs (resampling if the sampled outstub has already been paired), and connect the instub and outstub to form an edge from some agent to $i^{*}$.
- Let $A_{1}=\{i\in A\setminus\{i^{*}\}: \text{an outstub of } i \text{ was paired with an } A\text{-instub of } i^{*}\}$. For each $i \in A_{1}$, pair the $A$-instubs attached to $i$ in the same manner the $A$-instubs of $i^{*}$ were paired in the previous step.
- Continue iteratively until all $A$-instubs have been paired. In particular, each iteration pairs all $A$-instubs attached to the agents at a given geodesic distance from $i^{*}$.
The procedure above yields the standard DCM, plus unpaired $B$-instubs attached to some agents. To pair these instubs, we define $B = n + \big[\sum_{i\in A} d_{in}^{B}(i)\big]$ to be the set of bots (hence, the node set is $A \cup B = \big[n + \sum_{i\in A} d_{in}^{B}(i)\big]$). To each $i \in B$ we add a single self-loop and a single unpaired outstub (as described at the end of Section 2). This yields $\sum_{i\in A} d_{in}^{B}(i)$ unpaired outstubs attached to bots. Finally, we pair these outstubs arbitrarily with the unpaired $B$-instubs from above to form edges from bots to agents (the pairing can be arbitrary since all bots behave the same).
We note that the pairing of -instubs with outstubs from did not prohibit multi-edges, so the set of edges formed will in general be a multi-set. For this reason, we replace the summation in the update (3) with
[TABLE]
and analogously for the update, i.e. we weigh the parameters of ’s neighbors proportional to the number of edges pointing to . We also note that if , the construction above reduces to the standard DCM.
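The construction in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the formal construction of Appendix 7.1: agents are numbered from 0, the hypothetical inputs `d_out`, `d_in_a`, and `d_in_b` hold each agent's out-degree, agent in-degree, and bot in-degree, and the breadth-first pairing restarts from a fresh agent if some agents are never reached from the first root (a choice made here only so the sketch always terminates with every instub paired).

```python
import random
from collections import deque


def dcm_with_bots(d_out, d_in_a, d_in_b, seed=0):
    """Return (edges, num_bots); edges is a list of (source, target) pairs, repeats allowed."""
    n = len(d_out)
    if sum(d_out) != sum(d_in_a):
        raise ValueError("agent out-degrees and agent in-degrees must have equal sums")
    rng = random.Random(seed)
    # One entry per outstub; popping from a shuffled list draws uniformly
    # among the outstubs that are still unpaired.
    outstubs = [i for i in range(n) for _ in range(d_out[i])]
    rng.shuffle(outstubs)
    edges = []
    visited = [False] * n
    order = list(range(n))
    rng.shuffle(order)  # order[0] plays the role of the uniformly sampled root
    for root in order:
        if visited[root]:
            continue
        visited[root] = True
        queue = deque([root])
        while queue:
            i = queue.popleft()
            for _ in range(d_in_a[i]):
                src = outstubs.pop()
                edges.append((src, i))  # edge from agent src to agent i
                if not visited[src]:
                    visited[src] = True
                    queue.append(src)
    # Bots: one per bot instub, each with a self-loop and one edge into an agent.
    bot = n
    for i in range(n):
        for _ in range(d_in_b[i]):
            edges.append((bot, bot))
            edges.append((bot, i))
            bot += 1
    return edges, bot - n


if __name__ == "__main__":
    print(dcm_with_bots([2, 1, 1, 2], [1, 2, 2, 1], [0, 1, 0, 2], seed=3))
```

The returned edge list is a multi-set: self-loops and repeated edges among agents can occur, matching the remark above that multi-edges are not prohibited.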
Our results will require assumptions on the degree sequence , where (we recall) is the degree vector of (see (9)). First, we define by
[TABLE]
In words, and are the degree distributions of agents sampled uniformly and sampled proportional to out-degree, respectively. Note that, since the first agent added to the graph is sampled uniformly from , the degrees of are distributed as . Furthermore, recall that, to pair -instubs, we sample outstubs uniformly from , resampling if the sampled outstub is already paired. It follows that, each time we add a new agent to the graph (besides ), its degrees are distributed as . We also note that, because the degree sequence is random, these distributions are random as well. From these random distributions, we define the random variables
[TABLE]
Following the discussion above, is the expected value (conditioned on the degree sequence) of the ratio of -instubs to total instubs for ; is the expected value of this same ratio, but for new agents added to the graph. The interpretation of is similar. At the end of Section 3.2, we discuss in more detail why these random variables arise in our analysis.
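As a rough illustration of the two averaged ratios described above, the sketch below computes the ratio of agent instubs to total instubs, averaged once uniformly over agents (the root's degree distribution) and once with weights proportional to out-degree (the distribution for agents added later). The function and argument names are hypothetical, and the code is only a reading of the verbal description, since the defining equation is not reproduced in this excerpt.

```python
def agent_instub_ratios(d_out, d_in_a, d_in_b):
    """Return (uniform_avg, outdeg_weighted_avg) of d_in_a / (d_in_a + d_in_b)."""
    n = len(d_out)
    # Each agent is assumed to observe at least one agent, so the ratio is well defined.
    ratios = [a / (a + b) for a, b in zip(d_in_a, d_in_b)]
    uniform_avg = sum(ratios) / n
    total_out = sum(d_out)
    weighted_avg = sum(w * r for w, r in zip(d_out, ratios)) / total_out
    return uniform_avg, weighted_avg


if __name__ == "__main__":
    # One bot attached to the most influential agent (out-degree 10).
    print(agent_instub_ratios([10, 1, 1], [1, 1, 1], [1, 0, 0]))
```

In this toy input the out-degree-weighted average is lower than the uniform one, because the single bot is attached to the agent that is sampled most often when pairing stubs.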
We now state four assumptions, which we discuss in detail in Section 3.3. Two of these require the degree sequence to be well-behaved (with high probability) – specifically, A1 requires certain moments of the degree sequence to be finite, while A3 requires to be close to a deterministic sequence . The other assumptions, A2 and A4, impose maximum and minimum rates of growth for the learning horizon . In particular, must be finite for each finite but grow to infinity with .
- A1: , where, for some independent of such that , (Footnote 4: The assumption only eliminates the trivial case of a line graph; see Section 3.3 for details.)
[TABLE]
- A2: and independent of s.t. .
- A3: , where, for some s.t. , some , and some independent of ,
[TABLE]
- A4: .
3.2 Main result
We can now present Theorem 1. The theorem states that the estimate at time of a uniformly random agent converges in probability as . As discussed in the introduction, the limit depends on the relative asymptotics of the time horizon and the quantity defined in A3. For example, this limit is when ; note that requires to quickly approach 1 (since by A4), which by A3 and (13) suggests the number of bots is small. Hence, learns the true state when there are sufficiently few bots. (The other cases can be interpreted similarly.)
Theorem 1
Assume that is the DCM and that A1, A2, A3, and A4 hold. Then for uniformly,
[TABLE]
Before discussing the proof, we make several observations:
- Suppose is fixed and consider varying . To be concrete, let and define and (note satisfy A2, A4). Then and , so by Theorem 1, the estimate of converges to at time and to 0 at time . In words, initially (at time ) learns the state of the world, then later (at time ) forgets it and adopts the bot estimates.
- Alternatively, suppose is fixed and consider varying . For example, let for some . Here smaller implies fewer bots, and Theorem 1 says the limiting estimate of is a decreasing convex function of . One interpretation is that, if an adversary deploys bots in hopes of driving agent estimates to 0, the marginal benefit of deploying additional bots is smaller when is larger, i.e. the adversary experiences “diminishing returns”. It is also worth noting that, since as and as , the limiting estimate of is continuous as a function of .
- If , consider the limiting estimate of as a function of . By Theorem 1, this estimate tends to as and tends to as . This is expected from (3): when , agents ignore the network (and thus avoid exposure to biased bot beliefs) and form estimates based only on unbiased signals; when , the opposite is true.
- If , we must have (since by A4), and the estimate of tends to 0 by Theorem 1. Loosely speaking, this says that a necessary condition for learning is that the bots vanish asymptotically (in the sense that ).
- In fact, in the case , a stronger result holds: the set of agents for which vanishes relative to . See Appendix 6 for details.
The proof of Theorem 1 is lengthy and deferred to Appendices 7 and 9, where Appendix 7 lays out the structure of the proof. However, we next present a short argument to illustrate the fundamental reason why the three cases of the limiting estimate arise in Theorem 1.
At a high level, these three cases arise as follows. First, when , the “density” of bots within the -step incoming neighborhood of is small. As a consequence, is not exposed to the biased beliefs of bots by time and is able to learn the true state (). On the other hand, when , this “density” is large; is exposed to bot beliefs and thus adopts them. Finally, when , the “density” is moderate; does not fully learn, nor does fully adopt bot beliefs.
This explanation is not at all surprising; what is more subtle is what precisely “density of bots within the -step incoming neighborhood of ” means. It turns out that the relevant quantity is the probability that a random walker exploring this neighborhood reaches the set of bots. To illustrate this, consider a random walk that begins at and, for , chooses uniformly from all incoming neighbors of (agents and bots); note here that the walk follows edges in the direction opposite to their polarity in the graph. For this walk, it is easy to see that, conditioned on the event , the event occurs with probability
[TABLE]
Crucially, we sample this walk and construct the graph simultaneously, by choosing which instub of to follow before actually pairing these instubs. Assuming they are later paired with agent outstubs chosen uniformly at random, and hence connected to agents chosen proportional to out-degree, we can average (21) over the out-degree distribution to obtain that occurs with probability
[TABLE]
Now since bots have a self-loop and no other incoming edges, they are absorbing states on this walk. It follows that if and only if ; by the argument above, this latter event occurs with probability . Since by A3, we thus obtain that with probability
[TABLE]
From this final expression, Theorem 1 emerges: when , the random walker remains in the agent set with probability ; this corresponds to avoiding exposure to bot beliefs and learning the true state. Similarly, means the walker is absorbed into the bot set with probability , corresponding to adopting bot estimates. Finally, means the walker stays in the agent set with probability , corresponding to not fully learning nor fully adopting bot estimates.
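The three regimes can be checked numerically with a small sketch. It assumes the heuristic above in its simplest form: a walker stays in the agent set with the same probability `p` at every step, independently, so it survives `horizon` steps with probability `p ** horizon`. The names and numbers are illustrative and ignore the growth conditions that A2 and A4 place on the horizon.

```python
import math


def stay_probability(p, horizon):
    """Probability that a walk staying among agents w.p. p per step survives `horizon` steps."""
    return p ** horizon


if __name__ == "__main__":
    p = 0.999
    for horizon in (1, 1000, 100_000):
        k = horizon * (1 - p)  # small, moderate, and large in the three runs
        # p ** horizon is close to exp(-k) when p is close to 1.
        print(horizon, k, stay_probability(p, horizon), math.exp(-k))
```

When `horizon * (1 - p)` is small the walker almost surely avoids the bots, when it is large the walker is almost surely absorbed, and in between the survival probability is close to `exp(-horizon * (1 - p))`.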
We note that the actual proof of Theorem 1 does not precisely follow the foregoing argument. Instead, we locally approximate the graph construction with a certain branching process; we then study random walks on the tree resulting from this branching process. (Footnote 5: This is necessary because the argument leading to (22) assumes instubs are paired with outstubs chosen uniformly at random, which is not true if resampling of outstubs occurs in the construction from Section 3.1.) However, the foregoing argument illustrates the basic reason why the three distinct cases of Theorem 1 arise. We also observe that the argument leading to (22) shows why enters into our analysis. The other random variables defined in (13) enter similarly. Specifically, arises in almost the same manner, but pertains only to the first step of the walk; this distinction arises since the walk starts at , the degrees of which relate to . On the other hand, arises when we analyze the variance of agent estimates. This is because analyzing the variance involves studying two random walks; by an argument similar to (22), the probability of both walks visiting the same agent is
[TABLE]
Finally, we note that the proof of Theorem 1 reveals that the variance of each agent’s belief vanishes, so beliefs converge to Dirac measures. Combined with the theorem, this yields the following corollary. See Appendix 10 for a proof.
Corollary 1
Assume is the DCM and A1, A2, A3, and A4 hold. Let denote the limit (in probability) of from Theorem 1. Then for any and for uniformly, .
3.3 Comments on assumptions
We now return to comment on the assumptions needed to prove our results. First, A1 states that certain empirical moments of the degree distribution – namely, for uniformly, the first two moments of and the correlation between and – converge to finite limits. Roughly speaking, this says our graph lies in a sparse regime, where typical node degrees do not grow with the number of nodes. (Footnote 6: This is analogous to e.g. an Erdős-Rényi model with edge probability for constant , where degrees converge to random variables.) We also note in A1 is minor and simply eliminates an uninteresting case. To see this, first note that when holds, we have (roughly)
[TABLE]
where we have used the assumed inequality . Hence, cannot occur, so assuming only prohibits . This remaining case is uninteresting because is the limiting number of offspring for each node in the branching process we analyze; thus, if , the tree resulting from this process is simply a line graph.
Next, A2 states . Together with A1, these assumptions are standard given our analysis approach, which, as discussed previously, locally approximates the graph construction with a branching process. We also note that, with the interpretation of above, it follows that the number of agents within the -step neighborhood of is roughly
[TABLE]
In words, the size of the aforementioned neighborhood vanishes relative to . This is why our title refers to the learning as “local”: only a vanishing fraction of other agents (those within this neighborhood) affect the estimate of .
The remaining statements are needed to establish estimate convergence on the tree resulting from the branching process. A4 states with , which is an obvious requirement for convergence. A3 essentially says that three events occur with high probability. First, should be close to a convergent, deterministic sequence ; this is necessary since the asymptotics of play a prominent role in Theorem 1. Second, essentially says that bots prefer to attach to agents with higher out-degrees, i.e. more influential agents; this is the behavior one would intuitively expect from bots aiming to disrupt learning. Third, is satisfied if, for example, all agents have total in-degree at least two.
Finally, while we focused on the DCM in this section, our analytical approach is more general. At a high level, the key properties of the DCM we used are that most nodes’ -step neighborhoods are treelike and “statistically similar,” which allows for a branching process coupling. Such couplings exist more generally, though this scaling will be smaller for denser graphs, which makes smaller as well.
4 Adversarial setting
We next formalize the adversarial problem introduced in Section 1. We begin with some notation. Let , and (with slight abuse of notation to the previous section), define the function by
[TABLE]
which is simply , as defined in (13), viewed as a function of the bot in-degrees. (Footnote 7: We suppress the sub- and super-scripts to avoid cumbersome notation.) Given a budget , the adversary’s problem is then as follows:
[TABLE]
Thus, the adversary’s objective function only depends on the agent degrees (e.g. numbers of followers and followees on Twitter), and not the topology of the agent sub-graph. Consequently, the topology will play no role in this section, i.e. we do not require the DCM assumption. We reiterate that, by Theorem 1, solving (28) is equivalent to minimizing estimates asymptotically for the DCM. (Footnote 8: More precisely, this only holds if the solution of (28) converges in the sense of A3. We are unsure if this holds, but we view it as a minor technical point and leave it as an open problem.) For general graph topologies, we treat (28) as a nonrigorous but tractable surrogate for estimate minimization, and we will soon show empirically that this is a reasonable choice.
4.1 Exact solution
First, we let and rewrite (28) as , where
[TABLE]
In words, we incorporated the constraints from (28) into the objective; we also used the (obvious) fact that the solution of (28) satisfies the budget constraint with equality. The new objective satisfies a certain discrete convexity property, which implies that minimizes if and only if for any pair. Hence, we can find the minimizer by iteratively replacing with until the objective stops decreasing. This approach is known as steepest descent [12, Section 10.1.1] and is provided in Algorithm 1. In Appendix 8.5, we show its runtime is in the best case and in the general case.
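The single-move descent described above can be sketched as follows. Algorithm 1 itself is not reproduced in this excerpt, so this is an illustration rather than the authors' procedure. It assumes the objective is the out-degree-weighted average of the ratio of agent in-degree to total in-degree, viewed as a function of the bot in-degrees `d`, and it uses the fact that this objective is a sum of per-agent terms to find the best single move without scanning all pairs. All names are hypothetical.

```python
from fractions import Fraction


def objective(d, d_out, d_in_a):
    """Out-degree-weighted average of d_in_a / (d_in_a + d) (the assumed objective)."""
    total_out = sum(d_out)
    return sum(Fraction(w * a, (a + x) * total_out) for w, a, x in zip(d_out, d_in_a, d))


def steepest_descent(d_out, d_in_a, budget):
    """Move one bot at a time (d -> d + e_i - e_j) while the objective decreases."""
    n = len(d_out)
    d = [0] * n
    d[0] = budget  # any feasible starting point works

    def term(k, x):
        # Unnormalized contribution of agent k when x bots are attached to it.
        return Fraction(d_out[k] * d_in_a[k], d_in_a[k] + x)

    while True:
        donors = [k for k in range(n) if d[k] > 0]
        if not donors:
            return d
        # Best agent to receive a bot, and cheapest agent to take one from.
        i = max(range(n), key=lambda k: term(k, d[k]) - term(k, d[k] + 1))
        j = min(donors, key=lambda k: term(k, d[k] - 1) - term(k, d[k]))
        gain = term(i, d[i]) - term(i, d[i] + 1)
        loss = term(j, d[j] - 1) - term(j, d[j])
        if gain <= loss:
            return d  # no single move improves the objective
        d[i] += 1
        d[j] -= 1


if __name__ == "__main__":
    d = steepest_descent([5, 1, 3], [1, 2, 1], budget=4)
    print(d, objective(d, [5, 1, 3], [1, 2, 1]))
```

Exact rational arithmetic via `Fraction` is used so that the stopping test is not affected by floating-point rounding; a large-scale implementation would likely use floats and priority queues instead.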
4.2 Approximation algorithm
Algorithm 1’s runtime is prohibitive for massive networks like Twitter, which motivates our approximation scheme. The idea is to first solve the relaxed problem
[TABLE]
and then to sample bot locations in proportion to the relaxed solution. More formally, our approximate solution is constructed via Algorithm 2. We note that by definition, the budget constraint holds with equality for Algorithm 2. Also, as shown in Appendix 8.1, the solution of (30) is
[TABLE]
where , , , and
[TABLE]
This randomized scheme yields useful insights, in contrast to the optimal algorithm. In particular, the randomized and relaxed solutions and are equal in expectation, and the relaxed solution satisfies some intuitive properties:
- grows with , i.e. the adversary targets agents with large and small under the relaxed solution. Here large means is influential (e.g. has many Twitter followers), while small means is susceptible to influence (e.g. has few Twitter followees, so bot tweets will appear prominently in ’s Twitter feed).
- If , then . Hence, if is sufficiently non-influential, and/or sufficiently non-susceptible, targeting gives no value to the adversary.
- If , the relaxed solution yields
[TABLE]
This can be interpreted as follows: the adversary strives for a similar proportion of fake news in the feeds of users with similar ratios of influence to susceptibility.
In short, our approximate solution strives to balance influence and susceptibility. While somewhat intuitive, the precise manner in which this balance occurs (in particular, the form of (31)-(32)) is far from obvious.
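The sampling step can be illustrated with a short sketch. It assumes, as a reading of the description above and not as a transcription of Algorithm 2, that each unit of budget is placed independently at an agent chosen with probability proportional to the relaxed solution. The vector `relaxed` and the names below are hypothetical.

```python
import random
from collections import Counter

def sample_bot_locations(relaxed, budget, rng):
    """Place `budget` bots i.i.d., each at agent i with probability proportional to relaxed[i]."""
    picks = rng.choices(range(len(relaxed)), weights=relaxed, k=budget)
    counts = Counter(picks)
    return [counts[i] for i in range(len(relaxed))]

rng = random.Random(0)
relaxed = [2.5, 0.0, 1.0, 0.5]  # hypothetical relaxed solution; entries sum to the budget
budget = 4
trials = 20000

# Average the integer allocations over many independent draws.
totals = [0] * len(relaxed)
for _ in range(trials):
    for i, c in enumerate(sample_bot_locations(relaxed, budget, rng)):
        totals[i] += c
empirical_mean = [t / trials for t in totals]
```

Under this assumption each draw uses exactly `budget` bots, an agent with a zero entry in the relaxed solution is never targeted, and `empirical_mean` approaches `relaxed` as the number of trials grows, matching the statement that the randomized and relaxed solutions agree in expectation.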
In Appendix 8.5, we show Algorithm 2 has complexity . In terms of accuracy, we next prove that with high probability, Algorithm 2 is a -approximation algorithm for the constrained problem , which is equivalent to (28). More precisely, letting be any solution of (28), i.e.
[TABLE]
we have the following result.
Theorem 2
Let and . Then
[TABLE]
Proof 4.3.
As mentioned above, Appendix 8.1 shows (31) solves (30) (the proof amounts to verifying KKT conditions, see e.g. [13, Section 5.5.3]). Hence, by definition (34),
[TABLE]
We next rewrite in terms of the random vector from Algorithm 2. Toward this end, let , and for define
[TABLE]
Then a simple calculation yields
[TABLE]
and using Jensen’s inequality, one can show
[TABLE]
(see Appendix 8.2 for details). Combining (37)-(40),
[TABLE]
Also, using (40) and recalling , we have
[TABLE]
By the previous two lines, the following implies the theorem:
[TABLE]
Such an inequality would follow from a simple Hoeffding bound if was simply ; however, is a much more complicated function. Fortunately, belongs to a special class called self-bounding functions [14, Section 3.3], for which concentration inequalities of the form (43) are known. See Appendix 8.3 for details.
The tail bound in Theorem 2 is opaque, as it relies on , which (in general) is difficult to interpret. Under certain assumptions, we can obtain more transparent results. For example, we have the following corollary.
Corollary 4.4.
Let as above. Assume and for some independent of ,
[TABLE]
Then s.t. and
[TABLE]
Proof 4.5.
Since solves (30), we can weaken the bound in Theorem 2 by replacing with for any with . Thus, the proof chooses a particular that leads to a more tractable bound, and the assumptions ensure this bound vanishes. See Appendix 8.4 for details.
In words, the corollary shows our randomized scheme is (asymptotically) a -approximation algorithm with probability tending to . The assumption (44) only precludes the case where only finitely many of the degree ratios are comparable to the maximum . This restriction arises because our self-bounding concentration analysis in Theorem 2 requires normalization by (see Appendix 8.3).
4.3 Empirical results
A fundamental assumption in our adversary solutions is that and are correlated, in the sense that minimizing also minimizes . While Theorem 1 states this correlation holds for the random graph model of Section 3.1, it is unclear if this correlation occurs in practice. To conclude this section, we present empirical results suggesting that this indeed occurs. In our experiments, we compare our proposed solutions against some natural heuristics:
- A naive baseline, which uses Algorithm 2 but samples each uniformly from .
- Three schemes which similarly use Algorithm 2, along with the observed degrees: sampling proportional to (i.e. targeting influential nodes), (i.e. targeting susceptible nodes), and (i.e. naively balancing the two).
- Sampling proportional to [15] (in experiments, we compute the first summands, which guarantees an error bound of ), where
[TABLE]
where , is the length- ones vector, and is the agent sub-graph’s column-normalized adjacency matrix, i.e. the matrix with -th element
[TABLE]
PageRank is a commonly-used measure of influence or centrality for graphs in many domains [16] (and a richer such measure than ).
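As an illustration of computing PageRank from finitely many summands, the sketch below assumes the common power-series form, in which the score vector is (1 - alpha)/n times the sum over k of alpha^k P^k applied to the ones vector, with P column-stochastic. This form, the 3-node matrix, and the parameter values are assumptions for illustration; the exact expression and constants used in the experiments are not reproduced here.

```python
def truncated_pagerank(P, alpha, num_terms):
    """Sum the first `num_terms` terms of (1 - alpha)/n * sum_k alpha^k P^k 1.

    P[i][j] is the normalized weight from node j to node i, so each column sums to 1.
    """
    n = len(P)
    term = [1.0] * n            # P^0 applied to the ones vector
    total = [0.0] * n
    weight = (1.0 - alpha) / n  # coefficient of the k = 0 term
    for _ in range(num_terms):
        for i in range(n):
            total[i] += weight * term[i]
        # Advance to the next power: term <- P @ term.
        term = [sum(P[i][j] * term[j] for j in range(n)) for i in range(n)]
        weight *= alpha
    return total

# Hypothetical 3-node graph; column j holds the normalized weights leaving node j.
P = [[0.0, 0.5, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
alpha, K = 0.85, 50
scores = truncated_pagerank(P, alpha, K)
```

Because a column-stochastic matrix preserves the sum of a nonnegative vector, the truncated scores sum to 1 - alpha^K under this assumed form. The mass lost to truncation is therefore alpha^K, which shrinks geometrically in the number of summands.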
We compare our proposed solutions with these heuristics using four datasets from [17], described in Table 1. We chose these datasets so we could test our proposed solutions on real social networks of two scales: Gnutella and Wiki-Vote have , a scale at which the exact solution Algorithm 1 is feasible; Pokec and LiveJournal have , a scale that renders Algorithm 1 infeasible but that more closely resembles social networks of interest. For the experiments, we set (to maximize signal variance), (to emphasize the effect of the network), and (to ensure the code had reasonable runtime). We let , so that 0.25% of all agent in-edges are connected to bots. For each graph and each of five experimental trials, we chose as described above, added bots to the original graph accordingly, and simulated the learning process from Section 2.
In Figure 1, we plot the mean and standard deviation (across experimental trials) of as a function of . For all datasets, our proposed solutions outperform all heuristics, in the sense that our solutions yield the lowest average for most values of . Furthermore, we note the following:
- Across all graphs, our solutions outperform for all values of tested. This is quite surprising, because PageRank uses the entire graph topology, whereas our solutions only use degree information. Also, as becomes increasingly smaller, performs increasingly better, but this comes at the cost of higher runtime to estimate .
- Among the heuristics using (at most) degree information, performs best – but still worse than Algorithm 2 – across all datasets. Put differently, naively balancing influence and susceptibility is not enough; the non-obvious form of Algorithm 2 yields better performance.
- For Gnutella and Wiki-Vote, Algorithm 1 noticeably outperforms Algorithm 2. Though the former is an exact solution and the latter is an approximation, this is still surprising, since it is unclear that these schemes are even optimizing the correct objective for real graphs.
While Figure 1 only considers one choice of , we believe our conclusions are robust. In particular, we also tested the cases for each , so that between and of edges connected to bots (thus, Figure 1 shows the intermediate case ). Appendix 8.6 contains a figure analogous to Figure 1 for the other choices of ; the plots are qualitatively similar.
We have thus far shown that our solutions outperform heuristics, even those using graph topology. This is quite surprising: our solutions were derived under the fundamental assumption that minimizing amounts to minimizing , but we only verified this assumption asymptotically for a class of random graphs. Thus, our empirical results suggest that even for real social networks, this assumption holds. Indeed, in Figure 2 we show scatter plots of against (each dot represents one experimental trial). For all datasets, the two quantities are closely correlated.
5 Related work
As discussed in Section 2, (3) resembles the non-Bayesian social learning model from [4], which uses belief update
[TABLE]
where , is a signal, and BU means Bayesian update. Hence, agents perform Bayesian updates and then average, in terms of beliefs in [4] but in terms of parameters in this work. The main advantage of the latter is that beliefs remain Beta distributions, which simplifies our analysis. This simplification, along with weights instead of (48), is needed since we consider a finite horizon and a graph which need not be connected, in contrast to [4]. Another distinction is that agents in [4] cannot learn the true state individually, and need the network for learning. In contrast, agents in our work can learn in isolation (simply by averaging their signals), so the network can either speed up learning or be a detriment. We highlight the detriment here, which is relevant to platforms like Twitter, where users could have read accurate news in isolation but instead risk exposure to bots.
Our parameter update is also studied in [6], which features bots defined in a slightly different manner but in the same spirit. However, [6] only includes theoretical results in the case ; the case is studied empirically. This allowed [6] to use a slightly richer model, including a time-varying graph and agent-dependent mixture parameters . Notably, the empirical results from [6] fix a learning horizon and do not investigate the effects of different timescales; in particular, the delicate relationship between timescale and bot prevalence from Theorem 1 is not brought to light. Beyond stubborn agents, [18, 19] propose different non-Bayesian updates to cope with Byzantine agents with arbitrary behavior.
From an analytical perspective, our approach of analyzing estimates by studying random walks is similar to the deGroot model [10]. Here the estimate vector is updated as for some column-stochastic matrix . Hence, , so ’s belief is determined by the distribution of a -step random walk from . This observation has been exploited in the literature; see the surveys [20, Section 3] and [21, Section 4], and the references therein. For example, assuming is irreducible and aperiodic, and therefore has a well-defined stationary distribution , [7] establishes conditions for learning using the fact that when is large. Roughly speaking, our model combines deGroot-like averaging with exogenous unbiased signals. As discussed, the averaging in our case exposes agents to biased beliefs (due to bots); the resulting tension between biased and unbiased information is a key feature in our model not present in deGroot’s. Ours is arguably a richer model of platforms like Twitter, where there is a similar tension between legitimate news and bots. Beyond the deGroot model, agents in [22] perform Bayesian updates using the prior of a randomly-chosen neighbor, which yields a different connection to random walks; assuming strong connectedness, the authors exploit the fact that the walk visits every agent infinitely often (i.o.) to derive conditions for learning.
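The deGroot-style averaging discussed above can be illustrated with a small sketch. The convention used here, in which agent i averages the estimates of agents j with weights taken from column i of a column-stochastic matrix, is an assumption chosen to match the random-walk interpretation; the 3-agent matrix and initial estimates are hypothetical.

```python
def degroot_step(P, theta):
    """One averaging round: agent i replaces its estimate by sum_j P[j][i] * theta[j].

    Column i of P sums to 1, so each new estimate is a convex combination of old ones.
    """
    n = len(theta)
    return [sum(P[j][i] * theta[j] for j in range(n)) for i in range(n)]

# Hypothetical column-stochastic weights for 3 agents on a path with self-weights.
P = [[0.5, 0.25, 0.0],
     [0.5, 0.5, 0.5],
     [0.0, 0.25, 0.5]]
theta = [1.0, 0.0, 0.0]  # initial estimates

for _ in range(200):
    theta = degroot_step(P, theta)
```

Under this convention, the estimate of agent i after t rounds is the expected initial estimate at the position of a t-step walk started from i. For this example the walk is irreducible and aperiodic with stationary distribution (1/4, 1/2, 1/4), so all three estimates approach the stationary-weighted average 0.25 of the initial estimates.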
Similar to [4], the papers of the previous paragraph typically assume strong connectedness and long learning horizons so as to leverage properties such as stationary distributions and i.o. visits. This is a fundamental distinction from our work. Indeed, even if we disregard stubborn agents, the random walk converges to a stationary distribution, but it does not converge within our local learning horizon. This is because, as shown in [23], the DCM we consider has mixing time that exceeds
[TABLE]
where we used Jensen’s inequality and (25). The right side exceeds by A2, i.e. our learning horizon occurs before the underlying random walk mixes. In fact, [23] shows that the random walk on the DCM exhibits cutoff, meaning that the -step distribution of this walk can be maximally far from the stationary distribution (i.e. the total variation distance between these distributions can be 1 for certain starting locations of the walk). Hence, not only can we not use this stationary distribution, we cannot even use an approximation of it. Again, this means our analysis cannot leverage global properties typically used when relating estimates to random walks. We circumvent this using the DCM, which has a well-behaved local structure. We also note that our idea to simultaneously construct the graph and sample the walk is taken from [23].
Some other works have considered social learning with stubborn agents. For example, [8] studies a model in which agents meet and either retain their own estimates, adopt the average of their estimates, or adopt a weighted average; the agent whose estimate has a larger weight is called a “forceful” agent. Here the authors show that all agent estimates converge to a common random variable and study its deviation from the true state. A crucial difference between this work and ours is that [8] assumes even forceful agents occasionally observe other agents’ opinions. This yields an underlying Markov chain that is irreducible (unlike ours); the analysis then relies on this chain having a well-defined stationary distribution.
Stubborn agents have also been considered in the consensus setting [24], which asks whether agent estimates converge to a common value, i.e. a consensus. For example, [25] considers a model in which regular agents adopt weighted averages of estimates upon meeting other agents, while stubborn agents always retain their own estimates. This intuitively prohibits a consensus from forming; indeed, it is shown that agent estimates fail to converge, i.e. disagreement can persist indefinitely. Another example is [26], in which an agent’s estimate at time is a weighted average of their own estimate at time 0 and their neighbors’ estimates at time . In this model, stubborn agents place all weight on their own estimate from time 0 and thus do not update their estimates. The analysis in [26] is similar to ours as it relates agent estimates to hitting probabilities of the stubborn agent set, but it differs as the learning horizon is infinite in [26]. Also in the consensus setting, [27] investigates protocols for robust consensus that may lessen the undesirable effects of stubborn agents.
The problem of deploying stubborn agents is studied in [28, 29], though for the voter model. Both assume knowledge of a matrix describing the graph topology (like from Section 4.3), and the optimization requires inverting this matrix at complexity . Our algorithms overcome both of these issues. We also note this inversion is common in more general influence maximization settings.
Without stubborn agents, [30] considers a non-Bayesian update for infinite horizons, where agents treat neighbors’ beliefs as independent. Convergence rates are provided in [9, 31, 32] for (3) or similar Bayesian-plus-aggregation updates. An open question is how these models behave with stubborn agents, particularly for [9, 31, 32], where the convergence may be slower than the propagation of stubborn agent bias.
6 Special case
While Theorem 1 establishes convergence for the estimate of a typical agent, a natural question to ask is how many agents have convergent estimates. Our second result, Theorem 6.6, provides a partial answer to this question. To prove the result, we require slightly stronger assumptions than those required for Theorem 1 (we will return shortly to comment on why these are needed). First, we strengthen A1 and A3 to include particular rates of convergence for the probabilities . Second, we strengthen A4 with a minimum rate at which (specifically, ). Third, and perhaps most restrictively, we require in A1. As a result, Theorem 6.6 only applies to the case , for which Theorem 1 states the estimate of a uniform agent converges to zero. In this setting, Theorem 6.6 provides an upper bound on how many agents’ estimates do not converge to zero. In particular, this bound is for some .
Theorem 6.6.
Assume and independent of s.t. the following hold:
- A1, with .
- A2.
- A3, with and .
- A4, with .
Then for any , , and , all independent of ,
[TABLE]
We reiterate that by A2 and by the theorem statement. Hence, , so one can choose in Theorem 6.6 to show that the size of the non-convergent set of agents vanishes relative to . We suspect that such a result is the best one could hope for; in particular, we suspect that showing all agent estimates converge to zero is impossible. This is in part because our assumptions do not preclude the graph from being disconnected. Hence, there may be small connected components composed of agents but no bots; in such components, agent estimates will converge to (not zero). Additionally, while the lower bound for in Theorem 6.6 is somewhat unwieldy, certain terms are easily interpretable: the bound sharpens as grows (i.e. as agents place less weight on their unbiased signals), as decays (i.e. as the number of bots grows), and as decays (i.e. as signals are more likely to be zero, pushing estimates to zero).
As for Theorem 1, the proof of Theorem 6.6 is outlined in Appendix 7 with details provided in Appendix 9. The crux of the proof involves obtaining a sufficiently fast rate for the convergence in Theorem 1; namely, we show that for some , . (One may wonder why we derive a separate bound for Theorem 6.6, since we have already bounded to prove Theorem 1. The reason for this is that the bound for Theorem 1 does not decay quickly enough as to prove Theorem 6.6; on the other hand, the bound for Theorem 6.6 does not decay at all as for the case and therefore cannot be used for all cases of Theorem 1. See Appendix 7.4.2 for details.) At a high level, obtaining such a bound requires bounding three probabilities by , which also helps explain the stronger assumptions of Theorem 6.6:
- As for Theorem 1, we first locally approximate the graph construction with a branching process so as to analyze the estimates on a tree. Here strengthening A1 with is necessary to ensure this approximation fails with probability at most .
- To analyze the estimates on a tree, we first condition on the random tree structure and treat the estimate as a weighted sum of i.i.d. signals using an approach similar to Hoeffding’s inequality. Namely, we obtain the Hoeffding-like tail ; strengthening A4 with is necessary to show this tail is .
- Finally, after conditioning on the tree structure, we show this structure is close to its mean. More specifically, letting denote the expected estimate for the root node in the tree conditioned on the random tree structure (see Appendix 7 for details), we show
[TABLE]
Note the only source of randomness in is the random tree; because this tree is recursively generated, it has a martingale-like structure that can be analyzed using an approach similar to the Azuma-Hoeffding inequality for bounded-difference martingales. Here we require to ensure the degree sequence is ill-behaved with probability at most ; we also require in this step (and only in this step).
We now address the most notable difference between Theorems 1 and 6.6; namely, that the latter only applies when . We believe this reflects a fundamental distinction between the cases and and is not an artifact of our analysis. An intuitive reason for this is that more bots are present in the former case, so fewer random signals are present (recall we model bot signals as being deterministically zero). As a result, is “less random”, so its concentration around its mean is stronger. Towards a more rigorous explanation, we first note that Appendix 7.4.1 provides the following condition for extending Theorem 6.6 to other cases of :
[TABLE]
where is the limit from Theorem 1 based on the relative asymptotics of and , i.e.
[TABLE]
It is the convergence of in (52) that we suspect is fundamentally different in the cases and . To illustrate this, we provide empirical results in Figure 3. In the leftmost plot, we show versus ; here the plot is on a log-log scale, so a line with slope means . Hence, we are comparing four cases: , so that (blue circles); , so that and (orange squares); , so that (yellow diamonds); and , so that (purple triangles). The second plot reflects the corresponding cases of : decays to zero in the first two cases, grows towards in the fourth case, and approaches an intermediate limit in the third case. The final two plots illustrate the convergence (or lack thereof) in (52). Here the empirical mean of the error term decays quickly for the first case but decays more slowly (or is even non-monotonic) in the other cases. More strikingly, the empirical variance of this error term is several orders of magnitude smaller in the first case. This suggests that decays much more rapidly in the case , which is why we believe this is the only case for which (52) is satisfied.
In addition to the summary statistics shown in Figure 3, we also show histograms of the error term across the 400 trials in Figure 4. As discussed above, this term must converge to zero (in probability) at a sufficiently fast rate to prove Theorem 6.6. In Figure 4, these histograms appear to converge quickly to a point mass at zero in the case ; in other cases, such behavior does not occur, further suggesting a fundamental difference between the cases.
We note that the basic workflow of the experiment above was as follows:
- Choose a sequence of time horizons that increase linearly, then set accordingly.
- Realize the degrees after selecting .
- Define the empirical distributions using the degrees as in (11).
- Evaluate the quantity of interest empirically via (87) using .
We repeated this experiment 400 times to obtain 400 samples of ; the plots in Figure 3 show empirical means and variances across these 400 samples. We used the following parameters:
- We set to emphasize the effect of the network.
- We let , so that ; we choose independent of so that , as required by A1. In particular, we choose .
- After realizing , we assign one outgoing edge to each , then assign each of the remaining outgoing edges independently and uniformly at random. Note that this implies and , as required by (7).
- We let , with , so that
[TABLE]
(This is not precisely what we desire, since A3 assumes for sampled proportional to out-degree; however, as shown in the second plot in Figure 3, this empirically yields distinct rates of convergence for .)
- We compare four cases of : and for , with and independent of . Note that the three latter cases satisfy
[TABLE]
as shown in Figure 3. Here and were chosen via trial-and-error so that all four cases behaved roughly the same at the smallest value of (as in Figure 3). In particular, we chose
[TABLE]
- We let ; here the minimum of 2 was chosen since is a trivial case and the maximum of 11 was chosen due to computational limitations.
- Given , we let . Note that this implies , roughly the upper bound in A2. With our choice of and , ranged from 20 to (roughly) 12 million.
7 Proof of Theorems 1 and 6.6 (outline)
The proofs of Theorems 1 and 6.6 proceed in two steps. First, we show that the graph construction can be locally approximated by a certain branching process. Second, we analyze the estimates of agents in the graph by instead analyzing the estimates of agents in the tree resulting from the branching process. We note that studying tree agent estimates rather than graph agent estimates is advantageous because the tree has a comparatively simple structure that is more amenable to analysis.
The first step is identical for both theorems, while the second step requires a different analysis for each theorem. In Appendix 7.1, we outline the first step; in Appendices 7.2 and 7.3, we outline the second step for Theorems 1 and 6.6, respectively. To highlight the key ideas of our analysis, we defer many details to Appendix 9; in particular, proofs pertaining to Appendices 7.1, 7.2, and 7.3 can be found in Appendices 9.1, 9.2, and 9.3, respectively. Finally, we note that throughout the analysis we use and to denote probability and expectation, respectively, conditioned on the degree sequence .
7.1 Branching process approximation (Step 1 for proofs of Theorems 1 and 6.6)
We first show that the estimate of any agent in the graph depends (asymptotically) only on the structure of the agent’s local neighborhood and on certain signals realized within this neighborhood. This will facilitate the definition of the branching process with which we will approximate the graph construction. Importantly, the graph agent’s estimate will not depend on the prior parameters (asymptotically). This is necessary as we have not specified these priors (beyond assuming they are bounded by some independent of , as discussed in Section 2).
To begin, we require some notation. Let denote the graph’s column-normalized adjacency matrix, i.e. , and set , where is the identity matrix of appropriate dimension. (Recall from Section 3.1 that is in general a multi-set; hence, the numerator in may exceed 1.) Next, for , let denote the collection of signals in vector form. Finally, for define
[TABLE]
We note that (57) can be rewritten as
[TABLE]
where we have used the fact that . From this expression, it is clear that only depends on the structure of the -step neighborhood into (since only this sub-graph affects the terms) and on certain signals within this neighborhood, as mentioned above. We can then establish the following.
Lemma 7.7.
Given A4, s.t. , .
Proof 7.8.
See Appendix 9.1.1.
Before defining the aforementioned branching process, we formally define the graph construction described in Section 3.1. For this, we will use the following additional notation.
- We let denote the set of agents at distance from the initial agent , i.e. means a path from to of length exists, but no shorter path exists (hence, , , etc.). Similarly, we let denote the set of bots at distance from .
- We let denote the set of outstubs belonging to ; we let denote the set of all such outstubs.
- For each , we define a label as follows:
[TABLE]
We will explain the utility of these labels shortly.
With this notation in place, we present the formal graph construction as Algorithm 3. We offer some further comments to help explain the algorithm:
- The algorithm takes as input the degree sequence , which is used in Line 3 to define . Also in Line 3, we label all outstubs as 1 (since no agents have been added to the graph), and we initialize the set of bots to the empty set.
- In Line 3, we sample the agent from which the graph construction begins. Since then belongs to the graph, we change the labels of its outstubs to 2.
- For the remainder of the algorithm, we proceed in a breadth-first-search fashion, looping over distance and agents at distance from . For each such agent, we do the following:
  - For each of the instubs of intended for pairing with agent outstubs, we sample an agent outstub uniformly (Line 3), resampling until an unpaired outstub (i.e. one with label 1 or 2) has been found (Line 3). Upon finding such an outstub, denoted , we pair it with ’s instub to form an edge from to (Line 3). Note that implies was added to the graph when edge was formed; hence, because , is at distance from and must be added to (Line 3). Finally, we update the labels of the outstubs of via (59) (Lines 3-3). (Line 3 will be used in the branching process approximation and will be discussed shortly.)
  - For each of the instubs of intended for pairing with bot outstubs, we add a new bot with a self-loop and an unpaired outstub to the set of bots, updating accordingly (Line 3), and then add an edge from the new bot to (Line 3). Note here that at the start of the construction; it follows that the -th bot added to the graph is , so is the set of bots at the end of the construction.
  - Finally, if all agent outstubs have been paired, the construction terminates (Line 3).
We now return to discuss Line 3 of Algorithm 3. Here denotes the first iteration at which an outstub with label 2 or 3 is sampled for pairing with an instub. Put differently, means that for the first iterations of the construction, only outstubs with label 1 have been sampled. This has two consequences. First, no edges have been added between two nodes both at distance from , i.e. the -step incoming neighborhood of is a tree (except for the self-loops attached to bots). Second, no resampling of outstubs has occurred (Line 3); this implies that the outstub paired in Line 3 is chosen uniformly from , so the degrees of are distributed according to the out-degree distribution defined in (11).
These observations motivate a tree construction that we define next. In particular, we will construct a tree (except for bot self-loops) with edges pointing towards the root. Agents will be added to the tree with degrees sampled from , except for the root node, whose degrees are sampled from (also defined in (11)), corresponding to the degrees of in the graph construction.
The tree construction requires further notation. First, we let (, respectively) denote agents (bots, respectively) at distance from the tree’s root. We also set . (Here and moving forward, we use to distinguish tree-related objects from similarly-defined graph-related ones.) At times, we will use branching process terminology and e.g. refer to as the -th generation of agents. We let denote the root node, so that . We will denote a generic node in as ; here encodes the ancestry of , i.e. is the child of , who is in turn the child of , etc. Finally, for such and for , is the concatenation operation and denotes ’s ancestor in generation , with by convention (note also that for such ).
With this notation in place, we define the tree construction in Algorithm 4. We offer several more explanatory comments:
- Lines 4 and 4-4 define a particular random walk that will be used in Appendix 7.2; they do not affect the tree structure and we defer further explanation to Appendix 7.2.
- As mentioned above, the root node has degrees sampled from (Line 4), while all other agents have degrees sampled from (Line 4).
- In Line 4, a directed edge is added from to ; the other outstubs of are left unpaired so that the tree structure is preserved (except for bot self-loops).
- At the conclusion of the -th iteration, has incoming neighbor set (offspring, in the branching process terminology) . More specifically, the subset of ’s incoming neighbors are agents (Line 4), while the subset of ’s incoming neighbors are bots (Line 4).
- Unlike the graph construction, the tree construction continues indefinitely, yielding an infinite tree (except for bot self-loops) with edges pointing towards the root node .
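The generation-by-generation growth described above can be sketched as follows. This is an illustration under stated assumptions and not Algorithm 4: the two samplers are hypothetical stand-ins for the root and non-root degree distributions, labels are tuples that encode ancestry by concatenation, bots are treated as leaves, and the construction is cut off at a finite depth.

```python
import random

def grow_tree(depth, root_sampler, child_sampler, rng):
    """Grow agent and bot generations of a tree rooted at the empty label ().

    Each sampler returns (number of agent children, number of bot children).
    A label such as (0, 2) denotes child 2 of child 0 of the root.
    """
    agents = [[()]]  # agents[l] lists agent labels at distance l from the root
    bots = [[]]      # bots[l] lists bot labels at distance l from the root
    for level in range(depth):
        next_agents, next_bots = [], []
        for node in agents[level]:
            sampler = root_sampler if level == 0 else child_sampler
            n_agents, n_bots = sampler(rng)
            next_agents += [node + (k,) for k in range(n_agents)]
            next_bots += [node + (n_agents + k,) for k in range(n_bots)]
        agents.append(next_agents)
        bots.append(next_bots)
    return agents, bots

# Hypothetical offspring distributions (NOT those defined in the paper).
def root_sampler(rng):
    return rng.randint(1, 3), rng.randint(0, 1)

def child_sampler(rng):
    return rng.randint(0, 2), rng.randint(0, 1)

agents, bots = grow_tree(4, root_sampler, child_sampler, random.Random(1))
```

Dropping the last entry of a label gives the label of its parent, so every agent has a unique path to the root.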
Having defined the tree construction, we also define as in (57) but using the tree from Algorithm 4 instead of the graph from Algorithm 3. Specifically, we let
[TABLE]
where ; ; ; and is the column-normalized adjacency matrix of the tree from Algorithm 4. We pause to note that
[TABLE]
where the first inequality holds since (60) is a sum of nonnegative terms, the second follows since component-wise (where is the all-ones vector) and since is element-wise nonnegative, and the equality holds by column stochasticity of .
We can now state Lemma 7.9, which relates the estimate of a uniformly random agent in the graph with the estimate of the root node in the tree. For the first statement in the lemma, we argue that, conditioned on , the -step neighborhood of in the graph and the -step neighborhood of in the tree are constructed via the same procedure; since the signals are defined in the same manner as well, this implies and have the same distribution. The second statement of the lemma says that the condition occurs with high probability; it is essentially implied by [33, Lemma 5.4]. We note that the assumptions A1 and A2 are required for this second statement to hold, and are standard assumptions needed to locally approximate a sparse random graph construction with a branching process. Finally, we recall by A2, which is why the limit shown in Lemma 7.9 holds.
Lemma 7.9.
Assume A1 and A2 hold, and let denote equality in distribution. Then
[TABLE]
Proof 7.10.
See Appendix 9.1.2.
We can now state and prove Lemma 7.11, which is the main result for Step 1 of the proofs of the theorems. This result will allow us to analyze convergence of (the graph agent estimate) by instead analyzing convergence of (the tree agent estimate).
Lemma 7.11**.**
Assume A1, A2, and A4 hold. Then and all sufficiently large,
[TABLE]
Proof 7.12**.**
First, given , we have for sufficiently large ,
[TABLE]
where the first inequality uses the triangle inequality and in the second we used Lemma 7.7 to bound by Furthermore, by the law of total probability, we have
[TABLE]
Combining the previous two inequalities and using Lemma 7.9 (which applies since A1, A2 are assumed to hold), we obtain
[TABLE]
which is what we set out to prove.
Before proceeding, we state another lemma that will be used in Step 2 of the proofs for both theorems. This lemma uses the fact that each agent in the tree has a unique path to the root. As a result, we can obtain an alternate expression for the terms appearing in (60).
Lemma 7.13**.**
For each ,
[TABLE]
where by convention when .
Proof 7.14**.**
See Appendix 9.1.3.
7.2 Step 2 for proof of Theorem 1
Our next goal is to establish convergence of , from which convergence of will follow via Lemma 7.11. For this, we will use Chebyshev’s inequality, so we begin with two lemmas describing the limiting behavior of the mean and variance of . Here and moving forward, for random variables and we use and to denote variance and covariance conditional on the degree sequence.
Lemma 7.15**.**
Given A3 and A4, we have the following:
[TABLE]
Proof 7.16**.**
See Appendix 9.2.1.
Lemma 7.17**.**
Proof 7.18**.**
See Appendix 9.2.2.
Before proceeding, we briefly describe our approach to proving these lemmas. First, we note that in analyzing the moments of , the i.i.d. Bernoulli random variables in (67) are easily dealt with; the difficulty arises from the terms. Luckily, there is a simple interpretation of these terms that guides our analysis and that proceeds as follows. First, define a random walk with and chosen uniformly from the incoming neighbors of , for each . Then, as shown in (181) in Appendix 9.2.1,
[TABLE]
In short, computing the mean of amounts to computing hitting probabilities of the form . Similarly, to analyze the second moment of , we compute hitting probabilities of the form , where is defined in the same manner as and is conditionally independent of given the tree structure. We note that, in principle, the -th moment of can be computed by analyzing walks. However, the calculations become exceedingly complex as grows, and because we only require two moments, we do not study any case .
This interpretation explains Lines 4 and 4-4 of Algorithm 4: in Line 4, we begin two random walks at the root node ; each time Lines 4-4 are reached, we advance the random walks one step. Importantly, we simultaneously sample the walks and construct the tree. In particular, the -th step of the walk is taken at Line 4, before the degrees of the corresponding node are realized in Line 4; this is crucial to our computation of the aforementioned hitting probabilities. Finally, we note that in Line 4 of Algorithm 4, the condition implies the walk reaches the set of bots ; since bots have self-loops but no other incoming edges, they act as absorbing states on the walk. This is why the entire future trajectory of the walk can be defined in Line 4.
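The absorbing behavior described above can be illustrated with a minimal Monte Carlo sketch. It is not Algorithm 4: the helper names `survives` and `estimate_survival`, and the user-supplied `sample_in_degrees` callback, are hypothetical, and the sketch ignores the distinction between the root's degree distribution and that of later generations. It only mirrors two features of the discussion: the degrees of a node are realized when the walk arrives there, and stepping onto a bot ends the walk's movement.

```python
import random


def survives(num_steps, sample_in_degrees, rng):
    """Walk from the root for num_steps steps; return True if no bot is hit.

    sample_in_degrees(rng) returns (agent_in_neighbors, bot_in_neighbors) for
    the node the walk currently occupies. It is called only when the walk
    arrives at a node, so the tree is revealed along the trajectory.
    Assumption: every visited agent has at least one incoming neighbor.
    """
    for _ in range(num_steps):
        n_agents, n_bots = sample_in_degrees(rng)
        # Choose one incoming neighbor uniformly at random.
        if rng.randrange(n_agents + n_bots) >= n_agents:
            return False  # stepped onto a bot: absorbed from now on
    return True


def estimate_survival(num_steps, sample_in_degrees, trials=20000, seed=0):
    """Monte Carlo estimate of the probability of staying among agents."""
    rng = random.Random(seed)
    hits = sum(survives(num_steps, sample_in_degrees, rng) for _ in range(trials))
    return hits / trials
```

With fixed degrees, say three agent and one bot incoming neighbors at every node, each step stays among agents with probability 3/4, so the estimate for two steps should be close to 9/16.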
In Lemmas 7.19 and 7.21, we compute the hitting probabilities needed for the proofs of Lemmas 7.15 and 7.17. We note that, in addition to the random variables defined in (13) in Section 3.1, Lemma 7.21 requires the definition of several similar random variables; we define these in (72) (and also recall the definitions of for convenience). We discuss these in more detail shortly.
[TABLE]
Lemma 7.19**.**
We have
[TABLE]
Proof 7.20**.**
See Appendix 9.2.4.
Lemma 7.21**.**
For , we have
[TABLE]
Furthermore,
[TABLE]
Proof 7.22**.**
See Appendix 9.2.5.
Before proceeding, we comment on the form of (75), which helps explain the definitions in (72). Namely, in (75), is the probability of the two random walks visiting different agents on the first step of the walk ( term), then separately remaining in the agent set for the next steps of the walk ( term); similarly, is the probability of the walks visiting the same agents for steps ( term), then visiting a different agent on the -th step ( term), then separately remaining in the agent set for steps ( term); finally, is the probability of the walks remaining together and in the agent set for steps. Each of these arguments follows from (72): gives the probability of a single walk proceeding to an agent ( term), gives the probability of two walks proceeding to the same agent ( term for the first walk, term for the second walk), and gives the probability of two walks proceeding to different agents ( term for the first walk, term for the second walk). Similar arguments apply to , except these pertain to the first steps of the walks.
Equipped with Lemmas 7.15 and 7.17, we can prove Theorem 1. First, suppose . Given , we can use Lemma 7.11 to obtain (provided the limits exist)
[TABLE]
where we have used by A1 and by A2. Next, using total probability,
[TABLE]
We can further expand the first summand in (78) as
[TABLE]
where we have simply used the triangle inequality and the union bound. Now for the first summand in (80), we have (via total expectation and the conditional form of Chebyshev’s inequality)
[TABLE]
where the limit holds by Lemma 7.17. For the second summand in (80), we write
[TABLE]
where the first two lines use total expectation and the inequality for (which is easily proven by considering the cases and ), and the limit holds by Lemma 7.15. Finally, combining (76), (78), (80), (81), and (83), and recalling that by A3, we obtain
[TABLE]
Since was arbitrary, we conclude that converges to in probability, completing the proof in the case . For the cases and , respectively, we can replace with and [math], respectively (the corresponding cases from Lemma 7.15), but otherwise follow the same approach.
7.3 Step 2 for proof of Theorem 6.6
Similar to the second step in the proof of Theorem 1, we begin by analyzing the limiting behavior of . However, we will use a different approach than that used in Theorem 1. This approach is made possible by the stronger assumptions of Theorem 6.6, and it will yield a fast rate of convergence that will allow us to prove the theorem.
To explain our approach, we first recall that Lemma 7.13 states
[TABLE]
Hence, letting denote the collection of random variables defining the tree structure,
[TABLE]
where we have simply used the fact that the signals are i.i.d. random variables. Our basic approach will now proceed in two steps. First, in Lemma 7.23 we condition on the tree structure, so that is simply a weighted sum of i.i.d. random variables; the lemma shows that this weighted sum is close to its conditional mean with high probability. Second, in Lemma 7.25, we show that the conditional mean converges to zero in probability. Before proceeding, we also note that an argument similar to (61) implies
[TABLE]
which will be used in the proofs of the lemmas in this appendix.
We now state Lemma 7.23. As mentioned, the proof involves analyzing a weighted sum of i.i.d. random variables; hence, our analysis is similar to the derivation of Hoeffding’s inequality.
Lemma 7.23**.**
Assume and independent of s.t. the following hold:
- A4, with .
Then ,
[TABLE]
Proof 7.24**.**
See Appendix 9.3.1.
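For reference, the textbook form of Hoeffding's inequality for a weighted sum is recalled here in generic notation. The symbols below ($X_i$, $w_i$, $m$, $\epsilon$) are placeholders and are not the paper's; the bound in Lemma 7.23 is derived by a similar argument but is not identical to this statement.

```latex
% Assumptions: X_1, ..., X_m are independent with X_i \in [0,1],
% and w_1, ..., w_m are fixed real weights.
\[
\Pr\left( \left| \sum_{i=1}^{m} w_i \left( X_i - \mathbb{E}[X_i] \right) \right| \ge \epsilon \right)
\le 2 \exp\left( - \frac{2 \epsilon^2}{\sum_{i=1}^{m} w_i^2} \right),
\qquad \epsilon > 0 .
\]
```

The denominator is the sum of squared ranges of the summands $w_i X_i$, which is why conditioning on the tree structure (so that the weights are fixed) is the natural first step.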
Lemma 7.25 states that conditional mean converges to zero in probability. Note that the only source of randomness in is the tree structure. Since the tree structure is generated recursively, has a martingale-like structure; this allows us to use an approach similar to the Azuma-Hoeffding inequality for bounded-difference martingales.
Lemma 7.25**.**
Assume and independent of s.t. the following hold:
- A3, with and .
- A4, with .
Then ,
[TABLE]
Proof 7.26**.**
See Appendix 9.3.2.
With Lemmas 7.23 and 7.25 in place, we can prove Theorem 6.6. First, since , taking in Lemma 7.11 yields
[TABLE]
where the equality is by the theorem assumptions. For the first summand in (93), we write
[TABLE]
where the first equality adds and subtracts a term, the first inequality is immediate, the second inequality uses the union bound, the second equality uses Lemmas 7.23 and 7.25, and the final equality holds since implies . Substituting into (93),
[TABLE]
We can then write
[TABLE]
where we have used (98). Hence, by Markov’s inequality,
[TABLE]
where the limit holds by the assumption on in the statement of the theorem.
7.4 Other remarks
7.4.1 A sufficient condition for extending Theorem 6.6
Here we show that the condition (52) from Appendix 6 is sufficient to extend Theorem 6.6 to other cases of . Recall this condition is
[TABLE]
where is the limit from Theorem 1 based on the relative asymptotics of and , i.e.
[TABLE]
Suppose (103) holds in the case , so that . In this case, we have
[TABLE]
where the first inequality is Lemma 7.11 (which holds for all cases of ) with and the third uses Lemma 7.23 (which holds for all cases of ) and the sufficient condition (103). Hence, by the argument following (98), we obtain for any , , and ,
[TABLE]
i.e. Theorem 6.6 holds with replaced by . The same argument shows that Theorem 6.6 holds (with only a change of ) in the cases and with .
7.4.2 Comparing Step 2 for proofs of Theorems 1 and 6.6
As shown in Appendices 7.2 and 7.3, Step 2 for the proofs of both theorems involves bounding for the appropriate . One may wonder why we have conducted a different analysis for the two theorems. The reason is that, as shown in Appendix 9.3.3, the analysis for Step 2 of Theorem 6.6 yields a bound that does not decay with in the case . Hence, we have derived a bound for Theorem 1 that encompasses all cases of . On the other hand, the bound from Theorem 1 only states but does not provide a rate of convergence, and so it cannot be used to prove Theorem 6.6. We also note that Appendix 9.3.3 shows that, while the bound for Step 2 of Theorem 6.6 does decay in for the case with , it does not decay quickly enough to establish (52).
8 Section 4 proof and experiment details
8.1 Solution of the relaxed problem
We aim to show , where (we recall)
[TABLE]
We also recall , , , and
[TABLE]
First note strict convexity of for implies strict convexity of , i.e. for any and ,
[TABLE]
Also note we can rewrite the relaxed problem (30) as
[TABLE]
where . Given , we also define the Lagrangian
[TABLE]
Finally, we set (clearly, ). Now to prove the theorem, it suffices to establish the following KKT conditions (see e.g. [13, Section 5.5.3]):
1. , i.e. is a feasible point of (112).
2. , i.e. the first-order condition is satisfied.
3. , i.e. complementary slackness holds.
We proceed to the proofs of these three statements.
1. Clearly, . To show , we claim (and will return to prove) that is a fixed point of , i.e. . Assuming this claim holds, we have
[TABLE]
where the last two equalities use the fixed point claim and the definition of , respectively.
2. First, let satisfy , so that . Then
[TABLE]
Next, let satisfy , so that . Then
[TABLE]
3. For any , we have
[TABLE]
Clearly, the first term is zero if , the second is zero if , and both are zero if . Finally, holds by (114).
We return to establish the fixed point claim. We in fact prove the slightly stronger result
[TABLE]
The fixed point claim then follows, since by definition and by (120) with , where is a maximizer of . Thus, it suffices to prove (120). Towards this end, fix . We first assume and will return to address the other case. For any , we define
[TABLE]
where by convention if are such that (i.e. if the sums are over empty sets). Then by definition of , , and , we have
[TABLE]
Again by definition of , , and , and recalling , we also have
[TABLE]
Thus, combining the previous two equations, we obtain
[TABLE]
If instead , we can use the same argument to obtain
[TABLE]
8.2 Rewriting the objective function
We aim to prove (39) and (40), which we restate here for convenience:
[TABLE]
For the equality in (128), we write
[TABLE]
where the first, fourth, and fifth equalities hold by definition of (see proof of Theorem 2), (see Algorithm 2), and (see (27)), respectively, and the others are straightforward. For the inequality in (128), we first write (similar to above)
[TABLE]
Now fix and . Then by the smoothing property,
[TABLE]
We next observe
[TABLE]
where we used independence of , Algorithm 2, and (114), respectively. Combining the previous two identities,
[TABLE]
where the first inequality is Jensen’s and the second holds since and . Substituting into (131), we obtain
[TABLE]
where the equality holds by definition of (27).
8.3 Self-bounding concentration
As mentioned in the main text, we exploit the theory of self-bounding functions.
Definition 8.27**.**
[14, Section 3.3] Let be a measurable space, , and . We say is a self-bounding function if there exist auxiliary functions such that, for any ,
[TABLE]
where .
Theorem 8.28**.**
[14, Theorem 6.12] Let be independent -valued random variables, define , and let be self-bounding. Then for any ,
[TABLE]
Assuming for the moment that is self-bounding, we can apply the theorem with to obtain
[TABLE]
where as in Theorem 2. This completes the proof of (43) from the main text.
To verify is self-bounding, we use the most obvious choice of auxiliary functions: let
[TABLE]
where for , i.e. we simply ignore the -th coordinate of . Towards bounding , we first observe
[TABLE]
where in (144) we computed the difference of fractions in (143), in (145) we replaced by (which is permitted due to the indicator ), and in (146) we rearranged the expression; the upper bound in (147) is obvious, while the lower bound holds since the second factor in (146) is less than 1. Using the upper bound in (147), we can then obtain
[TABLE]
On the other hand, using the lower bound in (147), along with (148)-(149), we immediately obtain . Together with (150), the first condition in Definition 8.27 holds. To verify the second condition in Definition 8.27, we use the leftmost expression in (150) to obtain
[TABLE]
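The two conditions of Definition 8.27 can be checked numerically on small inputs. The sketch below assumes the standard form of the definition: each difference between the function and its auxiliary version with one coordinate dropped lies in [0, 1], and these differences sum to at most the function value. The function `distinct_count` is a classical textbook example of a self-bounding function and is not the function analyzed in this appendix; all names here are hypothetical.

```python
def distinct_count(xs):
    """Number of distinct values in the tuple xs (a classical example)."""
    return len(set(xs))


def is_self_bounding_at(f, xs):
    """Check both self-bounding conditions at the single point xs.

    The auxiliary function f_i is taken to be f applied to xs with the i-th
    coordinate ignored, mirroring the 'most obvious choice' in the text.
    """
    xs = tuple(xs)
    fx = f(xs)
    diffs = [fx - f(xs[:i] + xs[i + 1:]) for i in range(len(xs))]
    bounded = all(0 <= d <= 1 for d in diffs)   # first condition
    summed = sum(diffs) <= fx                   # second condition
    return bounded and summed
```

For `distinct_count`, dropping a coordinate lowers the value by one exactly when that coordinate's value appears once, so the differences sum to the number of values appearing exactly once, which is at most the number of distinct values.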
8.4 Proof of Corollary 4.4
Let ; recall as by assumption. Define by
[TABLE]
Clearly, , so (see Appendix 8.1). Hence, by Theorem 2, we aim to find s.t.
[TABLE]
where as in Theorem 2. In fact, it suffices to show , since then we choose such that (for example) to ensure (153) holds. Toward this end, first note that for any ,
[TABLE]
Hence, by definition of (27),
[TABLE]
Recall by assumption, so
[TABLE]
Since also by assumption, the final expression in (155) diverges, as desired.
8.5 Other algorithmic details
We first show Algorithm 1 solves (28). We require a basic fact about discrete convexity.
Definition 8.29**.**
[12, Section 1.4.2] Let and . Then is called M-convex if for any and any satisfying , there exists satisfying
[TABLE]
Theorem 8.30**.**
[12, Theorem 6.26] Let be M-convex, and let . Then
[TABLE]
In words, the theorem says that minimizes if and only if cannot be decreased by an “exchange,” wherein is replaced by . Note that Algorithm 1 terminates precisely when this criterion is satisfied, so if we can show that (29) is M-convex, we obtain as a corollary that Algorithm 1 solves (28).
To show M-convexity, let s.t. . Then since , we clearly have for some . From and , it is also clear that . Hence, letting ,
[TABLE]
where we have simply used the definitions of . Similarly, we obtain
[TABLE]
Adding the previous two equations, and using the inequalities (where the first holds since and , and the second holds similarly) gives .
For the runtime of Algorithm 1, we note the following:
- The complexity of each iteration is dominated by the computation of . By (159), we can compute in time per pair, which yields complexity per iteration.
- In the best case, the initial choice of is actually a solution. However, it still requires one iteration to verify this, so the best-case complexity is .
- In the general case, [12, Section 10.1.1] provides a tie-breaking rule for the choice of that guarantees termination in iterations, which means complexity.
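The exchange criterion of Theorem 8.30 suggests a generic steepest-descent loop, sketched below. This is an illustration of exchange-based local search for an M-convex objective, not a transcription of Algorithm 1: the objective (29) is not reproduced, the tie-breaking rule of [12, Section 10.1.1] is not implemented, and the names `exchange_descent` and `toy_objective` are hypothetical. The toy objective is a separable convex function on integer vectors with a fixed sum, a standard example of an M-convex function.

```python
def exchange_descent(f, x, lower=0):
    """Repeatedly apply the best exchange x - e_i + e_j until none improves f.

    f     : objective on integer lists (assumed M-convex, so a point with no
            improving exchange is a global minimizer)
    x     : feasible starting point; its coordinate sum is preserved
    lower : lower bound on every coordinate
    """
    x = list(x)
    n = len(x)
    best = f(x)
    while True:
        best_move = None
        for i in range(n):
            if x[i] <= lower:
                continue  # cannot remove a unit from coordinate i
            for j in range(n):
                if i == j:
                    continue
                x[i] -= 1
                x[j] += 1
                value = f(x)
                x[i] += 1
                x[j] -= 1
                if value < best:
                    best, best_move = value, (i, j)
        if best_move is None:
            return x  # no exchange decreases f
        i, j = best_move
        x[i] -= 1
        x[j] += 1


def toy_objective(x, target=(3, 1, 0)):
    """Separable convex toy objective (hypothetical, for illustration only)."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))
```

Each pass of the double loop evaluates every ordered pair once, which matches the per-iteration cost structure described above: the number of pairs times the cost of one objective evaluation.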
For the randomized scheme (Algorithm 2), first observe that by definition of , . Furthermore, , and thus , can be computed in time as follows:
- Compute a vector containing sorted in decreasing order ( time).
- Iteratively compute the sums in (32) at each ( time).
- Compute ( time).
In summary, (which contains at most elements) can be computed in time. After computing this set, , and subsequently , can each be computed in linear time. Thus, computing the relaxed solution (31) requires complexity. Finally, assuming we can obtain one sample from in time after pre-processing time (using e.g. the alias method [34, Section 3.4.1]), Algorithm 2 has total complexity .
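The sampling assumption in the last sentence can be made concrete with a short sketch of the alias method (Vose's variant): linear-time pre-processing of a probability vector, followed by constant-time sampling. The function names `build_alias_table` and `alias_sample` are hypothetical, and the probability vector in the usage note is an arbitrary example rather than the distribution used by Algorithm 2.

```python
import random


def build_alias_table(probs):
    """Linear-time pre-processing for the alias method.

    probs : list of nonnegative probabilities summing to 1
    Returns (prob, alias), two lists of the same length as probs.
    """
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        g = large.pop()
        prob[s] = scaled[s]   # keep outcome s with this probability
        alias[s] = g          # otherwise return outcome g
        scaled[g] = scaled[g] + scaled[s] - 1.0
        (small if scaled[g] < 1.0 else large).append(g)
    for i in large + small:
        prob[i] = 1.0         # leftovers differ from 1 only by rounding
    return prob, alias


def alias_sample(prob, alias, rng):
    """Draw one sample in constant time from a pre-processed table."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For example, after `prob, alias = build_alias_table([0.5, 0.3, 0.2])`, repeated calls to `alias_sample(prob, alias, random.Random(0))` return indices 0, 1, and 2 with frequencies close to the given probabilities.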
8.6 Additional experiments
Figure 5 shows an analogue of Figure 1 with budget for each . The results are qualitatively similar to Figure 1 (Algorithm 1 outperforms Algorithm 2, which itself outperforms the heuristics). We also observe that the gap between the heuristics and our algorithms generally increases as the budget decreases for a fixed social network. Put differently, if an adversary with a limited budget spends this budget intelligently (i.e. using our proposed solutions), they can still disrupt learning; in contrast, an adversary with a large budget need not be as careful.
9 Proof of Theorems 1 and 6.6 (details)
9.1 Branching process approximation (Step 1 for proofs of Theorems 1 and 6.6)
9.1.1 Proof of Lemma 7.7
For , let denote the parameters in vector form, and let denote the all ones vector. We claim
[TABLE]
We prove (162) for ; the proof for follows the same approach. First, we use the parameter update equations (3), and the definitions of and from Appendix 7.1 ( being the column-normalized adjacency matrix and ) to write the parameter update equation in vector form as
[TABLE]
We next use induction. For , (162) is equivalent to (163). Assuming (162) holds for , we have
[TABLE]
which completes the proof. Next, recalling is the vector with 1 in the -th position and 0 elsewhere,
[TABLE]
where the equalities hold by definition, by (162), since the columns of sum to 1 by definition, and by multiplying numerator and denominator by , respectively. Next, recall from Section 2 that for some . Hence, is element-wise upper bounded by , so , where we have used column stochasticity of . Additionally, (since the three terms in the product are elementwise nonnegative). By a similar argument, . Taken together, we can use the previous equation to obtain
[TABLE]
Finally, recall from Section 2 that and are independent of . Hence, because as (by A4 in the statement of the lemma), as . It follows that, for given and sufficiently large, . Finally, by changing the index of summation, it is clear that , completing the proof.
9.1.2 Proof of Lemma 7.9
We begin by arguing . For this, first consider the sub-graph containing only edges between two agents formed during the first iterations of Algorithm 3. Conditioned on , this sub-graph is constructed as follows:
- The initial agent is sampled uniformly from (Line 3), which implies its degrees are distributed as . (In fact, this holds even if .)
- Each time an edge is added to the sub-graph (Line 3), the paired outstub is sampled uniformly from (else, is contradicted by Line 3-3), so the degrees of the corresponding agent are distributed as .
- The initial agent has no paired outstubs, while all other agents in the sub-graph have one paired outstub (otherwise, an outstub with label 2 was paired within the first iterations, contradicting by Line 3); in particular, the sub-graph has nodes and edges. Also, every agent in the sub-graph has a path to by the breadth-first-search nature of the construction, so, neglecting edge polarities, we obtain a connected graph with nodes and edges, i.e. a tree. Finally, since all edges point towards (see Line 3), the sub-graph is a directed tree pointed towards .
In summary, the sub-graph is a directed tree pointing towards an agent with degrees distributed as , in which all other nodes have degrees distributed as . This is precisely the procedure used to construct the sub-graph of agents during the first iterations of Algorithm 4. Additionally, Algorithms 3 and 4 add bots in the same manner (Lines 3-3 in Algorithm 3, Lines 4-4 in Algorithm 4). Taken together, we conclude that, conditioned on , the -step neighborhood into is constructed in the same manner in Algorithm 3 as the -step neighborhood into is constructed in Algorithm 4. Furthermore, by (58) and (60), it is clear that and , respectively, depend only on these respective neighborhoods, and on the signals and , respectively. Since the signals and are also defined in the same manner ( for ; for ), we ultimately conclude that and have the same distribution when holds.
We next argue occurs with high probability when holds. For this, we note that Algorithm 3 is nearly identical to the graph construction described in [33, Section 5.2]. More specifically, the only difference is that the construction in [33] does not include the pairing of agent instubs with bots in Lines 3-3 of Algorithm 3. However, these lines do not affect . Moreover, when A1 holds, the assumptions of [33, Lemma 5.4] are satisfied. This lemma states that, if and (with defined as in A1), then . In particular, by A2 we have for sufficiently large, with independent of ; substituting gives
[TABLE]
9.1.3 Proof of Lemma 7.13
We first claim that for and ,
[TABLE]
(Recall is the column-normalized adjacency matrix.) We prove (170) separately for and . When , the only case is (since ); if , the left side is clearly 1 and the right side is 1 by convention; if , the left side is 0 since ( has no outgoing neighbors in the tree). Next, we aim to prove (170) for and . For such , there is a unique path from to with length that visits the nodes . By definition of , it follows that
[TABLE]
On the other hand, if , no path of length from to exists, so . This proves (170).
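The unique-path argument can be checked on a small finite example. The sketch below makes several assumptions that are not taken from the paper: the matrix entry for an edge from `src` to `dst` is one over the in-degree of `dst` (so nonzero columns sum to 1), the tree is truncated so that leaves have no incoming neighbors, bots and their self-loops are omitted, and the names `column_normalized`, `mat_pow`, and `TOY_EDGES` are hypothetical.

```python
def column_normalized(edges, n):
    """Matrix P with P[src][dst] = 1 / in_degree(dst) for each edge src -> dst."""
    in_deg = [0] * n
    for _, dst in edges:
        in_deg[dst] += 1
    P = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        P[src][dst] = 1.0 / in_deg[dst]
    return P


def mat_mul(a, b):
    """Plain dense matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, power):
    """P raised to a nonnegative integer power (identity for power 0)."""
    n = len(P)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, P)
    return result


# Toy tree pointing towards root 0: nodes 1 and 2 point to the root,
# nodes 3, 4, 5 point to node 1.
TOY_EDGES = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
```

In this toy tree, node 3 reaches the root by the single path 3, 1, 0, so the (3, 0) entry of the squared matrix is the product (1/3)(1/2) of reciprocal in-degrees along that path, while the (2, 0) entry of the squared matrix is zero because node 2 is at distance one, not two, from the root.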
Recalling that , we next claim that ,
[TABLE]
We prove (172) inductively: both sides equal when ; assuming (172) is true for , we have
[TABLE]
where in the first line we have used the definition of and the inductive hypothesis, the second line simply uses the distributive property, the third rearranges summations, and the fourth uses Pascal’s rule ( has subsets of cardinality ; that contain 1 and that do not contain 1). This completes the proof of (172).
Having established (172) and (170), we can combine them to obtain ,
[TABLE]
Finally, substituting the previous equation into (60), and recalling , we obtain
[TABLE]
which completes the proof.
9.2 Step 2 for proof of Theorem 1
9.2.1 Proof of Lemma 7.15
First, letting denote the degree sequence and denote the set of random variables defining the tree structure, we can use Lemma 7.13 to write
[TABLE]
where the first equality uses the tower property of conditional expectation and the fact that and are fixed given the tree structure, the second uses the fact that , and the third holds by the tower property and the definition of , i.e.
[TABLE]
Here we have also used the fact that is a random walk starting at the root of a directed tree; hence, for , is the probability of the lone path from to , which is , and for some . Next, using (180) and Lemma 7.19, we obtain
[TABLE]
where by convention the summation over is zero when . Adding and subtracting , the previous equation can be rewritten as
[TABLE]
where we have simply used the binomial theorem and computed two geometric series.
Next, we assume temporarily that as . By A3, we have for
[TABLE]
Hence, by , and since by A3, we have for , sufficiently large, and such
[TABLE]
where we have also used the fact that on by A3. Also, by A4, it is clear that , so for given and sufficiently large,
[TABLE]
Combining the previous four equations implies that for sufficiently large and ,
[TABLE]
We complete the proof for the case ; the proof for the other two cases is similar. In this case, we can use Lemma 9.31 from Appendix 9.4 to obtain for any and for large enough
[TABLE]
Combining the previous two equations gives for large and
[TABLE]
Hence, for given , we can find sufficiently small and sufficiently large such that, for , . This clearly also implies for such . On the other hand, for , it is trivial that . This completes the proof for the case .
We now return to the case . In this case, it follows from A4 that cannot occur, i.e. we need only consider the case . First, note that since and , we have for some and sufficiently large. For such , and for , we then obtain ; substituting into (182) (evaluated at ) gives
[TABLE]
where in the first inequality we used and , in the second we used (so that ), for the equality we used the binomial theorem and computed a geometric series, and the final inequality is immediate. Since are independent of , while as by A4, it is clear from this final expression that, for given , sufficiently large, and , . It follows that , completing the proof.
9.2.2 Proof of Lemma 7.17
First, suppose . Then, since (see (61) and the following argument), . Furthermore, since by A4, the fact that means only the case can occur. In this case, since by Lemma 7.15, we immediately obtain from that as well. Hence, it only remains to prove the lemma in the case , which we assume to hold for the remainder of the proof.
Towards this end, letting denote the degree sequence and denote the set of random variables defining the tree structure (as in Appendix 9.2.1), we have
[TABLE]
We next consider the two summands in (197) in turn. In particular, we aim to show that each summand multiplied by tends to zero as tends to infinity.
For the first summand in (197), we use the fact that the signals are i.i.d. Bernoulli() given the tree structure, as well as Lemma 7.13, to write
[TABLE]
where in the final step we have used and . It immediately follows that Hence, because as by A4, analysis of the first summand in (197) is complete.
For the second summand in (197), we first use the argument of (180) to write
[TABLE]
where we have defined and . Therefore,
[TABLE]
It remains to compute the variance and covariance terms in (203). First, for any , we note
[TABLE]
where we have used the argument of (181) and the fact that and are independent random walks given the tree structure. By a similar argument, . Hence, using Lemmas 7.19 and 7.21, and assuming for the moment that , we have
[TABLE]
Next, using (72) and Jensen’s inequality, we have
[TABLE]
and so , i.e. the second term in (212) is non-positive, so ,
[TABLE]
In the case , we have (again by Lemmas 7.19 and 7.21)
[TABLE]
where the inequality is (213) and ; hence, (214) holds for as well. Finally, since , it is immediate that (214) also holds for . We next analyze the covariance terms in (203). First, if , we can use (204) and Lemmas 7.19 and 7.21 to obtain
[TABLE]
On the other hand, if , we have , so . Hence, combined with (214), we have argued
[TABLE]
Hence, combining (203), (214), and (219), we obtain
[TABLE]
where the second inequality is simply , the first equality is immediate, and the second equality holds by definition of . It clearly follows that
[TABLE]
and so we can complete the proof by showing the right side of (224) tends to zero. Clearly, the right side is zero if ; we aim to also show that, given , s.t. for and ,
[TABLE]
To prove (225), we first recall that by A3, we have for , . Hence, since we are assuming , and since by A3, we have for , sufficiently large, and such , . We thus obtain for large and ,
[TABLE]
To further upper bound the right side of (226), we note by the first equality in (213). The same argument gives Note, however, that to use the second bound, we must ensure . To this end, recall that for by A3. Hence, assuming we choose , we obtain for such . Thus,
[TABLE]
where the first inequality uses (226) and the bounds from the previous paragraph, the equalities are straightforward, the second inequality uses for by A3, and the third uses (recall we have chosen ). Finally, it is straightforward to see the final bound in (230) tends to zero with . Hence, for sufficiently small , (225) follows, completing the proof.
9.2.3 Notation for proofs of Lemmas 7.19 and 7.21
In the next two subsections, we prove Lemmas 7.19 and 7.21. For these proofs, we let denote the degree sequence , and we let denote a realization of this set. Note that the random variables defined in (72) are all functions of ; for a realization of , we let e.g. denote the realization of . We similarly define for realizations of , defined in (11). Finally, letting , we have by definition of . Hence, to prove Lemma 7.19, it suffices to show
[TABLE]
while to prove Lemma 7.21, it suffices to show
[TABLE]
9.2.4 Proof of Lemma 7.19
The case is trivial, since , so we assume moving forward. First, since is an absorbing set, we have , so
[TABLE]
For the first term in (234), we have
[TABLE]
where the second equality holds by Algorithm 4. More specifically, for , the degrees of are sampled from (Line 4 in Algorithm 4) after realizing (Line 4), yielding the term; further, is chosen uniformly from the incoming neighbors of (Line 4) after realizing the degrees of , yielding the term (the case is similarly justified). Combining (234) and (235), and using the fact that by definition, completes the proof in the case . For , we again use (234) and (235) to obtain
[TABLE]
which completes the proof.
9.2.5 Proof of Lemma 7.21
We begin by proving the first statement in the lemma, i.e. (232). First, we note that for the case, by definition, so , and the statement holds by Lemma 7.19. For the case, we first write
[TABLE]
where the first equality holds since is an absorbing set (i.e. ) and the second simply rewrites a conditional probability. Next, by the same argument as (235),
[TABLE]
where we have used the case of (235), since . Hence, the previous two equations give
[TABLE]
This completes the proof of (232). For the second statement, i.e. (233), the case is trivial, since by definition, so we assume for the remainder of the proof. First, let denote the first step at which the two walks diverge. Note that by definition, so ; also, due to the tree structure, the walks remain apart forever after diverging, i.e. Next, for , we write
[TABLE]
We begin by computing the second term in (245). Here we have
[TABLE]
where the first and last equalities hold by definition of and the second holds since is an absorbing set. Now for , we obtain
[TABLE]
where the first equality uses independence and eliminates repetitive events, and the third follows an argument similar to that following (235). Combining (247) and (254),
[TABLE]
Finally, by an argument similar to (254), we have
[TABLE]
Hence, combining (259) and (261) gives
[TABLE]
For the first term in (245), we first consider the summand. For , similar to (254),
[TABLE]
where in the final step we have also used (264). Similarly, for ,
[TABLE]
To summarize, we have shown
[TABLE]
Next, we consider the summands in (245) (such summands are present only for ). We have
[TABLE]
where in the first equality we used the fact that is an absorbing set and the fact that once the walks diverge they remain apart; in the second equality we used the fact that and are conditionally independent given the event . Further, for ,
[TABLE]
and so, combining the previous two equations and applying recursively yields
[TABLE]
where the final equality uses (273). Finally, combining (245), (264), (273), and (281) yields
[TABLE]
which is what we set out to prove.
9.3 Step 2 for proof of Theorem 6.6
9.3.1 Proof of Lemma 7.23
We first write
[TABLE]
where the first equality uses the law of total expectation and the second is immediate. For the first summand in the expectation in (286), we fix and write
[TABLE]
Here the first equality holds by monotonicity of , the first inequality is Markov’s, the second equality holds by (87), the second inequality uses Lemma 9.35 from Appendix 9.4, the third inequality uses , the third equality is immediate, the fourth equality again uses (87), and the fourth inequality uses (89). Since the preceding argument holds , we choose to minimize the bound. Upon substituting into (293), we obtain . The same argument holds for the second summand in the expectation of (286). We also note that the bound is non-random, so we may discard the expectation. In summary, we have shown
[TABLE]
Hence, for sufficiently large, we have by assumption on
[TABLE]
which is what we set out to prove.
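The pattern in the proof above (apply a monotone transformation, use Markov's inequality, bound the resulting expectation, then choose the free parameter that minimizes the bound) can be illustrated with a minimal sketch. It assumes the transformation is exponential, which is the standard Chernoff-style instance of this argument; the binomial example and the names `binomial_tail`, `exponential_markov_bound`, and `best_bound` are hypothetical and not taken from the paper.

```python
import math

def binomial_tail(n, p, a):
    """Exact P(S >= a) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

def exponential_markov_bound(n, p, a, theta):
    """Markov's inequality applied to exp(theta * S) for theta > 0:
    P(S >= a) = P(exp(theta*S) >= exp(theta*a)) <= E[exp(theta*S)] * exp(-theta*a)."""
    mgf = (1 - p + p * math.exp(theta)) ** n  # moment generating function of Binomial(n, p)
    return mgf * math.exp(-theta * a)

def best_bound(n, p, a, grid=2000, theta_max=5.0):
    """The bound holds for every theta > 0, so take the smallest one over a grid."""
    thetas = [theta_max * (i + 1) / grid for i in range(grid)]
    return min(exponential_markov_bound(n, p, a, t) for t in thetas)

if __name__ == "__main__":
    n, p, a = 50, 0.2, 20
    print("exact tail:     ", binomial_tail(n, p, a))
    print("optimized bound:", best_bound(n, p, a))
```

Because the inequality is valid for each fixed parameter value, minimizing over the parameter is free; this is the same reason the proof may pick the minimizing value after the fact.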
9.3.2 Proof of Lemma 7.25
We begin by deriving a bound conditioned on the degree sequence. First, we fix and use monotonicity of and Markov’s inequality to write
[TABLE]
The bulk of the proof will involve bounding the expectation term. For this, we first note
[TABLE]
where the first equality holds by (87), the second rearranges summations, and in the third we have defined , , and . For the remainder of the proof, we use to denote conditional expectation with respect to the degree sequence and the set of random variables realized during the first iterations of Algorithm 4 (i.e. the random variables defining the first generations of the tree). Using this notation, we have
[TABLE]
where in the third equality we have multiplied and divided . Next, we note
[TABLE]
where in the first equality we rewrote the sum based on the construction of in Algorithm 4, in the second we have used the fact that for by Algorithm 4 (in words, and share the same ancestry in the tree), in the third we have recognized that the -th summand does not depend on , and in the fourth we have used (since ) and the construction of the agent offspring of in Algorithm 4. It follows that
[TABLE]
where holds by definition of in Algorithm 4 and of from (72). In summary, we have argued . On the other hand, we note , where the first inequality holds since is a sum of nonnegative terms and the second holds by (303) (using ), and where by definition. Hence, we can use Lemma 9.35 from Appendix 9.4 to obtain
[TABLE]
Substituting into (299) then yields
[TABLE]
We can then iteratively apply the preceding argument. Namely, we have
[TABLE]
(The precise form of the summations in (314) can be verified by considering the case in (312) and (313).) Note that the final step of the iteration is slightly different; this is because the root node has degrees sampled from (the uniform distribution) instead of (the size-biased distribution) in Algorithm 4. Nevertheless, a similar argument holds: here we have and , so by an argument similar to that leading to (306),
[TABLE]
Combining the previous inequality with (307) and (314) then yields
[TABLE]
Next, we recall by definition. Additionally, we have
[TABLE]
where the first equality uses the definition of , the second rearranges summations, and the third uses (182). Combining the previous two equations therefore yields
[TABLE]
Hence, recalling that , and substituting into (296), we have shown
[TABLE]
Clearly, this inequality still holds if we multiply both sides by . Additionally, by A3, for , where and ; since we additionally assume in the statement of the lemma, we conclude for and sufficiently large. For such , we can therefore write
[TABLE]
where the second inequality uses Lemma 9.33 from Appendix 9.4. Additionally, since , we can use the argument leading to (195) to obtain (for some independent of ) whenever and is sufficiently large. For such , we obtain
[TABLE]
Now since was arbitrary, we can choose . Upon substituting into the exponent in the previous equation, this exponent becomes
[TABLE]
where the inequality simply uses and (for large ). Now note that since , we have (for example) for sufficiently large. Additionally, since , we have (for example) for sufficiently large. Combining these observations, we can upper bound (331) as
[TABLE]
Hence, substituting into (327) gives
[TABLE]
Finally, we write
[TABLE]
where the first equality is the law of total expectation, the inequality uses (333) and upper bounds a probability by 1, and the second equality uses the assumptions in the statement of the lemma.
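The proof above treats the root separately because its degrees follow the uniform distribution rather than the size-biased one. A minimal sketch of the difference follows, assuming that "uniform" means picking a node uniformly at random and "size-biased" means picking a node with probability proportional to its degree; the single degree list and all function names are hypothetical simplifications, not the construction in Algorithm 4.

```python
import random

def uniform_mean(degrees):
    """Expected degree of a node chosen uniformly at random."""
    return sum(degrees) / len(degrees)

def size_biased_mean(degrees):
    """Expected degree of a node chosen with probability proportional to its degree."""
    return sum(d * d for d in degrees) / sum(degrees)

def sample_uniform(degrees, rng):
    """Draw the degree of a uniformly chosen node."""
    return rng.choice(degrees)

def sample_size_biased(degrees, rng):
    """Draw the degree of a node chosen proportionally to degree (degrees must be positive)."""
    return rng.choices(degrees, weights=degrees, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    degrees = [1, 1, 2, 4]
    draws = [sample_size_biased(degrees, rng) for _ in range(20000)]
    print("uniform mean:        ", uniform_mean(degrees))
    print("size-biased mean:    ", size_biased_mean(degrees))
    print("empirical size-biased:", sum(draws) / len(draws))
```

High-degree nodes are over-represented under size-biased sampling, so the two expectations generally differ, which is why the last step of the iteration needs its own (analogous) bound.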
9.3.3 Where the proof fails in the general case
As shown in Appendix 7.4.1, extending Theorem 6.6 to the case amounts to showing that for some ,
[TABLE]
where is the appropriate limit from (104). Here we show (roughly) why the approach from the preceding proof fails to establish (336) in the case . To begin, we note we first used the assumption following (323). Hence, in the case , we can still follow the approach leading to (323) to obtain the (one-sided) bound
[TABLE]
where the approximate equality uses on by Lemma 7.15. We next note
[TABLE]
where the inequality discards nonnegative terms, the first equality is by definition of , the second rearranges summations and multiplies/divides by , and the third uses (182). Hence, we have shown (339) is (roughly) lower bounded by
[TABLE]
where we have also used for large on when by A3. Now we consider three cases for the exponent in the previous expression:
- : Here Lemma 7.15 states for large on ; for such , the exponent is roughly
[TABLE]
where the inequality holds for large (so that , which holds since ) and the equality holds by choosing the minimizing (namely, ). Since this lower bound is constant in , (339) does not decay as grows.
- : Here Lemma 7.15 states for large on . An argument similar to the previous case shows (339) does not decay as grows.
- with : Here we consider an example to show (339) does not decay sufficiently quickly for the general case. In particular, we assume for some constant that satisfies the theorem assumptions and we set . Then since per A3, we have e.g. for large . Hence,
[TABLE]
where the first inequality holds by (189) in Appendix 9.2.1 (where are arbitrarily small, hence the approximate inequality), the second holds for our chosen , and the third holds for some constant and for large . Hence, the exponent is (roughly) lower bounded by
[TABLE]
where the equality holds for the minimizer . From here it follows that (339) cannot be : if it is, we have for all large and for some constant ,
[TABLE]
The final inequality is a contradiction, since as .
9.4 Auxiliary results
In this appendix, we collect several auxiliary results used in other proofs. (These results are either cited from other sources, or their proofs are computationally heavy but elementary, so we collect them here to avoid cluttering other parts of our analysis.)
Lemma 9.31.
For , , and s.t. , we have
[TABLE]
Proof 9.32.
We consider the three cases of (348) in turn; the proof of (349) follows the same approach.
First, suppose . Then since and , we have for sufficiently large , which implies for such . Clearly, we also have for all . Taken together, it follows that for large. For such , we can then write
[TABLE]
where we used in the denominator. Now since and , , so taking in the above inequality gives the result.
Next, suppose . Since by and , it suffices to show as . First, since , s.t. . Further, since , s.t. . Hence, ,
[TABLE]
Next, we note
[TABLE]
Hence, s.t. ,
[TABLE]
Combining these arguments, we obtain
[TABLE]
Since both bounds converge to as , follows.
Finally, suppose . First, we observe
[TABLE]
where the inequality holds for s.t. (which indeed occurs for large ; see proof of case), since then the sum is over terms, each upper bounded by 1. On the other hand, we can use the binomial theorem to write
[TABLE]
Next, we observe (assuming as above)
[TABLE]
where the first inequality replaces negative terms with positive ones; the second inequality uses , , and for ; and the third inequality upper bounds the summation by replacing its upper limit with infinity. Hence, (355), (356), and (358) yield
[TABLE]
where the final equality holds since by assumption.
Lemma 9.33.
Let . Then for any ,
[TABLE]
Proof 9.34.
For , define . Then
[TABLE]
Assuming temporarily that whenever (which we will return to prove),
[TABLE]
Hence, using the previous two equations, we obtain , i.e. the sequence decreases in . It is also clearly nonnegative. Therefore,
[TABLE]
To further bound the right hand side, we note
[TABLE]
where the first equality uses the definition of , the second rearranges summations, the third uses the binomial theorem, the fourth is immediate, the inequality is immediate, and the final equality computes a geometric series. Combining the previous two inequalities proves the lemma.
We return to prove whenever . For this, we first claim
[TABLE]
We prove (369) by induction on . First, when , the only case to prove is ; when , it is immediate that both sides of (369) equal . Next, assume (369) holds for . If , both sides of (369) equal . If , we write
[TABLE]
where the first equality simply writes the final summands separately, the second uses the inductive hypothesis on the term in parentheses, the third is immediate, the fourth uses Pascal’s rule ( has subsets of cardinality ; that contain 1 and that do not contain 1), and the fifth is immediate. This establishes (369). We then write
[TABLE]
where the first equality holds by definition of , the second adds and subtracts a term, and the third uses (369). This shows ; iterating then gives whenever .
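The induction above invokes Pascal's rule through a counting argument: subsets of a given cardinality split into those that contain the element 1 and those that do not. A minimal sketch that checks this split by brute force is given below; the generic names `n` and `k` and the function `split_subsets` are introduced here for illustration and do not match the paper's indices.

```python
from itertools import combinations
from math import comb

def split_subsets(n, k):
    """Count the k-element subsets of {1, ..., n} that contain 1 and those that do not."""
    with_one = 0
    without_one = 0
    for subset in combinations(range(1, n + 1), k):
        if 1 in subset:
            with_one += 1
        else:
            without_one += 1
    return with_one, without_one

if __name__ == "__main__":
    for n in range(1, 7):
        for k in range(1, n + 1):
            w, wo = split_subsets(n, k)
            # Pascal's rule: C(n, k) = C(n-1, k-1) + C(n-1, k)
            print(n, k, w, wo, comb(n, k))
```

Subsets containing 1 are determined by choosing the remaining `k - 1` elements from the other `n - 1`, and subsets avoiding 1 by choosing all `k` elements from the other `n - 1`, which gives the two terms of the rule.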
Lemma 9.35.
Let be a random variable satisfying and , and let . Then
[TABLE]
Proof 9.36.
See, e.g., [35, Lemma 5.1].
10 Belief convergence results
In this appendix, we prove two belief convergence results. We first discuss some basic tools used in both proofs. To begin, fix and , and let be a probability measure over . Then clearly,
[TABLE]
where is distributed as in the final expression. Also, for any , since , we clearly have
[TABLE]
Furthermore, for any such that , by convexity,
[TABLE]
which implies
[TABLE]
Combined with (380) and (381), we obtain
[TABLE]
Next, we recall some notation and basic results from Appendix 7.1 and 9.1.1. Denote by and the parameters and in vector form. Set , where is the graph’s column-normalized adjacency matrix (normalized so each column sums to ). Let be the all ones vector. Then (see (162) in Appendix 9.1.1)
[TABLE]
Hence, by column stochasticity of , we obtain the following componentwise inequality:
[TABLE]
Finally, for any , we let denote a random variable. Thus, recalling the expressions for the mean and variance of the beta distribution, for any , Chebyshev’s inequality implies
[TABLE]
In particular, for any and , since by definition, the previous two inequalities imply
[TABLE]
Hence, because is the distribution, we can use (384) to obtain the following:
[TABLE]
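The Chebyshev step for beta-distributed random variables can be illustrated numerically. The sketch below uses the standard mean and variance of a Beta(a, b) distribution and compares the resulting bound with a Monte Carlo estimate; the parameter values and the function names are hypothetical choices made for illustration only.

```python
import random

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) random variable."""
    s = a + b
    return a / s, (a * b) / (s * s * (s + 1))

def chebyshev_bound(a, b, eps):
    """Chebyshev: P(|X - E[X]| >= eps) <= Var(X) / eps^2, capped at 1."""
    _, var = beta_mean_var(a, b)
    return min(1.0, var / eps**2)

def empirical_deviation(a, b, eps, samples=20000, seed=0):
    """Monte Carlo estimate of P(|X - E[X]| >= eps) for X ~ Beta(a, b)."""
    rng = random.Random(seed)
    mean, _ = beta_mean_var(a, b)
    hits = sum(abs(rng.betavariate(a, b) - mean) >= eps for _ in range(samples))
    return hits / samples

if __name__ == "__main__":
    for a, b in [(3, 1), (30, 10), (300, 100)]:
        print(a, b, chebyshev_bound(a, b, 0.2), empirical_deviation(a, b, 0.2))
```

With the ratio of the two parameters held fixed, the variance shrinks as their sum grows, so the bound tightens and the distribution concentrates around its mean.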
10.1 Proof of Proposition 1
Fix . We first show Let . Recall for and for , so componentwise. Combined with (385) and (386), we obtain
[TABLE]
Hence, because is bounded independent of , it suffices to show that for any and all large,
[TABLE]
Toward this end, we first observe that for any , i.e., is a set of absorbing states in the Markov chain with transition matrix . By the assumption of the proposition, all agents can reach this set. Taken together, we have an absorbing Markov chain with absorbing states and non-absorbing states . It follows that as . Hence, we can find such that whenever . Thus, for any , we obtain the desired inequality (391):
[TABLE]
Next, we show Fix . Since , we can find such that
[TABLE]
Hence, if we let and , then , so by (389) and (393), for any ,
[TABLE]
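The proof above relies on a standard fact about absorbing Markov chains: if every non-absorbing state can reach the absorbing set, the probability of still being in a non-absorbing state vanishes as the number of steps grows. A minimal sketch on a hypothetical three-state chain follows; it uses a row-stochastic transition matrix (an assumed convention, whereas the paper normalizes columns), and the matrix entries and helper names are illustrative only.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)] for row in A]

def mat_power(P, t):
    """Compute P^t by repeated multiplication (t >= 0)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

def transient_mass(P, t, transient):
    """Largest probability, over transient starting states, of being transient after t steps."""
    Pt = mat_power(P, t)
    return max(sum(Pt[i][j] for j in transient) for i in transient)

# State 0 is absorbing (playing the role of a stubborn agent); states 1 and 2 are not.
# State 1 reaches state 0 directly, and state 2 reaches it through state 1.
P = [
    [1.0, 0.0, 0.0],
    [0.3, 0.2, 0.5],
    [0.0, 0.6, 0.4],
]

if __name__ == "__main__":
    for t in (1, 10, 50, 100):
        print(t, transient_mass(P, t, [1, 2]))
```

The printed values decrease toward zero, which is the numerical counterpart of choosing a time after which the relevant entries fall below any prescribed threshold.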
10.2 Proof of Corollary 1
Fix and . Similar to the proof of Proposition 1, we set and . Then by (389),
[TABLE]
Hence, because as by A4, we conclude that for all sufficiently large,
[TABLE]
Combined with Theorem 1, we thus obtain
[TABLE]
References
- [1] E. Shearer and J. Gottfried, “News use across social media platforms 2017,” Pew Research Center, Journalism and Media, 2017.
- [2] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, 2017.
- [3] P. Savodnik, “‘You start seeing the dreaded sensitivity label’: Is a pro-Trump Twitter army strategically throttling Biden ads?” https://www.vanityfair.com/news/2020/10/is-a-pro-trump-twitter-army-strategically-throttling-biden-ads, 2020.
- [4] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, “Non-Bayesian social learning,” Games and Economic Behavior, vol. 76, no. 1, pp. 210–225, 2012.
- [5] C. Shao, G. L. Ciampaglia, O. Varol, K.-C. Yang, A. Flammini, and F. Menczer, “The spread of low-credibility content by social bots,” Nature Communications, vol. 9, no. 1, p. 4787, 2018.
- [6] M. Azzimonti and M. Fernandes, “Social media networks, fake news, and polarization,” National Bureau of Economic Research, Tech. Rep., 2018.
- [7] B. Golub and M. O. Jackson, “Naive learning in social networks and the wisdom of crowds,” American Economic Journal: Microeconomics, vol. 2, no. 1, pp. 112–49, 2010.
- [8] D. Acemoglu, A. Ozdaglar, and A. Parandeh Gheibi, “Spread of (mis)information in social networks,” Games and Economic Behavior, vol. 70, no. 2, pp. 194–227, 2010.
