Gaps in Information Access in Social Networks
Benjamin Fish, Ashkan Bashardoust, danah boyd, Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian

TL;DR
This paper addresses the issue of unequal information access in social networks by proposing a maximin welfare approach to ensure fairer dissemination, analyzing its theoretical properties, computational challenges, and practical effectiveness.
Contribution
It introduces the maximin social welfare function for fair information spread, proving its effectiveness in reducing access gaps and providing practical algorithms.
Findings
Maximin welfare constrains access gaps effectively.
Maximizing expected reach does not reduce access disparities.
A simple greedy strategy performs well empirically.
| Name | Nodes | Edges | Direction |
|---|---|---|---|
| EU (Leskovec et al., 2007) | 803 | 24729 | Directed |
| Arenas (kon, 2017; Guimerà et al., 2003) | 1133 | 5451 | Directed |
| Irvine (Opsahl and Panzarasa, 2009) | 1294 | 19026 | Directed |
| Facebook (Leskovec and Mcauley, 2012) | 4039 | 24729 | Undirected |
| ca-GrQc (Leskovec et al., 2007) | 4158 | 13428 | Undirected |
| ca-HepTh (Leskovec et al., 2007) | 8638 | 24827 | Undirected |

Average time (s) by algorithm and dataset:

| Algorithm | Arenas | EU | Irvine |
|---|---|---|---|
| Random | 0.007 | 0.015 | 0.012 |
| Gonzalez | 0.021 | 0.031 | 0.033 |
| Naïve Myopic | 0.086 | 0.208 | 0.184 |
| TIM+ | 0.876 | 1.826 | 1.046 |
| Myopic | 8.910 | 19.438 | 16.755 |
| Greedy | 507.35 | 759.296 | 1399.26 |
Gaps in Information Access in Social Networks
Benjamin Fish (Microsoft Research), Ashkan Bashardoust (University of Utah), danah boyd (Data & Society), Sorelle A. Friedler (Haverford College), Carlos Scheidegger (University of Arizona), and Suresh Venkatasubramanian (University of Utah)
(2019)
Abstract.
The study of influence maximization in social networks has largely ignored disparate effects these algorithms might have on the individuals contained in the social network. Individuals may place a high value on receiving information, e.g. job openings or advertisements for loans. While well-connected individuals at the center of the network are likely to receive the information that is being distributed through the network, poorly connected individuals are systematically less likely to receive the information, producing a gap in access to the information between individuals. In this work, we study how best to spread information in a social network while minimizing this access gap.
We propose to use the maximin social welfare function as an objective function, where we maximize the minimum probability of receiving the information under an intervention. We prove that in this setting this welfare function constrains the access gap whereas maximizing the expected number of nodes reached does not. We also investigate the difficulties of using the maximin, and present hardness results and analysis for standard greedy strategies. Finally, we investigate practical ways of optimizing for the maximin, and give empirical evidence that a simple greedy-based strategy works well in practice.
Keywords: fairness; influence maximization; social networks
Published in: Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. DOI: 10.1145/3308558.3313680. ISBN: 978-1-4503-6674-8/19/05.
CCS concepts: Networks → Online social networks; Information systems → Social recommendation; Theory of computation → Graph algorithms analysis.
1. Introduction
Information flow in networks has been a subject of extensive study. Among the many motivations for studying how information propagates in a network are advertising (how can we spread information most effectively on a budget?) and clustering (how do groups form and organize in a network?).
One of the most important questions in this area is how to maximize influence in a social network. Here the goal is to choose where to place initial sources of information so as to maximize the flow of information via word-of-mouth. First formalized by Kempe, Kleinberg, and Tardos (Kempe et al., 2003), there has been a long series of work in the literature on influence maximization.
However, this work has not typically focused on the impact that the information has on the individuals in the network. For example, one important application of information flow in networks is for recruitment. Social networks like LinkedIn are increasingly used to provide access to jobs and information that can greatly impact an individual’s career development. Often just as important as the individuals themselves are the connections between individuals – their social networks – in making hiring decisions. This is because information transmitted amongst social networks may accrue amongst the best-connected individuals in the network. As the adage goes, “it’s not what you know, but who you know.” With more and more of our social life mediated through online networks, the role that networks play in opening up opportunities is increasingly important. This includes not only recruitment, but also advertising and other kinds of marketing.
However, network structure can create haves and have-nots in the game of access. Insiders who are well-connected in the network have easier access to relevant information about opportunities for advancement that can in turn lead to even better connections. Outsiders who lack access to such information will find it much harder to improve their network status. This access gap may lead to a form of inequality that is different from the traditional forms of inequality based on class, race, gender, or other attributes, but nonetheless provides a significant challenge.
Thus, we are concerned with each individual’s access to information and not just the number of people reached or the amount of information being distributed. How might we ensure that the access gap in information is reduced? Rather than asking how far we can spread information on a budget, we instead ask which people are getting the information we’re spreading.
1.1. Our Work
How can we formulate a notion of equitable access to information in a network, and how might we intervene in a network (on a budget) to minimize the gap in access to information? In particular, we examine how best to add seeds (individuals who start with the information) to a network to minimize this gap in access.
We propose a new measure of access in a network. In contrast to previous work that maximizes the average probability that an individual receives the information (max reach), we instead propose to maximize the minimum probability. We formalize access as a social welfare function that assigns a real value to the set of utilities received by the individuals, in this case the probabilities of receiving the information. This allows us to evaluate the notions of access themselves: we consider a notion of access to be better if interventions that optimally maximize that notion do not widen the access gap. We show that every notion of access (amongst a wide class of such functions) does to some degree permit the access gap to increase in the worst case. On the other hand, if the access gap increases between two groups of individuals after an intervention, we show that our proposed notion of access at least prohibits situations where the access does not increase at all for the group which started off with less access to the intervention. Perhaps surprisingly, we show in Section 3 that a very large class of natural notions of access (including maximum reach) does not have this very basic prohibition. We desire this because without such a prohibition, in the worst case there’s nothing stopping interventions from creating one permanently and significantly advantaged group with access to information and one group without any such access, which we regard as blatantly undesirable.
We show that maximizing the minimum probability is NP-hard, hard even to approximate well, and moreover that a number of standard greedy strategies have asymptotically worst-possible approximation ratios. Nonetheless, we show via experiments that a very simple greedy strategy performs well in practice: namely, choose the seeds to be the vertices currently estimated as having the smallest probabilities of receiving the information. We also demonstrate that by using this strategy, we decrease the correlation between vertices’ probability of receiving the information and their location in the network, indicating that our measure of access is not merely a proxy for (static) network structure.
Limitations
We recognize that asking to maximize the minimum probability of access to information ignores the fact that not all individuals in a network might need a particular piece of information. For example, a hiring ad should be spread widely, but only to candidates who are eligible, are in the right geographic areas, and have desirable qualifications. More generally, interventions to improve access to information might themselves cause feedback loops (both virtuous and vicious): our work does not consider those dynamics. Nor does our work consider other notions of utility, like those that take into account the benefits of receiving the information more than once. We leave study of these issues for future work.
In summary, our main contributions are as follows.
- We propose a new measure of information access in a network. We demonstrate that this measure captures certain axiomatically desirable properties of any notion of equal access, and further that existing notions including the well-studied maximum reach concept do not.
- We investigate the problem of maximizing access theoretically, presenting hardness results as well as analysis of standard greedy strategies.
- We do a comprehensive empirical evaluation of heuristics for achieving a high level of access, demonstrating that a greedy-based strategy is quite effective at improving equality of access in a network for a given budget of interventions.
1.2. Related Work
Granovetter’s seminal work on the strength of weak ties (Granovetter, 1977) first broached the idea that network position can confer advantages or disadvantages (including in hiring scenarios). Indeed, weak ties can influence success in hiring and careers (Granovetter, 1983). In an algorithmic setting, boyd, Levy, and Marwick (Boyd et al., 2014) illustrate how modern social networks like LinkedIn might be vehicles for a more direct propagation of advantage and disadvantage. In that light, our work, which focuses on how to mitigate such effects in the context of information access, falls into the paradigm explored by fairness-aware decision-making in which the goal is to design decision-making systems that ensure the end result is non-discriminatory to individuals or groups of individuals. Our work can be viewed as an attempt to quantify one aspect of social capital, a notion introduced by Coleman (Coleman, 1988) to capture how social standing within a system could be interpreted as a resource that has utility for an agent. Recently, Benthall and Haynes (Benthall and Haynes, 2019) consider how to use a social network to define racial aspects of social standing, but don’t consider interventions in the social network.
Rather than directly model an explicit fair goal for a decision in this setting, via assuming we have access to a sensitive feature like race on which we would focus our attention, we instead model the utility that each individual receives. This formalizes how best to optimize for access to information without necessarily requiring equal access. While most of the literature in algorithmic fairness uses equality-based definitions (Dwork et al., 2012; Romei and Ruggieri, 2013; Fish et al., 2016; Hardt et al., 2016; Feldman et al., 2015; Zafar et al., 2017; Narayanan, 2018) (typically either group fairness or individual fairness), the welfare approach to fairness that we use is starting to become more popular. For example, Heidari et al. (Heidari et al., 2018) propose a specific welfare function to use for classification and regression problems.
Our choice of welfare function is based on axiomatic considerations: by determining which functions satisfy specific mathematical criteria used to model gaps in access. The resulting function that seeks to maximize the minimum probability of receiving information bears some resemblance to the difference principle outlined by Rawls (Rawls, 2009), in that it seeks to intervene so as to provide benefit to the “least-advantaged”, here interpreted as those with the least probability of access.
Our work relies on a framework for information propagation that comes from the broad area of influence maximization. Influence maximization seeks ways to spread information in a network efficiently using a small collection of seeds. The typical measure of information spread used is the expected number of nodes that receive the information (the max reach measure). While influence maximization assigns the same utility to an individual as we do, the welfare function in that setting is just the sum of the individual utilities. This utilitarian approach was initiated by Domingos and Richardson (Richardson and Domingos, 2002) and is formalized as a discrete optimization problem in Kempe, Kleinberg, and Tardos (Kempe et al., 2003). There is also work into making this process faster (Chen et al., 2009; Tang et al., 2014a) or suitable for more general situations, where factors like pricing must be taken into account (Arthur et al., 2009).
A related body of algorithmic work (Garimella et al., 2017; Matakos et al., 2017; Musco et al., 2018) posits that one way to decrease polarization in social networks is to connect people with opposing views by exposing them to new information. Such work differs in focus and approach to modeling from this work because that work is concerned with poor connectivity between communities and we are concerned with individuals who are simply poorly connected.
2. Definitions
Let $G = (V, E)$ be a graph with $n$ nodes. To describe information flow in $G$ we will use a standard probabilistic model for how information travels: the independent cascade (IC) model (Kempe et al., 2003). In this model, a node either possesses information or not. A set $S$ of seed nodes starts out with the information, and information flow proceeds in rounds. Each newly informed node informs each of its neighbors in the next round independently with transmission probability $\alpha$. Once a node is informed, it stays informed, and after that one round it no longer passes on the message. In this work, we will use the IC model with a fixed probability of transmission.
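To make the model concrete, the following is a minimal sketch of a single run of the independent cascade process. It is an illustration rather than the authors' implementation: the function name `simulate_ic`, the adjacency-dictionary input format, and the example path graph are all assumptions made here. Nodes are processed from a queue instead of in explicit rounds, which does not change the distribution of the final informed set because each informed node still gets exactly one independent attempt per neighbor.

```python
import random
from collections import deque

def simulate_ic(adj, seeds, alpha, rng):
    """Run one independent cascade and return the set of informed nodes.

    adj maps each node to a list of its (out-)neighbors (hypothetical format).
    alpha is the fixed transmission probability.
    """
    informed = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        # A newly informed node gets one chance to inform each neighbor.
        for v in adj.get(u, ()):
            if v not in informed and rng.random() < alpha:
                informed.add(v)
                frontier.append(v)
    return informed

# Undirected path 0 - 1 - 2 - 3, seeded at node 0 (illustrative example).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(simulate_ic(path, [0], alpha=0.5, rng=random.Random(0)))
```

With `alpha=1.0` every node reachable from a seed is informed, and with `alpha=0.0` only the seeds are, which gives two quick sanity checks for the sketch.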
Welfare Functions
In the IC model with parameter $\alpha$, we can associate with each vertex $v_i$ the probability $p_i$ that $v_i$ is informed after all information has been passed. We now define a social welfare function $\mathcal{W}$ to represent how effectively information is spread: it takes as input the probability of being informed for each vertex, and outputs the overall welfare.
Definition 1.
The welfare of a set of vertices $T$ in $G$ with seed set $S$ is $\mathcal{W}_G(T; S) = \mathcal{W}(\{p_i : v_i \in T\})$, where each $p_i$ is computed under seed set $S$. If $T$ is all vertices, we abbreviate this as $\mathcal{W}_G(S)$.
When the graph is clear from context, we will omit the subscript and write $\mathcal{W}(T; S)$ and $\mathcal{W}(S)$ respectively.
Seed sets represent an intervention in the information network. Thus, a primary goal in the study of information flow is to find a budgeted intervention: a set of seeds of size no more than $k$ for a given graph $G$ (possibly with initial seeds $S$) with maximum welfare
$$S^* = S \cup \operatorname*{arg\,max}_{S' \subseteq V,\ |S'| \le k} \mathcal{W}(S \cup S').$$
In other words, $S^*$ is the initial seeds along with a set of at most $k$ vertices which maximizes access for $V$. Later, we will also consider the set of seeds that maximize welfare for a particular set of vertices $T$:
$$S^*_T = S \cup \operatorname*{arg\,max}_{S' \subseteq V,\ |S'| \le k} \mathcal{W}(T; S \cup S').$$
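Kempe, Kleinberg, and Tardos (Kempe et al., 2003) and subsequent work use as their welfare function reach, the expected number of nodes reached. In our notation, and normalizing to make it conveniently $[0,1]$-valued, this becomes the following:
Definition 2 (Reach).
$\mathcal{W}_{\text{reach}}(S) = \frac{1}{n} \sum_{i=1}^{n} p_i$, where $p_i$ is the probability that vertex $v_i$ is informed under seed set $S$.
We can easily generalize this to a wider class of notions of welfare. We consider generalized means:
Definition 3 ($p$-mean).
$\mathcal{W}_p(x_1, \ldots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^p \right)^{1/p}$, where $x_i$ is the probability that vertex $v_i$ receives the information.
Note in the limit, this becomes the geometric mean for $p \to 0$, the minimum for $p \to -\infty$, and the maximum for $p \to \infty$. In other words, $\mathcal{W}_{-\infty}(x_1, \ldots, x_n) = \min_i x_i$.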
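The generalized mean and its limiting cases can be sketched in a few lines. This is an illustrative helper, not code from the paper: the name `p_mean` and the convention of returning 0 when a non-positive exponent meets a zero probability (the limiting value) are assumptions made here.

```python
import math

def p_mean(xs, p):
    """Generalized (power) mean of the probabilities xs with exponent p."""
    n = len(xs)
    if p == -math.inf:
        return min(xs)          # maximin welfare
    if p == math.inf:
        return max(xs)
    has_zero = any(x == 0 for x in xs)
    if p == 0:
        # Geometric mean, the limit of the power mean as p -> 0.
        if has_zero:
            return 0.0
        return math.exp(sum(math.log(x) for x in xs) / n)
    if p < 0 and has_zero:
        return 0.0              # limiting value when some probability is zero
    return (sum(x ** p for x in xs) / n) ** (1.0 / p)

probs = [0.2, 0.8]
for p in (1, 0, -1, -50, -math.inf):
    print(p, p_mean(probs, p))
```

Setting `p = 1` recovers normalized reach (the arithmetic mean), and increasingly negative exponents move the value toward the minimum probability.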
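We say that a function $\mathcal{W}(x_1, \ldots, x_n)$, each $x_i$ representing the probability that a node receives the information, is monotonically increasing if $\mathcal{W}(x_1, \ldots, x_n) \le \mathcal{W}(y_1, \ldots, y_n)$ when $x_i \le y_i$ for all $i$. A function is strictly monotonically increasing if $\mathcal{W}(x_1, \ldots, x_n) < \mathcal{W}(y_1, \ldots, y_n)$ when $x_i \le y_i$ for all $i$ and in addition there is some $j$ such that $x_j < y_j$. $\mathcal{W}$ is symmetric if $\mathcal{W}(x_1, \ldots, x_n) = \mathcal{W}(x_{\pi(1)}, \ldots, x_{\pi(n)})$ for all permutations $\pi$.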
In this work, we restrict our attention to symmetric, monotonically increasing welfare functions so that no vertex is privileged above the others and, all else equal, increasing an individual’s probability of receiving the information is never undesirable. The $p$-means are such functions. Moreover, if a continuous welfare function satisfies four natural conditions (symmetry, strictly monotonically increasing, independence of unconcerned agents, and independence of common scale), then as a consequence of the Debreu-Gorman theorem (Debreu, 1959; Gorman, 1968) the only such welfare functions up to ordering over preferences are the $p$-means (Heidari et al., 2018; Roberts, 1980), as long as all probabilities are non-zero. (Footnote: Independence of common scale means that the ordering over alternatives should not change when multiplying each probability by a common positive factor, and independence of unconcerned agents means that the ordering should be independent of a probability that doesn’t change, i.e. if $\mathcal{W}(x_1, \ldots, z, \ldots, x_n) \le \mathcal{W}(y_1, \ldots, z, \ldots, y_n)$ with $z$ in the same position on both sides, then $\mathcal{W}(x_1, \ldots, z', \ldots, x_n) \le \mathcal{W}(y_1, \ldots, z', \ldots, y_n)$ for all $z'$.) In other words, at least in the case of connected undirected graphs, $p$-means are an extremely wide class of symmetric, monotonically increasing welfare functions, making them a natural class to examine.
3. Gaps in Access
Optimizing a welfare function is a way to improve access to information in the aggregate. But our concern in this work is whether individuals or subgroups are being left behind in the process. Is it possible that even though an aggregate measure of information access is increasing, the gap in information access between groups is getting larger? In this section, we will focus on evaluating welfare functions with respect to information access properties we would like to ensure.
We now define the access gap, which captures how much better some individuals are doing than others.
Definition 4.
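The access gap of a (non-trivial) partition $V = V_1 \cup V_2$ of the vertices of a graph $G$ with seed set $S$ under a welfare function $\mathcal{W}$ is
$$\left| \mathcal{W}(V_1; S) - \mathcal{W}(V_2; S) \right|,$$
where $\mathcal{W}(V_j; S)$ denotes the welfare of the vertices in $V_j$ when the seed set is $S$.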
Note we only define the access gap over bipartitions, rather than arbitrary subsets. This is to prevent the following situation: Given a partition of and initial seed set , suppose are both very large, but is much smaller. Consider , the optimal seed set for this graph, and suppose now . We now have a gap between the access of and , but this gap was a by-product of significantly increasing the access of . Since this may well be desirable behavior, we preclude this situation by only considering gaps between bipartitions.
In particular, we want to know when the access gap increases. We call this the rich getting richer phenomenon.
Definition 5 (Rich get richer).
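In a graph $G$ with initial seeds $S$ under a welfare function $\mathcal{W}$, we say that the rich get richer if there is a (non-trivial) partition $V = V_1 \cup V_2$ where the optimal intervention $S^*$ satisfies
$$\left| \mathcal{W}(V_1; S^*) - \mathcal{W}(V_2; S^*) \right| > \left| \mathcal{W}(V_1; S) - \mathcal{W}(V_2; S) \right|,$$
that is, the access gap between the two groups is larger after the intervention than before it.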
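The access gap and the rich-get-richer condition can be checked numerically once per-vertex probabilities are available. The sketch below is a hedged illustration: the helper names `access_gap` and `rich_get_richer`, the vertex labels, and the probability values are all made up for this example. It tests only one given partition and one given pair of before and after probability assignments, whereas the definition quantifies over all partitions and uses the optimal intervention.

```python
def access_gap(welfare, probs, group1, group2):
    """Absolute difference in welfare between two groups of vertices."""
    w1 = welfare([probs[v] for v in group1])
    w2 = welfare([probs[v] for v in group2])
    return abs(w1 - w2)

def rich_get_richer(welfare, probs_before, probs_after, group1, group2):
    """True if the access gap for this partition grew after the intervention."""
    before = access_gap(welfare, probs_before, group1, group2)
    after = access_gap(welfare, probs_after, group1, group2)
    return after > before

def mean(xs):
    """Normalized reach restricted to a group."""
    return sum(xs) / len(xs)

# Illustrative probabilities (not from the paper): vertex "b" becomes a seed.
before = {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.25}
after = {"a": 1.0, "b": 1.0, "c": 0.25, "d": 0.25}
g1, g2 = ["a", "b"], ["c", "d"]

print(access_gap(mean, before, g1, g2), access_gap(mean, after, g1, g2))
print(rich_get_richer(mean, before, after, g1, g2))
```

Passing `min` instead of `mean` as the welfare function evaluates the same partition under the maximin notion of access.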
Unfortunately, stopping the rich from getting richer in arbitrary graphs may be too much to hope for. Even simple examples show that under many notions of welfare, including all $p$-means, the rich get richer.
Proposition 3.1.
Suppose is symmetric, increasing, and satisfies the following condition: For any , …, in , there is some such that
[TABLE]
Then under , when , there exists a graph and initial seed set where the rich get richer.
Note that the upper bound in this third condition is easy to satisfy; it suffices that is strictly less than when not all of the are equal to each other. In addition the assumption that is only assumed for the sake of convenience, since is monotonic in .
Proof.
Consider the example graph in Figure 1 and suppose . Let and . Note that , , and . Then . Yet , so we have .
What is the optimal seed to add? If we add to the seeds, then we have and . Otherwise, if we add to the seeds, then , , and . Note by symmetry and monotonicity, so without loss of generality the optimal modification is to make a seed. Then it is easy to calculate . Thus we have the rich getting richer if . But , so it suffices to show that . Then since and ,
[TABLE]
∎
Proposition 3.1 holds for all -means for . We will show in Section 3.1 that the rich get richer not only for the -mean but a whole other class of welfare functions as well (a consequence of Proposition 3.2). Given this, keeping the rich from getting richer appears to be too much to hope for.
3.1. $k$-imbalance
If we can’t keep the rich from getting richer in the worst case, what can we prevent? A particularly concerning case of the rich getting richer is when the access of the worse-off group doesn’t improve at all. Write $V_1$ for the better-off group and $V_2$ for the worse-off group. The concerning case is then one where $\mathcal{W}(V_1; S) > \mathcal{W}(V_2; S)$ under the initial seeds $S$ and the rich get richer, but $\mathcal{W}(V_2; S^*) \le \mathcal{W}(V_2; S)$ for the set of seeds $S^*$ that maximize welfare $\mathcal{W}$. This might not be so bad if the only way to improve the access of $V_2$ is to increase the access of $V_2$ to the point where it is even higher than that of $V_1$, so that $V_1$ becomes the worse-off group. On the other hand, this situation becomes particularly egregious when in addition $\mathcal{W}(V_2; S^*_{V_2}) \le \mathcal{W}(V_1; S)$, i.e. the optimal improvement for $V_2$ still does not improve the access of $V_2$ to the point where it is larger than the access that $V_1$ started out with prior to intervention (recall that $S^*_{V_2}$, defined in Section 2, is the seed set that maximizes welfare for $V_2$). If this can happen when adding $k$ seeds, we will call $\mathcal{W}$ $k$-imbalanced. That is, $k$-imbalance is a particularly egregious form of the rich getting richer. If $\mathcal{W}$ is not $k$-imbalanced for any $k$, we will call it balanced.
We believe that balance is a natural desideratum because it rules out interventions that never help the worse-off group at all. Stronger versions of preventing disparity in access may still be preferred, like preventing the rich from getting richer, so balance may only represent a necessary but not sufficient condition for preventing disparity. In this section, we show that a wide class of welfare functions $\mathcal{W}$ is imbalanced, but that the maximin welfare $\mathcal{W}_{-\infty}$ (the minimum probability of receiving the information) is balanced.
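Definition 6 ($k$-imbalance).
A welfare function $\mathcal{W}$ is $k$-imbalanced if there exists a graph $G$ with initial seed set $S$ and partition $V = V_1 \cup V_2$ of the vertices where the optimal intervention $S^*$ and the optimal intervention $S^*_{V_2}$ for $V_2$, each under the addition of no more than $k$ seeds, satisfy the following:
- (1) $\mathcal{W}(V_2; S^*_{V_2}) > \mathcal{W}(V_2; S)$. *(There is a set of seeds to add that improves the access of $V_2$.)*
- (2) $\mathcal{W}(V_1; S) \ge \mathcal{W}(V_2; S^*_{V_2})$. *(Not only does $V_1$ start off with more access than $V_2$, but $V_1$ starts off with more access than $V_2$ can possibly achieve.)*
- (3) $\mathcal{W}(V_1; S^*) > \mathcal{W}(V_1; S)$. *(The access of $V_1$ improves.)*
- (4) $\mathcal{W}(V_2; S^*) \le \mathcal{W}(V_2; S)$. *(The access of $V_2$ does not improve.)*
In other words, a welfare function is imbalanced if
$$\mathcal{W}(V_2; S^*) \le \mathcal{W}(V_2; S) < \mathcal{W}(V_2; S^*_{V_2}) \le \mathcal{W}(V_1; S) < \mathcal{W}(V_1; S^*).$$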
Note that it is immediate that if is -imbalanced for any , then the rich get richer under . As increases, it should be the case that it becomes more difficult to find examples of imbalance, as it is harder to avoid improving the access of . Nonetheless, we can show that a wide class of welfare functions, including reach, is -imbalanced:
Proposition 3.2.
Suppose is symmetric and strictly increasing. Then is -imbalanced.
Proof.
It suffices to consider the simplest case: when there is no communication, i.e. is the disjoint graph of vertices. and will each be exactly half of the vertices (for even). The initial seed set will be entirely contained in and will be size . Now we will add an additional seeds. Note first that since is symmetric, each of the vertices (with the exception of the initial seeds) are identical. So is any set of additional seeds in : each additional seed must improve the welfare of because is strictly increasing. But in this case, and become identical, so we have . But by symmetry, the optimal seeds to add can be any vertices, in which case we can assume they are all in . Thus the welfare of does not increase while the welfare of does. ∎
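The construction in this proof can be checked on a tiny instance. The sketch below uses reach on a graph with no edges, where a vertex is informed exactly when it is a seed. The concrete sizes (eight vertices, two initial seeds, a budget of two additional seeds) are illustrative assumptions chosen here, not values taken from the paper. The five numbers it prints are, in order: the worse-off group after the overall-optimal intervention, the worse-off group initially, the best the worse-off group could achieve with the same budget, the better-off group initially, and the better-off group after the overall-optimal intervention.

```python
def reach(group, seeds):
    """Reach of `group` in an edgeless graph: the fraction of the group that is seeded."""
    return sum(1 for v in group if v in seeds) / len(group)

k = 2                        # illustrative budget of additional seeds (assumption)
V1, V2 = [0, 1, 2, 3], [4, 5, 6, 7]
S = {0, 1}                   # initial seeds, all inside V1
S_star = S | {2, 3}          # optimal for overall reach: any k new vertices tie
S_star_V2 = S | {4, 5}       # optimal for V2: all k new seeds placed in V2
assert len(S_star - S) == k and len(S_star_V2 - S) == k

chain = [
    reach(V2, S_star),       # V2 after the overall-optimal intervention
    reach(V2, S),            # V2 initially
    reach(V2, S_star_V2),    # best V2 can achieve with k extra seeds
    reach(V1, S),            # V1 initially
    reach(V1, S_star),       # V1 after the overall-optimal intervention
]
print(chain)  # prints [0.0, 0.0, 0.5, 0.5, 1.0]
```

In this instance the worse-off group gains nothing from the overall-optimal intervention, could at best have caught up to where the better-off group started, and the better-off group pulls further ahead, which is the pattern the proof describes.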
It turns out that balance is a useful definition, insomuch as it is actually possible to achieve.
Proposition 3.3.
*The maximin welfare $\mathcal{W}_{-\infty}$ is balanced.*
Proof.
Suppose is imbalanced, witnessed by some partition of and initial seed set . Recall imbalance implies that . Then by definition of , the vertex with minimum probability is in , i.e. . Remember maximizes the minimum probability, and , so there is at least one graph that increases that minimum probability, which in turn means that does as well. Thus , a contradiction. ∎
On the other hand, the maximin welfare $\mathcal{W}_{-\infty}$ is a special case, and every other $p$-mean is maximally imbalanced: there exists a graph, initial seed set, and partition of the vertices that verifies the other $p$-means are imbalanced.
Proposition 3.4.
For is -imbalanced.
Proof.
If , so is the maximum probability, then as soon as the graph has at least one seed, then , and any added seeds after that don’t change the value, so is trivially -imbalanced. Otherwise, if , is strictly increasing, and from Proposition 3.2 we know it is -imbalanced. And if , then is strictly increasing once all probabilities are non-zero, at which point we use a similar tactic to when is strictly increasing, except we will need a connected graph. We will use the star graph, with one central vertex the seed, and all other vertices connected to that seed. In addition there will be some additional seeds, all in , which consists of those seeds, the central seed, plus more vertices. will be the other vertices, so that is nodes. Our goal will be to add an additional seeds. Remember, since is connected (all vertices have non-zero probability) is strictly increasing. Then the optimal graph for is to add all additional seeds to , in which case we have vertices with probability 1 and vertices with probability . But in is exactly the same, so we have . However, all non-seeds are isomorphic, so we may assume all new seeds are added to . ∎
We note that one could consider many variations of -means, including replacing mean with median, maximum with minimum, etc. These variations do not affect the results that we present here. We defer a detailed analysis of these variations to the full version of the paper.
4. Maximin access
The previous section established the maximin welfare $\mathcal{W}_{-\infty}$ (the minimum probability of receiving the information) as a better access measure than others, at least when it comes to achieving balance. We now study the problem of maximizing $\mathcal{W}_{-\infty}$, which we call the maximin access problem. We start by showing that this is NP-hard even to approximate well.
Theorem 4.1.
Suppose . Then choosing seeds to maximize min access is NP-hard. In this case, the maximin access cannot be approximated better than and if furthermore then the maximum cannot be attained efficiently without an additional factor seeds.
Proof.
We reduce from Set Cover, where an instance is defined by a collection of subsets over a ground set and an integer , and the decision problem is whether or not there is a collection of subsets whose union is . Further, we can assume . Given such an instance, we construct a directed graph (example shown in Figure 2). We start with the natural directed bipartite graph corresponding to a set cover instance, where there is a vertex corresponding with each set and a vertex corresponding with each element . There is a directed edge from to whenever is contained in . We then add a single extra vertex and directed edges from to each vertex corresponding with one of the sets, and ask to maximize the minimum probability by adding seeds.
Since has in-degree zero, in order for the maximin access to be greater than zero, must be chosen as a seed. In this case, since , regardless of which seeds are chosen, there is some set such that . Therefore the maximum min access is no more than . Without loss of generality, no vertex corresponding to an element need be chosen as a seed. Otherwise, the seed may be moved to any vertex corresponding with a set such that . The maximin access cannot go down, because we still have .
If there is a set cover, then the maximum min access is at least : choose the vertices corresponding to the cover for the seeds (plus ), in which case , because they are either seeds or distance one from , and , because they are distance one from a seed. If there is no set cover, then there is no way to choose the seeds amongst the such that all vertices are within distance one from a seed. Assume that every element is contained in at most two subsets amongst the (this is now the Vertex Cover problem, an NP-hard special case of Set Cover). So there must be some such that . Thus when , i.e. , any algorithm that maximizes the min access chooses the set cover if there is one. So any algorithm that has an approximation ratio strictly better than must in fact be exact, and therefore also find the set cover.
Even in the general case of Set Cover, we can still distinguish between when there is and is not a set cover: The existence of a set cover still means the maximin probability is at least , while the lack of a set cover implies there is at least one vertex with probability no more than , which is upper-bounded by when . Therefore, since set cover is -inapproximable, we cannot approximate the best seeds to add without an additional -factor seeds. ∎
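The graph used in this reduction is simple to build programmatically. The following sketch constructs it from a Set Cover instance as an adjacency dictionary of out-neighbors. The function name `build_reduction_graph` and the node labels `("root",)`, `("set", j)`, and `("elem", e)` are hypothetical names chosen here for illustration. Only the graph is built; the seed budget and the decision threshold from the proof are not encoded.

```python
def build_reduction_graph(subsets, ground_set):
    """Directed graph for the Set Cover reduction, as {node: [out-neighbors]}.

    One vertex per subset, one per ground-set element, an edge from a subset
    vertex to each element it contains, and one extra root vertex with an edge
    to every subset vertex.
    """
    root = ("root",)
    adj = {root: []}
    for e in ground_set:
        adj[("elem", e)] = []            # element vertices have no out-edges
    for j, subset in enumerate(subsets):
        set_node = ("set", j)
        adj[root].append(set_node)
        adj[set_node] = [("elem", e) for e in sorted(subset)]
    return adj

# Illustrative instance: ground set {1, 2, 3} with subsets {1, 2} and {2, 3}.
graph = build_reduction_graph([{1, 2}, {2, 3}], {1, 2, 3})
for node, out in graph.items():
    print(node, "->", out)
```

The root has no incoming edges in this construction, which matches the observation in the proof that it has to be chosen as a seed for the minimum probability to be positive.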
Moreover, if we can find the seeds that maximize the minimum probability, even approximately, we can boost this result to also compute the minimum probability itself approximately. This serves as additional evidence that this problem is hard, as there is no known method to even approximately compute the minimum probability.
Proposition 4.2.
If there is an -approximation algorithm for maximin access, there is an -approximation for the minimum access of a vertex in a graph given a seed . That is, if the minimum access is in , then we can give an estimate such that
[TABLE]
Proof.
Given an instance , we construct a graph similar to the one in Figure 5. (We may assume that is connected.) If the diameter of is , add to a simple undirected path of length starting from , and call it . Call the end of this path . In , , which means that if we compute the single optimal seed in , it must be on the path from to .
Define so that , i.e. . Then the optimal placement for a seed is at distance from , where , because for any we have and , where denotes probabilities in .
Suppose that we have a -approximation algorithm for maximin access, and it chooses some seed distance from (we may assume that the seed is on the simple path, otherwise we may always choose ). Since it is a -approximation on a simple path, must be within of . Now we can approximate using as an estimate of : We estimate it as .
Then , and likewise , so this is within of .
∎
4.1. Maximin algorithms
The above results imply that it is hard to maximize the minimum access probability even approximately. Nonetheless, Theorem 4.1 still leaves open the possibility of an -approximation (for fixed number of seeds and ). In this section, we present the heuristics we will use, along with a few baselines. We will show in Section 4.2 that, unfortunately, these natural heuristics have a worst-possible approximation ratio (a ratio exponential in ). These results do not preclude good performance in practice, which we discuss in Section 5.
Making our task yet more challenging is that, unlike maximizing reach (Kempe et al., 2003), maximin is not a submodular objective. (Footnote 2: This can be seen using the construction in the proof of Proposition 4.4, starting with one seed in the center of a simple path. Adding one additional seed then does nothing, but adding two seeds increases the minimum probability.) Nonetheless, it is natural to try a greedy approach, where in each iteration we add to the seeds the vertex that maximizes the objective function. We refer to this heuristic as Greedy (Algorithm 1). To do this, we use the simple approach of estimating each probability for every possible vertex to add to the seed set. (See below for details on how we estimate these probabilities.)
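The greedy loop described above can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 1: the names `exact_access`, `greedy`, and `undirected` are hypothetical, and the Monte Carlo estimate is swapped for a brute-force enumeration of live-edge outcomes, which is exact but only feasible on tiny graphs. Edges are assumed to be directed pairs that each transmit independently with probability `alpha`.

```python
from itertools import product


def exact_access(n, edges, seeds, alpha):
    """Exact access probabilities under independent cascade (assumed model).

    Enumerates all 2**len(edges) live/dead outcomes of the directed edges,
    so it is only usable on very small graphs.
    """
    probs = [0.0] * n
    for live in product((False, True), repeat=len(edges)):
        weight = 1.0
        adj = [[] for _ in range(n)]
        for (u, v), on in zip(edges, live):
            weight *= alpha if on else 1.0 - alpha
            if on:
                adj[u].append(v)
        # vertices reachable from the seeds along live edges
        reached, stack = set(seeds), list(seeds)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in reached:
                    reached.add(v)
                    stack.append(v)
        for v in reached:
            probs[v] += weight
    return probs


def greedy(n, edges, k, alpha, seeds=()):
    """Add k seeds, each time taking the vertex that maximizes the minimum access."""
    seeds = list(seeds)
    for _ in range(k):
        candidates = [v for v in range(n) if v not in seeds]
        best = max(
            candidates,
            key=lambda v: min(exact_access(n, edges, seeds + [v], alpha)),
        )
        seeds.append(best)
    return seeds


def undirected(pairs):
    """Helper: represent each undirected edge as two directed edges."""
    return [(u, v) for u, v in pairs] + [(v, u) for u, v in pairs]
```

For instance, on a five-vertex undirected path with `alpha = 0.5`, `greedy(5, undirected([(0, 1), (1, 2), (2, 3), (3, 4)]), 1, 0.5)` selects the central vertex, since that choice keeps both endpoints within two hops of the seed.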
We contrast this with a faster approach, which we call Myopic (Algorithm 2): in each round, we choose the vertex with the currently smallest probability as the new seed, without evaluating the new value of the objective function.
We also consider a naïve variation (Naïve Myopic, Algorithm 3) which, instead of proceeding in rounds, given initial estimates for the probabilities, picks for the seeds the vertices with the smallest probabilities.
So far, we have omitted how to estimate the probabilities for each vertex. Unfortunately, computing the probability for each vertex exactly is #P-hard (Provan and Ball, 1983). Even computing probabilities of receiving the information with a guaranteed approximation ratio is a long-standing open problem (Karger, 1999). So in this paper, we use a Monte Carlo method: we simulate the IC model a fixed number of times and estimate the probability for each vertex as the fraction of simulations in which the information reached that vertex (Algorithm 4). Of course, this requires having at least one seed, which is not the case in the first round of Myopic and Naïve Myopic. So we always choose the first seed to be the vertex with the highest degree. This approach for dealing with the first round, as well as for estimating the probabilities, provides a simple way to compare these heuristics. As such, for the experiments we also choose the highest-degree vertex as the first seed for the Greedy heuristic, again to simplify comparison. We leave other approaches to these issues for future work.
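A minimal Python sketch of the Monte Carlo estimate, together with the Myopic and Naïve Myopic selection rules built on it, might look as follows. The function names, the adjacency-dictionary representation, the default number of runs, and the convention that `k` counts the total number of seeds including the first highest-degree seed are all assumptions made for illustration; the paper's Algorithms 2 to 4 may differ in detail.

```python
import random


def simulate_ic(adj, seeds, alpha, rng):
    """One independent cascade run; returns the set of vertices reached."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            # each newly active vertex gets one chance per inactive neighbour
            if v not in active and rng.random() < alpha:
                active.add(v)
                frontier.append(v)
    return active


def estimate_access(adj, seeds, alpha, runs, rng):
    """Fraction of simulated cascades in which each vertex is reached."""
    hits = {v: 0 for v in adj}
    for _ in range(runs):
        for v in simulate_ic(adj, seeds, alpha, rng):
            hits[v] += 1
    return {v: hits[v] / runs for v in adj}


def highest_degree(adj):
    return max(adj, key=lambda v: len(adj[v]))


def myopic(adj, k, alpha, runs=1000, seed=0):
    """Each round: re-estimate, then seed the currently least-reached vertex."""
    rng = random.Random(seed)
    seeds = [highest_degree(adj)]
    while len(seeds) < k:
        p = estimate_access(adj, seeds, alpha, runs, rng)
        seeds.append(min((v for v in adj if v not in seeds), key=p.get))
    return seeds


def naive_myopic(adj, k, alpha, runs=1000, seed=0):
    """Estimate once, then take the k - 1 least-reached vertices in one shot."""
    rng = random.Random(seed)
    first = highest_degree(adj)
    p = estimate_access(adj, [first], alpha, runs, rng)
    rest = sorted((v for v in adj if v != first), key=p.get)
    return [first] + rest[: k - 1]
```

The only difference between the two rules is whether the estimate is refreshed after every added seed, which is why Naïve Myopic needs a single round of simulation while Myopic needs one per seed.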
An alternative approach that avoids estimating probabilities is to pick seeds that are far from each other, under the intuition that a node far away from the current seeds is likely to have a small probability of receiving the information and therefore should be picked as the next seed. The resulting heuristic is to pick in each round the node that is furthest from the current set of seeds as the next seed; we call this heuristic Gonzalez because of its resemblance to the well-known algorithm for k-center clustering (Gonzalez, 1985).
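The farthest-first rule behind Gonzalez can be sketched with a multi-source breadth-first search. This is an illustrative Python version under stated assumptions: distance is measured in hops along out-edges, ties are broken by iteration order, and the names `distances_from` and `gonzalez` as well as the explicit `first` seed argument are hypothetical.

```python
from collections import deque


def distances_from(adj, sources):
    """Multi-source BFS: hop distance from the nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def gonzalez(adj, k, first):
    """Repeatedly seed the vertex farthest from the current seed set."""
    seeds = [first]
    while len(seeds) < k:
        dist = distances_from(adj, seeds)
        # vertices that no seed can reach count as infinitely far away
        far = max(adj, key=lambda v: dist.get(v, float("inf")))
        if dist.get(far, float("inf")) == 0:
            break  # every vertex is already a seed
        seeds.append(far)
    return seeds
```

No cascade simulation is involved, so each round costs a single graph traversal.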
One could choose other proxies for the utility such as nodes of low degree (or high degree), or nodes that do not contain seeds in a fixed radius ball around them. In our experiments with these heuristics, they were dominated in both quality and performance by the ones mentioned above, and we will not discuss them further.
4.2. Approximation ratios of maximin algorithms
We now show that Myopic, Naïve Myopic, Greedy, and an exact version of Gonzalez all have approximation ratios that are exponential in , even if we assume the probabilities required by Myopic, Naïve Myopic, and Greedy can be estimated exactly. This is to emphasize that their poor worst-case behavior stems not just from the difficulty of approximating the probabilities given a seed set, but from the heuristics themselves.
4.2.1. Myopic and Naïve Myopic
Note that in the case , Myopic and Naïve Myopic are identical algorithms. Thus we can show that in this case, both algorithms behave poorly in the worst case, even in the non-trivial case when we start with at least one initial seed.
Proposition 4.3.
Given a graph and non-zero initial seed set, choosing as the seed with smallest yields a solution with approximation ratio no better than .
Proof.
Consider the graph depicted in Figure 3. If we are allowed to add only a single additional seed besides the initial seed set , then this algorithm will choose to add either or , because in they minimize , where . But since we can only reach one of the two, we still have . But the optimal vertex to add to the seed set is , where now . Then we get an approximation ratio no better than . ∎
4.2.2. Greedy
We now consider what happens if Greedy is used to choose the seeds. One problem with Myopic was that, as demonstrated via Figure 3, choosing the vertex with the smallest probability ignores the actual objective function (which in that example is maximized by choosing vertex ). What happens when we attempt to maximize the actual objective function? Again, we assume that for any seed set we are given the exact probabilities instead of approximate probabilities, which we refer to as a probability oracle.
Proposition 4.4.
Greedy, even with a probability oracle, has an approximation ratio no better than .
Proof.
Consider the simple undirected path of length , with no initial seeds, where we may add seeds. The greedy algorithm, in the first iteration, must choose the central vertex (assume is even). In the second iteration, no vertex can increase the minimum probability, so the minimum probability is . However, the optimal minimum probability is much larger: If the two seeds trisect the path so that they are apart, then no vertex is distance more than from a seed, in which case the minimum probability is at least . ∎
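The gap in this argument can be checked numerically on a small instance. The Python sketch below is an illustration with concrete numbers chosen here, not taken from the paper: a path on 13 vertices with transmission probability 0.5. It relies on the observation that on a path a vertex can only be reached through its nearest seed on each side, and that the two directions use disjoint edges, so the access probability has a closed form. All function names are hypothetical.

```python
from itertools import combinations


def path_access(n, seeds, alpha):
    """Exact access probabilities on an undirected path 0, 1, ..., n - 1."""
    probs = []
    for v in range(n):
        left = [v - s for s in seeds if s <= v]
        right = [s - v for s in seeds if s >= v]
        miss = 1.0  # probability that neither side delivers the information
        if left:
            miss *= 1.0 - alpha ** min(left)
        if right:
            miss *= 1.0 - alpha ** min(right)
        probs.append(1.0 - miss)
    return probs


def min_access(n, seeds, alpha):
    return min(path_access(n, seeds, alpha))


def greedy_on_path(n, k, alpha):
    """Greedy with an exact probability oracle (the closed form above)."""
    seeds = []
    for _ in range(k):
        candidates = [v for v in range(n) if v not in seeds]
        seeds.append(
            max(candidates, key=lambda v: min_access(n, seeds + [v], alpha))
        )
    return seeds


def best_pair(n, alpha):
    """Brute-force optimum over all pairs of seeds."""
    return max(combinations(range(n), 2), key=lambda s: min_access(n, s, alpha))
```

With `n = 13` and `alpha = 0.5`, greedy first takes the central vertex 6, after which no second seed can lift the far endpoint above 0.5**6 = 1/64. The brute-force optimum instead attains a minimum access of 1/8, for example with seeds at vertices 3 and 9.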
4.2.3. Minimax distance
Gonzalez is a heuristic to minimize the maximum distance of any vertex from a seed. One motivation behind this algorithm is that in Figure 3, adding an edge from to in takes care of the issues found with Myopic by ensuring that all vertices have distance no more than two from the seed. In general, minimizing the maximum distance exactly is difficult, but even if we could do so, this approach still has a bad approximation ratio.
To show this, we construct a (sparse, max degree two) graph where nonetheless the vertex furthest away from the seed still has a relatively high probability of receiving the information. This is the case for , shown in Figure 4, that’s sufficiently sparse but is large.
Lemma 4.5.
The probability of being infected in , where each edge has weight , is .
Proof.
Denote by level the vertices distance from , and by symmetry, the probability of being infected at that level . We want to calculate . Note and a variant of the logistic map.
Then . Note . Unwinding the recurrence, we get , and in particular we have i.e. . ∎
Proposition 4.6.
The algorithm that minimizes the maximum distance from a seed has approximation ratio when .
Proof.
Suppose we can choose at most one seed in , shown in Figure 5. Minimizing the max distance means the seed we use is , and for sufficiently large the minimum probability is , at least for (using the previous lemma). However, the optimal seed to use is , where is a vertex distance from . Under this seed set, remains the vertex with the minimum probability of getting infected so long as, for some constant , (again using the previous lemma). Solving for to maximize the minimum probability, we get . Then the approximation ratio is no better than , and finally note has edges, vertices, and the maximum in-degree (and out-degree) is two. ∎
Despite the hardness results of this section, we will show in the next section that these algorithms perform well in practice.
5. Experiments
Our experimental evaluation investigates the following questions: does maximizing the maximin objective create real changes in access? Is this different from the interventions achieved via maximum reach? And how effective are the proposed strategies for optimizing it? Since our goal in this paper is to introduce and validate a method for reducing access gaps, we will not focus on achieving the fastest implementations (although we will compare the efficiency of different heuristics).
5.1. Experimental procedure
For our evaluation, we used social networks sourced from the SNAP (Leskovec and Krevl, 2014) and ICON (Clauset et al., 2016) repositories as described in Table 1.
The maximin objective is a stringent objective function: it minimally requires having at least one seed in every connected component to achieve non-zero minimum probability, which may require a large number of added seeds if, for example, there are many disconnected nodes. Since the access gap is maximally large if there is at least one seed and a vertex with zero probability of receiving the information, we assume that the number of added seeds is large enough to cover all components of the graph. This allows us to add seeds to each component of the graph separately. As a simplifying assumption, in the experiments, we only consider the case (in directed graphs) when the components are strongly connected. In particular, rather than running the heuristics on all of the components, we just use the largest strongly connected component of the graph.
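The restriction to the largest strongly connected component can be sketched as a preprocessing step. The Python below uses an iterative version of Kosaraju's two-pass algorithm; this is one standard choice, assumed here for illustration, and the paper does not say how the component was extracted. The function name `largest_scc` is hypothetical, and every vertex is assumed to appear as a key of the adjacency dictionary.

```python
def largest_scc(adj):
    """Return the subgraph induced by the largest strongly connected component."""
    # Pass 1: record vertices in order of DFS completion on the original graph.
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    break
            else:
                # all out-neighbours explored: u is finished
                order.append(u)
                stack.pop()

    # Build the reversed graph.
    radj = {v: [] for v in adj}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)

    # Pass 2: explore the reversed graph in decreasing finish order;
    # each exploration collects exactly one strongly connected component.
    best, assigned = set(), set()
    for root in reversed(order):
        if root in assigned:
            continue
        comp, stack = {root}, [root]
        assigned.add(root)
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if v not in assigned:
                    assigned.add(v)
                    comp.add(v)
                    stack.append(v)
        if len(comp) > len(best):
            best = comp

    # Keep only edges with both endpoints inside the chosen component.
    return {u: [v for v in adj[u] if v in best] for u in best}
```

Running the heuristics on the returned subgraph guarantees that a single seed gives every remaining vertex a non-zero probability of receiving the information.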
We also varied our intervention size between and , independent of the size of the graph. This is a typical number of seeds used for interventions in the literature, and considering the application – recommending a job position – is a practical intervention size. We varied – the probability of message transmission across an edge – in the range . (Footnote 3: We report results for for brevity. Behavior below this range was similar.) Above this range, information spreads so effectively that all algorithms are indistinguishable. Below this range, the utilities obtained are small enough that it is hard for Monte Carlo estimation to distinguish between them. We run simulations in order to estimate probabilities for any given seed set and repeat each heuristic times, reporting the average result.
As a baseline, we used the algorithm TIM+ (Tang et al., 2014b), which was designed to optimize maximum reach. While this procedure is not a true baseline (it does not directly optimize the maximin objective), it provides insight into how existing methods for maximum reach might work in this newer setting. We also use as a baseline picking the seeds uniformly at random (which we will refer to as Random).
5.2. Maximin and network structure
In practice, what are the effects of using maximin over max reach as the objective? We give evidence that when maximizing reach instead of using maximin, interventions end up strongly reflecting the existing structure of the network. That is, vertices are more likely to become seeds if they are close to the center of the network, where probabilities of receiving the information are already high and do not need as many additional interventions.
We show this by measuring the correlation between the probability of receiving information before intervention versus after intervention. We use as a simple proxy for ‘before intervention’ the probabilities when the vertex with the highest degree is the sole seed. Figure 6(a) shows the correlation between these two sets of probabilities in the Arenas graph, and indeed the correlation is significantly higher when using TIM+ than when using Myopic.
Assuming every vertex is equally deserving of information, we do not want ‘well-positioned’ vertices to have an advantage simply because they are well-positioned. Thus, we look at the correlation between the probability of information access after intervention and a few other proxies for position in a network. Figures 6(b) and 6(c) show the results for the degrees of the vertices as well as their distances from the center of the graph. Using TIM+, as the distance decreases towards the center or the degree of the node increases, the probabilities of information access increase, leading to a larger (negative) correlation. Again, this effect is lessened by using Myopic, whose resulting probabilities correlate less than TIM+ with both the degree of the vertex and the distance from the center. In other words, Myopic reduces the correlation between vertices’ probability of receiving the information and how well connected the vertices are. Naïve Myopic yields very similar results to Myopic, as again seen in Figure 6.
In addition, Myopic changes the distribution of probabilities. Not only does it decrease the number of vertices with very low probability of receiving the information, but it also increases the number of vertices with larger probabilities over a broad range of probabilities, as seen in Figure 7.
5.3. Heuristic performance
We now study the behavior of the heuristics described in the previous section. We would like to know how they compare in terms of effectiveness (maximizing the minimum access probability) and speed.
We present effectiveness results in Figure 8. We omitted the heuristic Greedy when experimenting with larger data sets because it was prohibitively slow. Note that in both charts, the Myopic and Naïve Myopic heuristics consistently outperform the other methods for all ranges of and intervention size . The heuristics that do not use estimation are all consistently poor performers, and TIM+ performs well but is consistently dominated. For the smaller data sets, shown in Figure 8(a), Greedy also does fairly well.
The running time of the heuristics is summarized in Table 2, which shows there is a natural tradeoff between running time and effectiveness. In particular, while the methods that make no use of estimation yield poorer quality results, they run extremely fast because they avoid the expensive step of estimating probabilities. Among the heuristics that estimate probabilities, Naïve Myopic is the fastest, with TIM+ also comparable, while the Myopic heuristic is an order of magnitude more expensive. Greedy is still another order of magnitude slower than Myopic, making it prohibitively expensive to compute in even relatively small graphs.
5.4. Performance on max reach
While the goal of the introduced heuristics is to maximize the minimum information access, it is also valuable to measure them by their average reach to see if they are effective at spreading information to a large number of vertices. We compare the performance of Naïve Myopic and Myopic to TIM+ on this measure over three datasets (see Figure 9). The results show that while Naïve Myopic does not perform well at maximizing reach, Myopic appears to outperform TIM+ even though TIM+ was designed for average reach and Myopic was not. This is likely because each seed added by Myopic is guaranteed to increase reach on the graphs, while algorithms that focus on maximizing reach may inadvertently provide access to nodes already reached. However, recall that Myopic is much slower than TIM+ (see Table 2), so this potential improvement comes at a cost. This tradeoff between average and minimum reach seems worthy of further study.
References
- kon (2017). 2017. U. Rovira i Virgili network dataset – KONECT. http://konect.uni-koblenz.de/networks/arenas-email
- Arthur et al. (2009). David Arthur, Rajeev Motwani, Aneesh Sharma, and Ying Xu. 2009. Pricing Strategies for Viral Marketing on Social Networks. In Internet and Network Economics, 5th International Workshop, WINE 2009, Rome, Italy, December 14-18, 2009. Proceedings. 101–112.
- Benthall and Haynes (2019). Sebastian Benthall and Bruce D. Haynes. 2019. Racial categories in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 289–298.
- Boyd et al. (2014). Danah Boyd, Karen Levy, and Alice Marwick. 2014. The networked nature of algorithmic discrimination. Data and Discrimination: Collected Essays. Open Technology Institute (2014).
- Chen et al. (2009). Wei Chen, Yajun Wang, and Siyu Yang. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009. 199–208.
- Clauset et al. (2016). Aaron Clauset, Ellen Tucker, and Matthias Sainz. 2016. The Colorado Index of Complex Networks. https://icon.colorado.edu/
- Coleman (1988). James S. Coleman. 1988. Social capital in the creation of human capital. American Journal of Sociology 94 (1988), S95–S120.
