Efficient Approximation Algorithms for Adaptive Seed Minimization
Jing Tang, Keke Huang, Xiaokui Xiao, Laks V.S. Lakshmanan, Xueyan Tang, Aixin Sun, and Andrew Lim

TL;DR
This paper introduces ASTI, an efficient adaptive seed minimization algorithm that selects seed nodes in multiple batches to influence a target number of users in social networks, with proven approximation guarantees.
Contribution
The paper presents the first adaptive seed minimization algorithm with provable approximation guarantees and practical efficiency, outperforming existing non-adaptive methods.
Findings
ASTI achieves near-optimal influence with fewer seed nodes.
ASTI runs in expected polynomial time, scalable to large networks.
Experimental results show ASTI outperforms competing algorithms in effectiveness and efficiency.
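Table 1: Frequently used notations.
| Notation | Description |
|---|---|
| $G=(V,E)$ | a graph with node set $V$ and edge set $E$ |
| $n, m$ | the number of nodes and edges in $G$ |
| $\eta$ | the threshold for the targeted number of nodes to be activated |
| $I(S), \mathbb{E}[I(S)]$ | the spread of a seed set $S$ and its expectation |
| $\Gamma(S), \mathbb{E}[\Gamma(S)]$ | the truncated spread of $S$ and its expectation |
| $G_i$ | the $i$-th residual graph, where $G_1 = G$ |
| $n_i, m_i$ | the number of nodes and edges in $G_i$ |
| $\eta_i$ | the shortfall in activating $\eta$ nodes in the $i$-th round, i.e., $\eta_i = \eta - (n - n_i)$ |
| $I(S \mid S_{i-1})$ | the marginal spread of $S$ on top of $S_{i-1}$, i.e., the spread of $S$ in $G_i$ |
| $\Gamma(S \mid S_{i-1})$ | the marginal truncated spread of $S$ on top of $S_{i-1}$, i.e., $\min\{I(S \mid S_{i-1}), \eta_i\}$ |
| $\tilde{\Gamma}(S)$ | a binary estimator with value $\eta$ if $S \cap R \neq \emptyset$ and $0$ otherwise |
| $R, \mathcal{R}$ | a random mRR-set and a set of mRR-sets |
| $\Lambda_{\mathcal{R}}(S)$ | the number of mRR-sets in $\mathcal{R}$ covered by $S$ |
| | the optimal node maximizing , , and , respectively |
| | the optimum of , i.e., |
| $\phi, \Phi, \Omega$ | a specific realization, a random realization, and the realization space |
| $\pi, \pi^{*}$ | a random policy, and an optimal policy |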
| Dataset | $n$ | $m$ | Type | Avg. deg. | LWCC size |
|---|---|---|---|---|---|
| NetHEPT | 15.2K | 31.4K | undirected | 4.18 | 6.80K |
| Epinions | 132K | 841K | directed | 13.4 | 119K |
| Youtube | 1.13M | 2.99M | undirected | 5.29 | 1.13M |
| LiveJournal | 4.85M | 69.0M | directed | 28.5 | 4.84M |
| Model | Dataset | 0.01 | 0.05 | 0.1 | 0.15 | 0.2 |
|---|---|---|---|---|---|---|
| IC Model | NetHEPT | N/A | 40.8% | 43.8% | 43.0% | 43.7% |
| IC Model | Epinions | N/A | N/A | 50.7% | N/A | 65.7% |
| IC Model | Youtube | 0.0% | 24.3% | N/A | 37.5% | 41.7% |
| IC Model | LiveJournal | N/A | 43.0% | 34.9% | N/A | 33.0% |
| LT Model | NetHEPT | N/A | N/A | N/A | 44.3% | 47.5% |
| LT Model | Epinions | N/A | N/A | N/A | N/A | N/A |
| LT Model | Youtube | 0.0% | 39.5% | 54.1% | N/A | 47.9% |
| LT Model | LiveJournal | N/A | N/A | N/A | N/A | N/A |
Jing Tang, Dept. of Ind. Syst. Engg. and Mgmt., National University of Singapore
Keke Huang, School of Comp. Sci. and Engg., Nanyang Technological University
Xiaokui Xiao, School of Computing, National University of Singapore
Laks V.S. Lakshmanan, Department of Computer Science, University of British Columbia
Xueyan Tang, School of Computer Science and Engineering, Nanyang Technological University
Aixin Sun, School of Computer Science and Engineering, Nanyang Technological University
Andrew Lim, Dept. of Ind. Syst. Engg. and Mgmt., National University of Singapore
Abstract.
As a dual problem of influence maximization, the seed minimization problem asks for the minimum number of seed nodes to influence a required number $\eta$ of users in a given social network $G$. Existing algorithms for seed minimization mostly consider the non-adaptive setting, where all seed nodes are selected in one batch without observing how they may influence other users.
In this paper, we study seed minimization in the adaptive setting, where the seed nodes are selected in several batches, such that the choice of a batch may exploit information about the actual influence of the previous batches. We propose a novel algorithm, ASTI, which addresses the adaptive seed minimization problem in $O\big(\frac{\eta\cdot(m+n)}{\varepsilon^{2}}\ln n\big)$ expected time and offers an approximation guarantee of $\frac{(\ln \eta+1)^{2}}{(1-(1-1/b)^{b})(1-1/e)(1-\varepsilon)}$ in expectation, where $\eta$ is the targeted number of influenced nodes, $b$ is the size of each seed node batch, and $\varepsilon \in (0,1)$ is a user-specified parameter. To the best of our knowledge, ASTI is the first algorithm that provides such an approximation guarantee without incurring prohibitive computation overhead. With extensive experiments on a variety of datasets, we demonstrate the effectiveness and efficiency of ASTI over competing methods.
Seed Minimization; Sampling; Approximation Algorithm
CCS Concepts: Information systems → Data mining; Information systems → Social advertising; Information systems → Social networks; Theory of computation → Probabilistic computation; Theory of computation → Submodular optimization and polymatroids
1. Introduction
Social networks are becoming increasingly popular for people to discuss and share their thoughts and comments towards public topics. Based on the established relations among individuals, ideas and opinions can be spread over social networks via a word-of-mouth effect. To exploit this effect for advertising, advertisers often provide free samples of their products to selected social network users, in exchange for those users to promote those products and create a cascade of influence to other users. In such a setting, advertisers might want to know the minimum number of free samples required to be given away, so as to draw sufficient attention. Goyal et al. (Goyal et al., 2013) are the first to formulate this problem as a seed minimization problem, which asks for the minimum number of seed nodes (i.e., users who receive free samples) needed to influence at least a required number of users, taking into account the randomness in the influence propagation process.
Existing work on seed minimization mostly focuses on the non-adaptive setting (Goyal et al., 2013; Zhang et al., 2014; Han et al., 2017), which requires that all seed nodes be selected in one batch without observing the actual influence of any node, i.e., no randomness in the influence propagation process can be removed until all seed nodes are fixed. As a consequence of the non-adaptiveness, these solutions may return a seed set that fails to influence at least $\eta$ nodes (the required number) in the actual propagation process, or may select an excessive number of seed nodes that generate an actual influence spread much larger than required.
To address the above issues, Vaswani and Lakshmanan (Vaswani and Lakshmanan, 2016) propose to consider seed minimization under the adaptive setting, where (i) the seed nodes are selected one by one, and (ii) before selecting the $i$-th seed node, the actual influence of the first $i-1$ seed nodes can be observed, i.e., we may optimize the choice of the $i$-th seed node to influence those users that have not been influenced by the previous seed nodes. Such an adaptive strategy ensures that (i) the seed set returned always achieves the required number of influenced users (since the actual influence of each seed node is known after selection), and (ii) the number of seed nodes would not be excessive (because we can stop selecting seed nodes as soon as the targeted influence is achieved). We note that similar adaptive approaches have also been adopted for other practical problems, such as influence maximization (Yadav et al., 2016), sensor placement (Asadpour et al., 2008), active learning (Chen and Krause, 2013), and object detection (Chen et al., 2014).
To our knowledge, the only existing solution for adaptive seed minimization is by Vaswani and Lakshmanan (Vaswani and Lakshmanan, 2016). As we discuss in Section 2.4, however, the solution in (Vaswani and Lakshmanan, 2016) requires that the expected influence of any seed set should be estimated with extremely high accuracy, which results in prohibitive computation overhead. Furthermore, the solution does not provide any non-trivial approximation guarantee, due to an ineffective approach used to select each seed node under the adaptive setting. Therefore, it remains an open problem to devise efficient approximation algorithms for adaptive seed minimization.
In this paper, we address the above open problem with ASTI, a novel framework tailored for adaptive seed minimization. The key idea of ASTI is to adaptively choose the seed node with the maximum expected truncated influence spread in each round of seed selection. Specifically, given a diffusion model that captures the uncertainty of influence propagation in a social network $G$, we consider the set of all possible realizations, each of which represents a possible scenario of influence propagation among the nodes in $G$. For each possible realization $\phi$, the influence spread of a seed set $S$, denoted as $I_{\phi}(S)$, is the number of nodes influenced by $S$, while the truncated influence spread of $S$ is defined as $\Gamma_{\phi}(S) = \min\{I_{\phi}(S), \eta\}$, where $\eta$ is the required number of influenced users. We consider $\Gamma_{\phi}(S)$ instead of $I_{\phi}(S)$ because, intuitively, the extra influence spread beyond $\eta$ is useless for fulfilling the requirement on influence. (In fact, as we show in Section 2.4, the extra influence spread may even lead to an incorrect choice of seed nodes, and hence, it has to be ignored.)
When developing algorithms under the ASTI framework, the key challenge that we face is the design of methods to accurately estimate a seed set $S$'s expected truncated influence spread over a given set of possible realizations. We show that existing methods (Huang et al., 2017; Nguyen et al., 2016; Tang et al., 2015, 2018b, 2014; Borgs et al., 2014) for estimating un-truncated influence spread cannot be applied in our truncated setting, since they are unable to take into account the effect of truncation by $\eta$. Motivated by this, we propose a novel sampling method based on the concept of multi-root reverse reachable (mRR) sets, and prove that our method provides non-trivial guarantees in terms of the efficiency and accuracy of truncated influence estimation. Building upon this sampling method, we develop TRIM, an algorithm for maximizing truncated influence spread with a provable approximation guarantee of $(1-1/e)(1-\varepsilon)$. We show that instantiating ASTI using TRIM leads to strong theoretical guarantees for adaptive seed minimization, and TRIM can be extended into a batched version TRIM-B that selects a batch of nodes in each round, so as to accelerate seed selection.
In summary, we make the following contributions:
- **ASTI, a general framework.** We analyze the characteristics of adaptive seed minimization, based on which we propose a general framework ASTI tailored for the problem.
- **mRR-set, a novel sampling method.** ASTI requires accurate estimation of truncated influence spreads, for which the existing sampling methods are either inefficient or ineffective. To address this challenge, we propose a novel sampling method, mRR, which is able to estimate the truncated influence spread in a cost-effective manner.
- **TRIM, an efficient algorithm for truncated influence maximization.** A key step of ASTI is to identify a set of nodes with the maximum expected truncated influence spread, for which we propose the TRIM algorithm based on mRR-sets. With a rigorous theoretical analysis, we show that ASTI instantiated by TRIM returns a $\frac{(\ln \eta+1)^{2}}{(1-1/e)(1-\varepsilon)}$-approximate solution for adaptive seed minimization with expected time complexity of $O\big(\frac{\eta\cdot(m+n)}{\varepsilon^{2}}\ln n\big)$.
- **TRIM-B, the batched version of TRIM.** For further performance gain, we extend TRIM into a batched version TRIM-B that selects a batch of $b$ seed nodes in each round, for a predefined batch size $b$. ASTI instantiated by TRIM-B provides an approximation guarantee of $\frac{(\ln \eta+1)^{2}}{(1-(1-1/b)^{b})(1-1/e)(1-\varepsilon)}$ with the same time complexity as TRIM.
- **An extensive set of experiments.** We experimentally evaluate ASTI instantiated by TRIM and TRIM-B against the state-of-the-art non-adaptive algorithm ATEUC (Han et al., 2017), and show that (i) our solutions are much more effective in minimizing the number of seed nodes needed and ensuring that the required influence spread is achieved, and (ii) our solutions are able to efficiently handle social networks with millions of nodes and edges.
2. Preliminaries
This section formally defines the problem of adaptive seed minimization, and reviews the existing solutions. Table 1 summarizes the notations that are frequently used. For ease of exposition, our discussions focus on the independent cascade (IC) model (Kempe et al., 2003), which is one of the most widely adopted propagation models in the literature. But we note that our algorithms can be easily extended to other propagation models, such as the linear threshold model (Kempe et al., 2003) and the topic-aware models (Barbieri et al., 2012).
2.1. Influence Propagation and Realization
Let $G=(V,E)$ be a social network with a node set $V$ and a directed edge set $E$, where $|V|=n$ and $|E|=m$. For any edge $\langle u,v\rangle \in E$, we refer to $u$ as an incoming neighbor of $v$, and $v$ as an outgoing neighbor of $u$. Each edge $\langle u,v\rangle$ is associated with a propagation probability $p(u,v)$. We refer to such a social network as a probabilistic social network.
Given a node set $S \subseteq V$, the influence propagation initiated by $S$ under the independent cascade (IC) model (Kempe et al., 2003) is modeled as a discrete-time stochastic process as follows. At time slot $t_0$ (the subscript indicates the index of the time slot), all nodes in $S$ are activated while all other nodes are inactive. Suppose that node $u$ is first activated at slot $t_i$; then $u$ has one chance to activate each inactive outgoing neighbor $v$ with probability $p(u,v)$ at slot $t_{i+1}$, after which $u$ remains active. This influence propagation process continues until no more inactive nodes can be activated.
The linear threshold (LT) model demands that, for each node $v$, the propagation probabilities of all edges ending at $v$ sum up to no more than $1$. Given a node set $S$, the LT model follows a similar discrete-time stochastic procedure. At time slot $t_0$, each node $v$ is assigned a threshold $\lambda_v$ sampled uniformly from $[0,1]$, and only the nodes in $S$ are activated. At time slot $t_i$ ($i \geq 1$), for each inactive node $v$, we check whether the sum of the propagation probabilities of its incoming edges from activated neighbors is no smaller than $\lambda_v$. If it is, then $v$ is activated; otherwise $v$ remains inactive. This influence propagation process terminates once no further node is activated.
Let $I(S)$ be the total number of active nodes in $G$ when the influence propagation terminates. We refer to $S$ as the seed set, and $I(S)$ as the spread of $S$.
Alternatively, the influence propagation process can also be described by the live edge procedure (Kempe et al., 2003). Specifically, for each edge $\langle u,v\rangle \in E$, we independently flip a coin of head probability $p(u,v)$ to decide whether the edge is live or blocked to generate a sample of influence propagation. All the blocked edges are removed and the remaining graph is referred to as a realization of the probabilistic social network $G$, denoted as $\phi$. Note that there are $2^{m}$ distinct possible realizations. Let $\Omega$ be the set of all possible realizations (i.e., the sample space) such that $|\Omega| = 2^{m}$, and let $\phi \sim \Omega$ denote that $\phi$ is a realization randomly sampled from $\Omega$. Given a realization $\phi$, the spread of any seed set $S$ under $\phi$ is the total number of nodes that are reachable from $S$ in $\phi$, denoted as $I_{\phi}(S)$. Thus, for any seed set $S$, its expected spread $\mathbb{E}[I(S)]$ is defined as
$$\mathbb{E}[I(S)] = \mathbb{E}_{\Phi\sim\Omega}[I_{\Phi}(S)] = \sum_{\phi\in\Omega} I_{\phi}(S)\cdot p(\phi), \tag{1}$$
where $p(\phi)$ is the probability for realization $\phi$ to occur. In other words, the expected spread of $S$ is the (weighted) average spread over all the realizations in $\Omega$.
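The live edge procedure and the expected spread can be illustrated with a minimal Python sketch. It estimates the expected spread by Monte Carlo averaging over sampled realizations rather than by enumerating the whole realization space. The function names, the edge-dictionary representation, and the toy graph are assumptions made for illustration and do not come from the paper.

```python
import random
from collections import deque


def sample_realization(edges, rng):
    """Flip one coin per edge (IC model) and keep only the live edges.

    `edges` maps (u, v) -> propagation probability (assumed representation).
    Returns an adjacency map containing live edges only.
    """
    live = {}
    for (u, v), p in edges.items():
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live


def spread(live, seeds):
    """Number of nodes reachable from the seed set in one realization."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in live.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)


def expected_spread(edges, seeds, samples=10000, seed=0):
    """Monte Carlo estimate of the expected spread of `seeds`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        total += spread(sample_realization(edges, rng), seeds)
    return total / samples


# Hypothetical toy graph: a -> b is always live, b -> c is live half the time,
# so the exact expected spread of {a} is 2 + 0.5 = 2.5.
toy_edges = {("a", "b"): 1.0, ("b", "c"): 0.5}
estimate = expected_spread(toy_edges, {"a"})
```

On the toy graph the estimate should be close to the exact value of 2.5; the truncated spread for a threshold `eta` would simply replace `spread(...)` with `min(spread(...), eta)` inside the average.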
2.2. Adaptive Seed Minimization
Given a probabilistic social network $G$ and a threshold $\eta$, the seed minimization problem aims to select a minimum number of seed nodes to influence at least $\eta$ nodes. In the conventional “non-adaptive” setting, seed minimization requires selecting a node set $S$ such that $\mathbb{E}[I(S)] \geq \eta$, without any knowledge of the realization that would occur in the actual influence propagation process. As a consequence, the selected $S$ may influence fewer than $\eta$ nodes for some realizations or many more than $\eta$ nodes for some other realizations, both of which are undesirable scenarios.
Meanwhile, the adaptive strategy (i.e., a recursive select-observe-select procedure) has been shown to be more effective than the non-adaptive strategy (i.e., selecting based on the model only) in many real-world applications (Asadpour et al., 2008; Chen and Krause, 2013; Chen et al., 2014). Specifically, an adaptive strategy first selects a node $u$ from graph $G$, and then observes the set of nodes activated by choosing $u$ as a seed node. Based on this observation, the strategy chooses the next node as one that could influence as many currently inactive nodes as possible. This procedure is carried out in a recursive manner, until at least $\eta$ active nodes are observed.
Figure 1 illustrates the adaptive strategy. Figures 1(a) and 1(b) show a social graph and one possible realization of , respectively. Let and be the actual realization of influence propagation (which is unknown apriori). Figure 1(c) indicates that we first select node (in dark gray) as a seed node. Note that node influences nodes and (in light gray), with each bold (resp. dashed) arrow denoting a successful (resp. failed) step of influence. In addition, the thin arrows in Figures 1(c)–1(d) correspond to influence attempts which are not yet revealed. Since the number of nodes influenced by is less than , we continue to select the second seed node. Figure 1(d) shows that we select , which results in a total of active nodes, reaching the threshold . Then, the adaptive seed selection process terminates.
In this paper, we aim to study seed selection strategies (referred to as policies) for adaptive seed minimization (ASM), which is formally defined as follows:
Definition 2.1 (Adaptive Seed Minimization).
Given a probabilistic social graph $G$ and a threshold $\eta$, the adaptive seed minimization problem aims to identify a policy $\pi$ that minimizes the expected number of seed nodes required to achieve an influence spread of at least $\eta$ on all possible realizations $\phi \in \Omega$, i.e.,
$$\min_{\pi}\ \mathbb{E}_{\Phi\sim\Omega}\big[|S(\pi,\Phi)|\big] \quad \text{s.t.} \quad I_{\phi}(S(\pi,\phi)) \geq \eta \ \text{ for all } \phi \in \Omega, \tag{2}$$
where $S(\pi,\phi)$ is the seed set selected by $\pi$ under realization $\phi$ and $\Phi$ is a random realization sampled from $\Omega$.
Note that when the propagation probability of every edge in $G$ is $1$, ASM reduces to the deterministic version of seed minimization, which is shown to be NP-hard (Goyal et al., 2013). Therefore, finding an optimal policy for ASM is also NP-hard.
2.3. Truncated Influence Spread
Note that, in ASM, the influence spread in excess of the threshold $\eta$ has no value. Accordingly, we introduce the notion of truncated influence spread as follows.
Definition 2.2 (Truncated Influence Spread).
Given a seed set $S$ and a threshold $\eta$, the truncated influence spread of $S$ under a realization $\phi$ is the smaller one between $I_{\phi}(S)$ and $\eta$, i.e.,
$$\Gamma_{\phi}(S) := \min\{I_{\phi}(S), \eta\}. \tag{3}$$
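Recall that ASM requires considering the influence spreads of nodes when the actual influence of some other nodes has been observed. Therefore, we also introduce the notion of marginal truncated influence spread as follows. Let $V_1 = V$ and $G_1 = G$. Let $V_i$ be the subset of nodes that remain inactive after round $i-1$, and $G_i$ be the subgraph of $G$ induced by $V_i$. We refer to $G_i$ as the $i$-th residual graph. For example, in Figure 1, after round $1$, $V_2$ contains only the nodes that remain inactive, and $G_2$ denotes the subgraph induced by them, which contains the remaining thin edge.
Let $S_{i-1}$ be the set of nodes selected as seeds by a policy in the first $i-1$ rounds. Similar to the definition of $\Omega$, we denote $\Omega_i$ as the set of all possible realizations in the $i$-th round. Then, for a node set $S \subseteq V_i$, we define the marginal spread $I_{\phi}(S \mid S_{i-1})$ as the additional spread that $S$ provides on top of $S_{i-1}$ under realization $\phi$, and define the truncated marginal spread $\Gamma_{\phi}(S \mid S_{i-1})$ accordingly, i.e.,
$$I_{\phi}(S \mid S_{i-1}) := I_{\phi}(S \cup S_{i-1}) - I_{\phi}(S_{i-1}), \qquad \Gamma_{\phi}(S \mid S_{i-1}) := \Gamma_{\phi}(S \cup S_{i-1}) - \Gamma_{\phi}(S_{i-1}). \tag{4}$$
Note that $I_{\phi}(S \mid S_{i-1})$ is exactly the influence spread of $S$ in the residual graph $G_i$ under realization $\phi$.
Let $n_i$ be the number of nodes in $G_i$, i.e., $n - n_i$ nodes have been activated by the end of round $i-1$, based on the partial realization revealed so far. Define $\eta_i = \eta - (n - n_i)$. This is the amount by which the policy falls short of the target at the beginning of round $i$. Before reaching the threshold $\eta$, i.e., while $\eta_i > 0$, we have $\Gamma_{\phi}(S_{i-1}) = I_{\phi}(S_{i-1}) = n - n_i$, so we can rewrite $\Gamma_{\phi}(S \mid S_{i-1})$ as
$$\Gamma_{\phi}(S \mid S_{i-1}) = \min\{I_{\phi}(S \cup S_{i-1}), \eta\} - I_{\phi}(S_{i-1}) = \min\{I_{\phi}(S \mid S_{i-1}), \eta_i\}. \tag{5}$$
Then, $\Gamma_{\phi}(S \mid S_{i-1})$ can be easily computed in the residual graph $G_i$. For brevity, we define $\Gamma_{\phi}(v \mid S_{i-1}) := \Gamma_{\phi}(\{v\} \mid S_{i-1})$ for a singleton node set $\{v\}$.
Finally, we define the expected marginal truncated spread as
$$\Delta(v \mid S_{i-1}) := \mathbb{E}_{\Phi\sim\Omega_i}\big[\Gamma_{\Phi}(v \mid S_{i-1})\big]. \tag{6}$$
In other words, the expected marginal truncated spread of a node $v$ is defined based on the “lift” in the expected number of active nodes that $v$ brings on top of previously selected seeds, over all realizations consistent with what has been observed in previous rounds.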
2.4. Existing Solutions
Golovin and Krause (Golovin and Krause, 2017) study the adaptive stochastic minimum cost coverage problem, which can be regarded as a variant of ASM in the case where there exists an oracle that accurately reports the expected marginal truncated spread for any given seed set. They propose to adopt a greedy policy as follows. First, select the node $u_1$ with the largest expected truncated spread, i.e., $\mathbb{E}[\Gamma(u_1)] \geq \mathbb{E}[\Gamma(v)]$ for all $v \in V$. Then, observe the actual nodes that are activated by $u_1$ during the stochastic process, and remove them from $G$ to induce the residual graph $G_2$. After that, identify the node $u_2$ with the maximum expected marginal truncated spread in the residual graph $G_2$. This process continues, such that each round selects the node with the largest expected marginal truncated spread, until we observe that no fewer than $\eta$ nodes have been influenced.
Golovin and Krause (Golovin and Krause, 2017) show that the above greedy policy returns a $(\ln \eta+1)^{2}$-approximate solution to the optimum. (Golovin and Krause claim that the approximation guarantee is $\ln \eta+1$ in an earlier version of their work (Golovin and Krause, 2011), but point out that the proof has gaps in a revised version (Golovin and Krause, 2017). Whether the logarithmic bound holds is an interesting open problem.) This approximation guarantee, however, does not lead to a practical algorithm for the ASM problem, because (i) it requires the help of an oracle to exactly identify the node with the maximum expected marginal truncated spread in each round, but (ii) computing the exact expected spread of any node set is #P-hard (Chen et al., 2010a).
Motivated by this observation, Vaswani and Lakshmanan (Vaswani and Lakshmanan, 2016) attempt to extend Golovin and Krause’s method by replacing the oracle with a spread estimator with bounded errors. In particular, they assume that for any node set $S$, the estimation of the marginal gain should satisfy
[TABLE]
where the bound is expressed through the multiplicative error allowed in calculating the marginal gains. Unfortunately, this requirement on the spread estimation is so stringent that no existing method for influence estimation could fulfill it without incurring prohibitive estimation overhead. To explain, suppose that the expected marginal spread of a node on top of the previously selected seeds is small. In that case, Equation (7) would only allow a trivial amount of estimation error, which is rather difficult to achieve by existing methods for spread estimation.
In addition, the algorithm in (Vaswani and Lakshmanan, 2016) attempts to select the node with the largest marginal spread in each round, instead of the node with the maximum marginal truncated spread. As a consequence, even when there exists an efficient estimator that provides highly accurate spread estimation, the algorithm in (Vaswani and Lakshmanan, 2016) would still fail to achieve the type of approximation guarantee in (Golovin and Krause, 2017), since the theoretical analysis in (Golovin and Krause, 2017) is based on the notion of truncated spreads. We illustrate this issue with an example.
Example 2.3.
Consider Figure 2(a), which shows a social graph with four nodes and four directed edges. The number on each edge indicates the propagation probability of the edge. has four possible realizations , , , and in total, as shown in Figures 2(b)–2(e). Each realization has an equal probability of to happen. Assume that . Then, the expected spread of node is , which is larger than that of the other three nodes. Thus, when the vanilla expected spread is adopted as the measure, node will be selected as the first seed node. On realizations , , and , is qualified to influence at least users. However, there is a probability of that happens, in which case can only influence itself, and hence, one additional seed node is required. Overall, seed nodes are selected in expectation.
Now observe that the expected truncated spread of nodes , , , and are , , , and , respectively. Therefore, when the expected truncated spread is adopted as the measure, either or is selected as the first seed node, which can influence users under all four realizations. This demonstrates that, for ASM, choosing nodes based on expected truncated spreads is more effective than that based on vanilla expected spreads.
In recent work (Han et al., 2018), Han et al. study the problem of adaptive influence maximization, which also considers the adaptive setting, but aims to identify a predefined number of seed nodes that could influence the maximum number of users in $G$ in expectation. At first glance, it may seem that we can modify the adaptive influence maximization algorithms to solve the adaptive seed minimization problem, in the same way that existing work (Goyal et al., 2013) transforms non-adaptive influence maximization algorithms to address non-adaptive seed minimization. This approach, however, does not work because the algorithm in (Han et al., 2018) is designed based on vanilla expected marginal spreads, whereas ASM requires considering truncated expected marginal spreads, as we previously discussed. As a consequence, the algorithm in (Han et al., 2018) cannot be adopted in our setting.
3. Our Solution
3.1. Algorithmic Framework
We propose a general framework, referred to as ASTI, to address the ASM problem. Algorithm 1 shows the details. Given a probabilistic social graph $G$ and a threshold $\eta$, ASTI aims to return a seed set $S$ whose truncated influence spread $\Gamma_{\phi}(S)$ (i.e., the smaller one of the threshold $\eta$ and the number of active nodes influenced by $S$) reaches $\eta$ under the actual realization $\phi$. In a nutshell, ASTI iteratively (i) selects the node that maximizes the expected marginal truncated spread, (ii) observes the newly influenced nodes, and then (iii) updates the corresponding information, namely the seed set, the residual graph, and the remaining shortfall. The process stops when at least $\eta$ nodes are activated.
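The select-observe-update loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1: `select_node` is a hypothetical stand-in for the truncated influence maximization step, `observe` is a hypothetical stand-in for the real-world feedback revealing which nodes a seed activates, and the toy realization is invented for the example.

```python
from collections import deque


def asti(nodes, eta, select_node, observe):
    """Adaptive seed selection loop (illustrative sketch).

    select_node(residual_nodes, shortfall) -> next seed (assumed interface)
    observe(seed, residual_nodes) -> set of nodes activated by the seed,
        including the seed itself (assumed interface)
    """
    residual = set(nodes)          # nodes of the current residual graph
    n = len(residual)
    seeds = []
    shortfall = eta                # eta_i = eta - (n - n_i)
    while shortfall > 0 and residual:
        u = select_node(residual, shortfall)
        activated = observe(u, residual)
        seeds.append(u)
        residual -= activated      # induce the next residual graph
        shortfall = eta - (n - len(residual))
    return seeds


def make_observer(live):
    """Build an `observe` callback from a fixed realization (live edges).

    The realization is hidden from the policy; only the activations caused
    by each chosen seed are revealed.
    """
    def observe(u, residual):
        seen, queue = {u}, deque([u])
        while queue:
            x = queue.popleft()
            for y in live.get(x, []):
                if y in residual and y not in seen:
                    seen.add(y)
                    queue.append(y)
        return seen
    return observe


# Hypothetical realization on nodes 1..6 with live edges 1->2, 2->3, 4->5.
toy_live = {1: [2], 2: [3], 4: [5]}
# Placeholder policy: pick the smallest remaining node id (not TRIM).
seeds = asti(range(1, 7), 4, lambda residual, shortfall: min(residual),
             make_observer(toy_live))
print(seeds)  # [1, 4]
```

In the toy run, seed 1 activates three nodes, leaving a shortfall of 1, and seed 4 then pushes the number of active nodes to five, so the loop stops after two seeds. Any node-selection rule can be plugged in through `select_node`; the approximation guarantee discussed next depends on how well that rule approximates the maximum expected marginal truncated spread.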
The key step of ASTI is truncated influence maximization, which aims to identify a node that maximizes the expected marginal truncated spread. If an $\alpha$-approximate solution for truncated influence maximization is obtained in each round, ASTI provides a non-trivial approximation guarantee, as shown in the following theorem.
Theorem 3.1.
Suppose $\pi$ is an $\alpha$-approximate greedy policy, for some $\alpha \in (0,1]$, i.e., for any round $i$ and any $S_{i-1}$, it selects a node $u_i$ satisfying
$$\Delta(u_i \mid S_{i-1}) \geq \alpha \cdot \max_{v \in V_i} \Delta(v \mid S_{i-1}),$$
where $\Delta(v \mid S_{i-1})$ denotes the expected marginal truncated spread of $v$ on top of $S_{i-1}$. Then $\pi$ achieves an approximation ratio of $\frac{(\ln \eta+1)^{2}}{\alpha}$ to the optimal adaptive seed minimization policy.
The proof of Theorem 3.1 is based on adaptive submodular optimization (Golovin and Krause, 2017); the formal proofs of all theoretical results are given in Appendix B. Theorem 3.1 requires that the policy $\pi$ be an $\alpha$-approximate greedy one with respect to the expected marginal truncated spread. The challenge in designing such an $\alpha$-approximate greedy policy lies in how to develop a proper sampling method for estimating the truncated influence spread.
3.2. Truncated Influence Maximization
According to Theorem 3.1, in order to provide the theoretical guarantee, the algorithm is supposed to identify a node whose truncated marginal spread is an $\alpha$-approximation to the maximum truncated marginal spread in each round. At first glance, it seems that we can utilize Borgs et al.’s reverse influence sampling method (Borgs et al., 2014). Unfortunately, in what follows, we show that Borgs et al.’s sampling method (Borgs et al., 2014) fails to estimate the truncated influence spread accurately.
Specifically, Borgs et al. (Borgs et al., 2014) propose to generate random reverse reachable (RR) sets for influence maximization. Compared with Monte-Carlo simulation (Kempe et al., 2003), RR-sets can dramatically accelerate the seed selection process while retaining the same approximation guarantees for influence maximization (Borgs et al., 2014). In particular, a random RR-set $R$ of $G$ is generated by first selecting a node $v \in V$ uniformly at random, and then taking the nodes that can reach $v$ in a random realization $\phi$. Evidently, a random RR-set is a subgraph of the corresponding random realization $\phi$, and it can be generated by performing a stochastic reverse breadth first search (BFS) starting from the random node $v$. A random RR-set is an unbiased spread estimator, i.e., for any seed set $S$,
$$\mathbb{E}[I(S)] = n \cdot \Pr[S \cap R \neq \emptyset].$$
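As an illustration of the RR-set estimator just described, the following Python sketch generates random RR-sets by a reverse BFS with lazy coin flips and estimates the expected spread of a seed set as $n$ times the fraction of RR-sets it intersects. It assumes the IC model; the function names, the edge-dictionary representation, and the toy graph are assumptions for illustration, not the implementation of Borgs et al.

```python
import random
from collections import deque


def build_in_edges(edges):
    """Index edges by their head: v -> list of (u, p) for each edge (u, v)."""
    incoming = {}
    for (u, v), p in edges.items():
        incoming.setdefault(v, []).append((u, p))
    return incoming


def random_rr_set(nodes, incoming, rng):
    """Sample one RR-set: nodes that reach a uniformly random root.

    Each incoming edge of a visited node is examined once, and its coin is
    flipped at that moment (lazy sampling of the realization).
    """
    root = rng.choice(nodes)       # `nodes` must be a list
    rr, queue = {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u, p in incoming.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr


def estimate_spread(nodes, edges, seeds, samples=20000, seed=0):
    """Estimate E[I(S)] as n * Pr[S intersects a random RR-set]."""
    rng = random.Random(seed)
    seeds = set(seeds)
    incoming = build_in_edges(edges)
    hits = 0
    for _ in range(samples):
        if seeds & random_rr_set(nodes, incoming, rng):
            hits += 1
    return len(nodes) * hits / samples


# Hypothetical toy graph: exact expected spread of {a} is 2.5.
toy_nodes = ["a", "b", "c"]
toy_edges = {("a", "b"): 1.0, ("b", "c"): 0.5}
estimate = estimate_spread(toy_nodes, toy_edges, {"a"})
```

On the toy graph, node `a` is in the RR-set with probability $(1 + 1 + 0.5)/3$, so the estimate should be close to 2.5, matching the exact expected spread.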
Unfortunately, RR-sets fail to estimate the truncated influence spread accurately. Intuitively, an RR-set $R$ estimates the spread of $S$ as $n$ if $S \cap R \neq \emptyset$ and as $0$ otherwise, so the expectation of the corresponding estimator for the truncated influence spread of $S$ is
$$\mathbb{E}\big[\min\{n \cdot \mathbb{1}[S \cap R \neq \emptyset], \eta\}\big] = \eta \cdot \Pr[S \cap R \neq \emptyset] = \frac{\eta}{n} \cdot \mathbb{E}[I(S)].$$
Recall that the true expected truncated influence spread is
$$\mathbb{E}[\Gamma(S)] = \mathbb{E}\big[\min\{I(S), \eta\}\big].$$
Obviously, for any non-empty $S$, unless $\eta = n$ or $S$ always activates all $n$ nodes,
$$\frac{\eta}{n} \cdot \mathbb{E}[I(S)] < \mathbb{E}\big[\min\{I(S), \eta\}\big].$$
Specifically, consider the case that $I_{\phi}(S) \leq \eta$ for all $\phi \in \Omega$. Then, the true value is $\mathbb{E}[I(S)]$, whereas this estimator is biased with a discount of $\eta/n$, which is extremely inaccurate when $\eta \ll n$. In practice, $\eta$ is likely to be a fraction of $n$, since even a set of ten thousand seed nodes has been found to influence less than half of the population on many datasets (Nguyen et al., 2017). These facts indicate that RR-sets are highly biased for estimating truncated influence spread. As a consequence, the state-of-the-art algorithms (Huang et al., 2017; Nguyen et al., 2016; Tang et al., 2015, 2018b, 2014) for influence maximization that utilize RR-sets (Borgs et al., 2014) cannot provide theoretical guarantees for truncated influence maximization. In turn, this means that these algorithms cannot be fashioned to solve ASM with approximation guarantees.
To address this issue, we propose a novel sampling approach that generates multi-root reverse reachable (mRR) sets, which can estimate the truncated influence spread efficiently and effectively. The algorithm utilizing mRR-sets is referred to as TRIM (TRuncated Influence Maximization). We rigorously show that TRIM can provide strong theoretical guarantees for truncated influence maximization and thus ASTI instantiated with TRIM is guaranteed to approximate ASM within a constant ratio.
3.3. Multi-Root Reverse Reachable Set
If we generate $n$ correlated RR-sets such that (i) they start from $n$ distinct nodes, and (ii) the materialization of each edge is consistent in all the RR-sets, then merging these RR-sets (with duplicates removed) as well as the edge statuses forms a realization sample. Based on this observation, if we generate $k$ correlated RR-sets using the same rule, then merging them as a $k$-root RR-set is likely to estimate the truncated influence spread more accurately than a vanilla RR-set. To explain how a multi-root reverse reachable (mRR) set works, we first introduce its definition.
Definition 3.2 (Random mRR-set).
Let $\phi$ be a random realization of $G$ sampled from the realization space $\Omega$ and $K \subseteq V$ be a size-$k$ node set selected uniformly at random from $V$. A random mRR-set $R$ is the set of nodes in $\phi$ that can reach $K$. (That is, for each node $u$ in the mRR-set, there is a directed path in $\phi$ from $u$ to some node in $K$.)
By definition, the key difference between an mRR-set and an RR-set is that the former has multiple roots whereas the latter has only a single root. Similar to the generation of RR-sets, a random mRR-set $R$ can be generated as follows:
1. Choose a set $K \subseteq V$ of $k$ nodes uniformly at random;
2. Perform a stochastic reverse breadth first search (BFS) that starts from $K$ and follows the incoming edges of each node. Insert into $R$ all nodes that are traversed during the stochastic BFS.
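The two generation steps above can be sketched in Python as follows, assuming the IC model and a fixed root count $k$ supplied by the caller. The function names and the edge-dictionary representation are assumptions for illustration and are not taken from the paper's implementation.

```python
import random
from collections import deque


def build_in_edges(edges):
    """Index edges by their head: v -> list of (u, p) for each edge (u, v)."""
    incoming = {}
    for (u, v), p in edges.items():
        incoming.setdefault(v, []).append((u, p))
    return incoming


def random_mrr_set(nodes, incoming, k, rng):
    """Sample one mRR-set with k roots.

    Step 1: choose k distinct roots uniformly at random.
    Step 2: run a single stochastic reverse BFS from all roots at once, so
            every edge is flipped at most once and all roots share the same
            (lazily sampled) realization.
    """
    roots = rng.sample(nodes, k)   # `nodes` must be a list with len >= k
    mrr, queue = set(roots), deque(roots)
    while queue:
        v = queue.popleft()
        for u, p in incoming.get(v, []):
            if u not in mrr and rng.random() < p:
                mrr.add(u)
                queue.append(u)
    return mrr


# Hypothetical toy graph: a path a -> b -> c whose edges are always live.
toy_nodes = ["a", "b", "c"]
toy_incoming = build_in_edges({("a", "b"): 1.0, ("b", "c"): 1.0})
sample = random_mrr_set(toy_nodes, toy_incoming, 2, random.Random(0))
```

Running a single BFS from all roots, rather than $k$ independent single-root searches, is what keeps the edge statuses consistent across roots, as required by the definition of an mRR-set.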
A natural question is how to decide the size of for truncated spread estimation. The setting of yields a tradeoff between efficiency and accuracy in that a larger provides more accurate estimation but takes more computational resources. Through the aforementioned analysis of RR-sets, we find that the high efficiency of RR-sets comes from their “binary” property. In particular, a random RR-set estimates the influence spread of any node set as if , and as 0 otherwise. To avoid maintaining the edge statuses, our mRR-set estimator shall retain this binary property. That is, it estimates the truncated influence spread of as if and only if intersects this mRR-set, and as 0 otherwise. For a given -RR-set , if a node , then can reach at least one of the starting nodes. Then, its influence spread is estimated to be at least and thus its estimated truncated influence spread is at least . By setting , the estimated truncated influence spread is .
On the other hand, to improve the accuracy, should be set as large as possible. So we choose . However, is not an integer in general. To address this issue, we adopt a randomized rounding approach. To generate an mRR-set, we randomly choose a set of nodes such that its size equals with probability , and equals otherwise. Then, the expectation of is . However, we note that when , the possible value of the estimated truncated influence spread is no longer binary (i.e., 0 or ). To address this new challenge, we define an estimator as if and only if , and otherwise. At first glance, the relationship between and seems unclear. Fortunately, the following theorem shows that under the above setting of such that , the ratio of and is in the range of .
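The randomized rounding of the root size can be sketched as follows. The formulas were lost from the text above, so the target expectation `n / eta` used here is an assumption (it matches the residual-graph quantities named in Corollary 3.4), and the function name is hypothetical.

```python
import math
import random

def sample_root_size(n, eta, rng):
    """Return floor(n/eta) or floor(n/eta) + 1 with expectation n/eta.

    Assumption: the intended expected root size is n / eta.
    """
    ratio = n / eta
    base = math.floor(ratio)
    frac = ratio - base  # fractional part, in [0, 1)
    # Round up with probability equal to the fractional part, so that
    # E[size] = base + frac = n / eta.
    return base + 1 if rng.random() < frac else base
```

Any two-point distribution on the neighbouring integers with this expectation must round up with probability equal to the fractional part, so the sketch does not depend on further details of the paper.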
Theorem 3.3.
Let and be the integer and fractional part of , respectively. For any mRR-set, if we sample nodes such that with probability and otherwise, then
[TABLE]
Theorem 3.3 states that is a biased but sufficiently accurate estimator of the expected truncated influence spread . In fact, this estimator also works for any residual graph . Specifically, let be the estimated truncated spread of in with respect to , the lowered target corresponding to graph . Recall that . We have the following corollary.
Corollary 3.4.
In the residual graph , let and be the integer and fractional part of , respectively. For each mRR-set, if we sample nodes such that with probability and otherwise, then
[TABLE]
Furthermore, for any two sets , it holds that
[TABLE]
Now, we can construct a -approximate greedy policy using the estimator built upon mRR-sets.
Remark. It is worth pointing out that our randomized rounding approach for choosing is critical for achieving the above approximation bound. Specifically, if we fix to be , following the proof methodology of Theorem 3.3, we may derive that the ratio of to will be in the range of . On the other hand, if we fix to be , the ratio of to will be in the range of . Both settings yield much coarser bounds than our setting that uses a smart randomized rounding approach.
3.4. The Design of TRIM
Algorithm 2 presents the details of TRIM that can return a -approximate solution for truncated influence maximization for any input graph and error threshold . TRIM is similar in spirit to OPIM-C which is the state-of-the-art algorithm for influence maximization (Tang et al., 2018b). Specifically, OPIM-C uses two disjoint groups of random RR-sets, among which one group is used to derive the solution and the other is used to verify its quality. We customize TRIM by utilizing one group of mRR-sets, which would be more efficient for selecting a singleton seed set as pointed out in (Huang et al., 2017). In a nutshell, TRIM starts from a small number of mRR-sets and iteratively increases the mRR-set number until a satisfactory solution is identified. Next, we discuss the details of TRIM.
In the mRR-set sampling stage (Lines 2 and 2), each mRR-set is started from a random set of nodes whose size is an independent random number. Recall that is with probability and otherwise. Given a set of random mRR-sets, we say that a node covers an mRR-set if , and we define the coverage of in , denoted as , as the number of mRR-sets in that are covered by . Based on the mRR-sets generated, TRIM identifies the node that covers the largest number of mRR-sets in (Line 2). Let be the optimal node such that . Then, is bounded by . According to Lemma A.2 in Appendix A, with high probability, (Line 2) is a lower bound on the expected coverage of in , which indicates that
[TABLE]
Similarly, with high probability, (Line 2) is an upper bound on the expected coverage of in . Thus,
[TABLE]
In addition, by Equation (11) in Corollary 3.4, we know that
[TABLE]
Combining Equations (12)–(14), we can derive a quantitative relationship between and such that with high probability
[TABLE]
Therefore, the final guarantee is . Note that in our stopping condition of (Line 2), we use (defined in Line 2) to correct the error on Equations (12) and (13) (with low failure probability). This proves the approximation ratio of .
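The coverage computation at the heart of TRIM can be sketched as below. This is only an illustration of the node selection step; it omits the concentration bounds and the stopping condition of Algorithm 2. The scaling of the estimate by `eta` is an assumption inferred from the binary estimator of Section 3.3, and both function names are hypothetical.

```python
from collections import Counter

def best_covering_node(mrr_sets):
    """Return (node, coverage) for the node covering the most mRR-sets.

    mrr_sets is a non-empty list of Python sets of node ids.
    """
    coverage = Counter()
    for r in mrr_sets:
        coverage.update(r)  # every node in r covers this mRR-set once
    return max(coverage.items(), key=lambda kv: kv[1])

def estimate_truncated_spread(count, num_sets, eta):
    """Assumed estimator: eta times the fraction of mRR-sets covered."""
    return eta * count / num_sets
```

For example, with the four mRR-sets `{1, 2}`, `{2, 3}`, `{2}`, and `{4}`, node 2 covers three of them, and with `eta = 100` its estimated truncated spread under this assumption is 75.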
3.5. Theoretical Analysis
Before we proceed to the theoretical analysis, we first present the hardness of ASM.
Lemma 3.5.
Given a probabilistic social network with and a threshold , for any , adaptive seed minimization cannot be approximated within a ratio of in polynomial time unless .
Approximation Guarantee. Theorem 3.1 indicates that any -approximation greedy policy could achieve an approximation ratio of . We examine the potential of TRIM to serve the role of such a policy. To cope with the randomness of seed selection algorithms (due to sampling), we use the notion of expected approximation guarantee, which considers the average case. We first obtain the approximation ratio of TRIM for each round of seed selection.
Lemma 3.6.
For the -th round of seed selection in , TRIM returns a -approximate solution to the optimum. (Footnote: Here, -approximation indicates that , which is required by Theorem 3.1 for a randomized algorithm through a detailed check of the proof of Theorem 40 in (Golovin and Krause, 2017).)
Combining Theorem 3.1 and Lemma 3.6, we obtain the approximation guarantee of ASTI.
Theorem 3.7.
ASTI with the instantiation of TRIM achieves an expected approximation ratio of .
Time Complexity. The time complexity of TRIM is dominated by the procedure for generating mRR-sets. Intuitively, this is based on (i) how much time is used for generating a random mRR-set, and (ii) how many mRR-sets are generated. In what follows, we show their relationship. In particular, for the -th round of seed selection in , let (resp. ) be the optimum (resp. optimal node) of , i.e., . (Note that maximizes , maximizes , and maximizes .) We first show the expected time used for generating a random mRR-set in the following lemma.
Lemma 3.8.
For the -th round of seed selection in , the expected time complexity for generating a random mRR-set is $O\big(\frac{\operatorname{OPT}_i}{\eta_i} m_i\big)$.
Now, we present the following lemma that gives the expected number of mRR-sets generated by TRIM. The proof is similar to that of OPIM-C (Tang et al., 2018b).
Lemma 3.9.
For the -th round of seed selection in , the expected number of mRR-sets generated by TRIM is $O\big(\frac{\eta_i \ln n_i}{\varepsilon^{2} \operatorname{OPT}_i}\big)$. (Footnote: In general, it is $O\big(\frac{\eta_i \ln(n_i/\varepsilon)}{\varepsilon^{2} \operatorname{OPT}_i}\big)$. Here, we assume that $\varepsilon \in \Omega\big(\frac{1}{\operatorname{poly}(n_i)}\big)$.)
Finally, we provide the expected time complexity of TRIM in the following lemma.
Lemma 3.10.
For the -th round of seed selection in , TRIM achieves an expected time complexity of $O\big(\frac{m_i + n_i}{\varepsilon^{2}} \ln n_i\big)$.
At first glance, the expected time complexity of TRIM is counterintuitive. In particular, the expected root size of in the -th round is increasing with . It seems that the time complexity of TRIM is more likely to increase with . However, Lemma 3.10 shows the opposite. This is due to either the residual graph being reduced significantly (Lemma 3.8) or the mRR-set size being reduced considerably (Lemma 3.9). Overall, the time complexity of TRIM in each round can be independent of the number of initially selected nodes. Since there are at most rounds in total, we can derive the expected time complexity of ASTI instantiated with TRIM.
Theorem 3.11.
ASTI with the instantiation of TRIM has an expected time complexity of $O\big(\frac{\eta\cdot(m+n)}{\varepsilon^{2}} \ln n\big)$.
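As an informal sanity check (not a substitute for the proofs), multiplying the per-set cost of Lemma 3.8 by the number of sets in Lemma 3.9 shows how the factors involving $\operatorname{OPT}_i$ and $\eta_i$ cancel, and summing the per-round bound of Lemma 3.10 over the rounds recovers Theorem 3.11. The sketch assumes at most $\eta$ rounds, $m_i \le m$, and $n_i \le n$, and it treats the product of two expectations heuristically; the additive $n_i$ term of Lemma 3.10 is not explained by this product.

```latex
\[
\underbrace{O\Big(\frac{\operatorname{OPT}_i}{\eta_i}\, m_i\Big)}_{\text{Lemma 3.8}}
\times
\underbrace{O\Big(\frac{\eta_i \ln n_i}{\varepsilon^{2}\operatorname{OPT}_i}\Big)}_{\text{Lemma 3.9}}
= O\Big(\frac{m_i \ln n_i}{\varepsilon^{2}}\Big),
\qquad
\sum_{i=1}^{\eta} O\Big(\frac{(m_i+n_i)\ln n_i}{\varepsilon^{2}}\Big)
\subseteq O\Big(\frac{\eta\,(m+n)}{\varepsilon^{2}}\ln n\Big).
\]
```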
4. Extensions
TRIM selects one node in each round until at least users are influenced. Therefore, the seed selection phase in ASTI instantiated by TRIM can be quite time consuming, because the marginal (truncated) spread of a singleton node set is potentially small, which may (i) require many rounds to achieve the target , and (ii) generate a large number of mRR-sets for constructing an -approximate solution in each round. To mitigate this overhead, we propose a batched version of TRIM, referred to as the TRIM-B (TRuncated Influence Maximization in the Batched model) algorithm, to accelerate the node selection process of ASTI.
4.1. Batched Version of TRIM
Algorithm 3 shows the details of the TRIM-B algorithm. TRIM-B generalizes TRIM by selecting a fixed number of seeds in each round, where is an input parameter to determine the batch size. Specifically, TRIM-B first generates a small number of random mRR-sets and then uses a greedy algorithm for maximum coverage (Vazirani, 2003) to identify a size- seed set to cover mRR-sets with an approximation guarantee of (Line 3). If meets the condition (Line 3), TRIM-B terminates; otherwise, the number of mRR-sets is doubled until a qualified is derived. Consequently, the approximation ratio of TRIM-B is . Note that when the batch size is , TRIM-B degenerates to TRIM.
The major differences in the design between TRIM-B and TRIM are as follows. First, in TRIM-B, the definitions of variables and are involved with and for generalization, as shown in Line 3 and Line 3, respectively. Second, to obtain the upper bound on the coverage of the optimal solution in , the coverage of is divided by (Line 3). Third, the ratio in the stop condition is updated to be (Line 3).
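The greedy maximum coverage step used by TRIM-B can be sketched as follows. This is a simple version that rescans all candidates in every iteration, not the linear-time implementation referenced in Section 4.2, and the function name and input format are assumptions made for the example.

```python
def greedy_max_coverage(mrr_sets, b):
    """Greedily pick up to b nodes; return (seeds, number of mRR-sets covered).

    mrr_sets is a list of Python sets of node ids.
    """
    # Inverted index: node -> indices of the mRR-sets that contain it.
    index = {}
    for i, r in enumerate(mrr_sets):
        for v in r:
            index.setdefault(v, set()).add(i)

    covered = set()
    seeds = []
    for _ in range(b):
        best, best_gain = None, 0
        for v, ids in index.items():
            gain = len(ids - covered)  # marginal coverage of v
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # every mRR-set is already covered
            break
        seeds.append(best)
        covered |= index[best]
    return seeds, len(covered)
```

With `b = 1` the procedure reduces to picking the single node of largest coverage, which mirrors the observation that TRIM-B degenerates to TRIM for a batch size of one.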
4.2. Theoretical Analysis
The theoretical analysis of TRIM-B can be obtained by generalizing the properties of TRIM.
Approximation Guarantee. To establish the overall approximation guarantee, we first analyze the approximation ratio of TRIM-B in each round of seed selection.
Lemma 4.1.
For the -th round of seed selection in , TRIM-B returns a -approximate solution, where .
Combining Theorem 3.1 and Lemma 4.1, we obtain the approximation guarantee of TRIM-B.
Theorem 4.2.
ASTI with the instantiation of TRIM-B achieves an expected approximation ratio of .
Remark. Note that there exists a gap between the optimal policy in the sequential model and the optimal policy in the batched model, which is known as the adaptivity gap (Golovin and Krause, 2017). The adaptivity gap quantifies the performance difference between the optimal adaptive policy and the optimal non-adaptive policy. To explain, a size- seed set is selected as a batch () in TRIM-B without observing the realization of any seed therein. This selection is a non-adaptive process compared to that of in TRIM. As a consequence, there exists an adaptivity gap between the two algorithms if the batch size . However, to the best of our knowledge, this adaptivity gap remains unknown in viral marketing applications, which makes it hard to quantify the difference between the optimal policy in the sequential model and that in the batched model. Meanwhile, the existing bound on the adaptivity gap of in (Chen and Krause, 2013) is not applicable to adaptive seed minimization. It holds only if the nodes in the social graph are independent, which, however, is not true.
Time Complexity. The time complexity of TRIM-B depends on three factors: (i) the time for generating a random mRR-set, (ii) the number of mRR-sets generated, and (iii) the time to derive a size- seed set. The expected time used for generating a random mRR-set is given in Lemma 3.8. We now show the number of mRR-sets generated.
Lemma 4.3.
For the -th round of seed selection in , the expected number of mRR-sets TRIM-B generates is $O\Big(\frac{\eta_i \ln\binom{n_i}{b}}{\varepsilon^{2} \operatorname{OPT}_{b,i}}\Big)$, where denotes the maximum expected truncated spread among all the size- seed sets in .
On the other hand, the greedy algorithm for identifying the size- seed set runs in time linear in the total size of its input (Vazirani, 2003), i.e., . Meanwhile, the total number of mRR-sets examined in all the iterations is at most twice that in the last iteration. According to Wald’s equation (Wald, 1947), the expected time complexity of the greedy procedure is , which is dominated by that for generating mRR-sets. Consequently, by Lemma 3.8 and Lemma 4.3, the expected time used in the -th round of TRIM-B is $O\big(\frac{b(m_i+n_i)\ln n_i}{\varepsilon^{2}}\big)$. There are at most rounds in total. Based on the analysis above, the expected time complexity of TRIM-B is given in the following theorem.
Theorem 4.4.
ASTI with the instantiation of TRIM-B achieves an expected time complexity of $O\big(\frac{\eta\cdot(m+n)}{\varepsilon^{2}} \ln n\big)$.
5. Additional Related Work
In Section 2.4, we have discussed the work (Vaswani and Lakshmanan, 2016) most related to ours. In what follows, we survey other relevant work in the literature.
Influence maximization, as the dual problem of seed minimization, seeks to identify a set of seed nodes with the maximum expected spread. Domingos and Richardson (Domingos and Richardson, 2001; Richardson and Domingos, 2002) are the first to study viral marketing from an algorithmic perspective. After that, Kempe et al. (Kempe et al., 2003) formulate the influence maximization problem and propose a greedy algorithm that returns -approximation for several influence diffusion models, by utilizing Monte Carlo simulations. Subsequently, there has been a large body of research on improved algorithms for influence maximization (Kim et al., 2013; Chen et al., 2010b, a, 2009; Goyal et al., 2011a; Jung et al., 2012; Wang et al., 2010; Leskovec et al., 2007; Kempe et al., 2003, 2005; Borgs et al., 2014; Tang et al., 2015, 2018b, 2014; Nguyen et al., 2016; Huang et al., 2017; Tang et al., 2017, 2018a; Galhotra et al., 2016; Arora et al., 2017; Cheng et al., 2014; Cohen et al., 2014; Goyal et al., 2011b; Zhou et al., 2013). Among them, some recent work (Borgs et al., 2014; Huang et al., 2017; Nguyen et al., 2016; Tang et al., 2015, 2018b, 2014) focuses on algorithms that ensure -approximations by utilizing the reverse influence sampling technique (Borgs et al., 2014).
Seed minimization, which has mainly been studied from the non-adaptive perspective, aims at finding a minimum-size set of seed nodes to achieve a given threshold of expected spread. Chen (Chen, 2009) investigates seed minimization under a variant of the linear threshold model, where each node is assigned a fixed threshold. Chen shows that the problem cannot be approximated within a ratio of unless , as the expected spread function under the fixed threshold model is not submodular. After that, Long and Wong (Long and Wong, 2011) study seed minimization under the widely used independent cascade and linear threshold models. Goyal et al. (Goyal et al., 2013) provide a bi-criteria approximation algorithm for seed minimization. Zhang et al. (Zhang et al., 2014) then improve the theoretical results by removing the bi-criteria restriction. However, the requirements of these algorithms are either impractical or extremely stringent, which makes these algorithms vastly ineffective in practice. Han et al. (Han et al., 2017) propose the ATEUC algorithm for non-adaptive seed minimization by utilizing reverse influence sampling for estimating the spreads of nodes. However, the expected time complexity of the algorithm is unknown, and its worst-case time complexity is prohibitively large. As we show in the experiments, our adaptive algorithm is more effective than these non-adaptive algorithms in terms of the number of seed nodes required.
Finally, there is a series of recent work (Vaswani and Lakshmanan, 2016; Horel and Singer, 2015; Badanidiyuru et al., 2016; Seeman and Singer, 2013; Han et al., 2018) that focuses on adaptive influence maximization. Recall that, as analyzed in Section 3.1, to construct approximate solutions for adaptive seed minimization, some approximation algorithms for truncated influence maximization are required. However, the algorithms for adaptive influence maximization generally aim at maximizing the influence spread in each round, and thus cannot provide theoretical guarantees for truncated influence maximization, as we point out in Section 3.2. As a consequence, techniques developed for adaptive influence maximization are inapplicable to the adaptive seed minimization problem. In addition, in the case of influence maximization, going adaptive does not boost the spread significantly, as confirmed by the experiments in (Han et al., 2018). In contrast, our experiments show that going adaptive provides a substantial advantage for seed minimization.
6. Experiments
This section evaluates the performance of the proposed algorithms against the state of the art. All the experiments are conducted on a Linux machine with an Intel Xeon 2.6GHz CPU and 64GB RAM. For fair comparison, we first randomly generate possible realizations for each dataset, and then measure the performance of each algorithm on those realizations and report the average performance.
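The evaluation protocol of fixing realizations in advance can be illustrated with a small sketch for the IC model. The function names and the edge-dictionary format are assumptions for the example; the paper's own evaluation code is not shown in the text.

```python
import random
from collections import deque

def sample_realization(edges, rng):
    """Sample one IC realization.

    edges maps a directed edge (u, v) to its probability p (hypothetical
    format). Returns the live edges as an adjacency dict u -> [v, ...].
    """
    live = {}
    for (u, v), p in edges.items():
        if rng.random() < p:  # edge is live with probability p
            live.setdefault(u, []).append(v)
    return live

def spread(live, seeds):
    """Number of nodes reachable from `seeds` through live edges."""
    active = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in live.get(u, []):
            if v not in active:
                active.add(v)
                queue.append(v)
    return len(active)
```

Sampling the realizations once and reusing them for every algorithm means that differences in the reported averages come from the seed choices rather than from different random outcomes.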
6.1. Experimental Setting
Datasets. The experiments are conducted on four datasets, i.e., NetHEPT, Epinions, Youtube, and LiveJournal. NetHEPT (Chen et al., 2009) represents the academic collaboration network of the "High Energy Physics - Theory" area. The other three are real-life social networks from (Leskovec and Krevl, 2014). Table 2 summarizes the details of the four datasets. Note that an undirected edge is transformed into two directed edges. There does not exist any isolated node in the four tested datasets. Furthermore, the number of nodes in the largest weakly connected component (LWCC) indicates that nodes are highly interconnected, especially for the three social networks. As shown in Figure 3, all four datasets have a power-law degree distribution. The largest dataset that has been used for adaptive seed minimization in the literature contains nodes and edges (Vaswani and Lakshmanan, 2016), which is far smaller than LiveJournal. To the best of our knowledge, LiveJournal with millions of nodes and edges is the largest dataset ever tested in adaptive seed minimization experiments.
Algorithms. We evaluate six algorithms: ASTI, ASTI-2, ASTI-4, ASTI-8, AdaptIM and ATEUC (Han et al., 2017). ASTI- is ASTI instantiated by TRIM-B with the batch sizes of . (Note that ASTI is the version with a batch size of .) AdaptIM is modified from the AdaptIM-1 method proposed in (Han et al., 2018) for the adaptive influence maximization problem. It iteratively runs a non-adaptive influence maximization algorithm (i.e., EPIC (Han et al., 2018)) to select the node that maximizes the expected marginal influence spread on the residual graphs, until the desired threshold is reached. AdaptIM differs from our ASTI algorithm in that it greedily selects the node to maximize the influence spread instead of the truncated influence spread. The batch size of AdaptIM is set to by default. As introduced in Section 5, ATEUC is the state of the art for the non-adaptive seed minimization problem. By comparing ASTI with ATEUC, we aim to demonstrate the advantage of adaptivity over non-adaptivity in terms of effectiveness. Meanwhile, three batched algorithms, i.e., ASTI-2, ASTI-4, ASTI-8, are compared with both ASTI and ATEUC to study how the batch size affects efficiency and effectiveness. For AdaptIM, we obtain the source code of AdaptIM-1 from the authors (Han et al., 2018) and make some necessary modifications (e.g., to the stop condition). For the other five algorithms, we implement them in C++ strictly following the algorithm description and compile them with the same optimization options.
Parameter Settings. In our experiments, all the algorithms are tested under both the Independent Cascade (IC) model and the Linear Threshold (LT) model. Following the common setting in the literature (Tang et al., 2014; Arora et al., 2017), we set the approximation parameter for the five adaptive algorithms. For those parameters in ATEUC, we use the values recommended in (Han et al., 2017). For each dataset, we set the edge probability where is the in-degree of node .
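For concreteness, the in-degree based edge probabilities can be computed as in the sketch below. The exact formula was lost from the sentence above, so the rule that an edge into `v` receives probability 1 divided by the in-degree of `v` is an assumption based on the common weighted-cascade convention, and the function name is hypothetical.

```python
from collections import defaultdict

def weighted_cascade(edges):
    """Map each directed edge (u, v) to 1 / in-degree(v).

    edges is a list of distinct (u, v) pairs. The 1 / in-degree rule is
    an assumption; see the note above.
    """
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    return {(u, v): 1.0 / indeg[v] for u, v in edges}
```

Under this convention the incoming probabilities of every node sum to one, so the same values can serve as edge weights in the LT model.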
The performance metrics measured include the number of seeds selected and the corresponding running time. To better understand the performance of the algorithms, we design the large setting of the threshold for NetHEPT, Epinions, and Youtube, i.e., , where is the number of nodes in the social network. Since around nodes are required on LiveJournal under the large setting, which is inconvenient to present, we use a tailored small setting, i.e., for LiveJournal.
6.2. Results under the IC model
Seed Size vs. Threshold. Figure 4 reports the number of seeds selected by the six algorithms for different thresholds under the IC model. As can be seen, ASTI selects far fewer seed nodes than ATEUC does, especially when the threshold becomes larger. In general, ATEUC selects around – more nodes than ASTI does on all the four datasets. In particular, with a threshold on dataset Epinions, ASTI selects seed nodes on average while ATEUC needs seed nodes (i.e., more nodes). For the sake of clarity, Table 3 shows the exact improvement ratio of ASTI over ATEUC on the number of seed nodes for the corresponding five thresholds under both the IC and LT models. Note that there exist many points (indicated by N/A) where the actual number of nodes activated by the seed set returned by ATEUC does not reach the required threshold under some realizations. This is because ATEUC selects a node set such that but may influence fewer than nodes under some realizations, whereas our adaptive algorithms always ensure that at least nodes are influenced by the returned node set under every realization. We shall explore this in more detail in Section 6.4. These facts support the superiority of adaptive algorithms over non-adaptive algorithms. We also observe that the number of nodes selected by AdaptIM is close to that of ASTI, which indicates that AdaptIM is empirically effective in seed minimization. However, it does not provide any approximation guarantees in terms of the number of nodes selected. Another interesting observation is that ASTI-2, ASTI-4, and ASTI-8 slightly increase the number of seed nodes selected compared with ASTI and still select far fewer nodes than ATEUC does in most cases. This confirms that adaptive algorithms, by utilizing the information of partial realizations, are more effective than non-adaptive algorithms.
Running Time vs. Threshold. Figure 5 presents the results of running time against the threshold under the IC model. As the results show, ATEUC runs faster than the other five adaptive algorithms on the four datasets when the threshold is large. The main reason is that adaptive algorithms involve multiple rounds of seed selection whereas only one round is required for non-adaptive algorithms. Observe that the running time of ATEUC generally decreases with the increase of the threshold , unlike the results of the five adaptive algorithms. The reason lies in the design of ATEUC. Specifically, ATEUC selects two seed set candidates and , which are taken as the upper bound and lower bound on the number of seed nodes in the optimal solution. Only when the condition is satisfied, the candidate set is returned as the solution; otherwise ATEUC will continue to refine and (Han et al., 2017). The larger the threshold, the more seed nodes are required, and the more easily this stop condition is met, which explains the unique running time pattern of ATEUC. We also observe that AdaptIM runs around – times slower than ASTI for all cases. Particularly, AdaptIM cannot finish within hours when under the IC model on the LiveJournal dataset (see Figure 5(d)). This demonstrates that AdaptIM is significantly inferior to ASTI in terms of computational overheads. The reason behind this is that ASTI selects the node to maximize the expected marginal truncated spread, while AdaptIM attempts to maximize the expected marginal influence spread. Specifically, recall that the expected number of mRR-sets generated by ASTI is proportional to . Meanwhile, the expected number of RR-sets generated by AdaptIM is proportional to , where is the maximum expected marginal influence spread in the -th round of seed selection in . For the last few rounds of seed selection, we have , which indicates that the number of mRR-sets generated by ASTI is much smaller than the number of RR-sets generated by AdaptIM. 
Consequently, ASTI runs remarkably faster than AdaptIM. As such, ASTI is preferable to AdaptIM, as the former provides significantly better efficiency and approximation guarantees than the latter, while offering similar empirical effectiveness. Note that the batched algorithms, i.e., ASTI-2, ASTI-4, and ASTI-8, reduce the running time significantly, to around , , and of ASTI, which makes them quite competitive with ATEUC in terms of efficiency, not to mention AdaptIM. In addition, as explained earlier, the termination condition in ATEUC is more easily satisfied when the threshold is larger, and hence, ATEUC runs faster as increases. On the other hand, the running times of the adaptive algorithms increase with . Therefore, ASTI-4 and ASTI-8 outperform ATEUC on datasets Epinions and Youtube when is relatively small, but when the threshold , the running times of all three algorithms become similar, as shown in Figures 5(b) and 5(c). Recall that ASTI-8 selects far fewer seed nodes than ATEUC does. Therefore, ASTI-8 strikes a good balance between efficiency and effectiveness in the current setting. We also observe that the running time of ASTI-8 fluctuates from to on datasets Epinions and Youtube. This is due to the combined effects of the threshold and the batch size. In these cases, it needs no more than nodes to reach the thresholds. Consequently, ASTI-8 finishes selecting seed nodes within just one round. However, when increases from to , the root size of mRR-sets decreases. As a consequence, it takes relatively less time to generate a random mRR-set in practice, which leads to the decrease in running time.
6.3. Results under the LT model
Seed Size vs. Threshold. Figure 6 reports the number of nodes selected by different algorithms under the LT model. In general, the results show similar trends to those observed in Figure 4. Similarly, AdaptIM selects nearly the same number of nodes as ASTI does on the four datasets, with negligible difference. ATEUC requires around more nodes than the five adaptive algorithms do. Details are displayed in Table 3. In addition, we also observe that ASTI-8 selects more nodes than ATEUC for several settings (e.g., on the Epinions and Youtube datasets). Through a careful analysis, we find that (i) all the algorithms select fewer nodes under the LT model than under the IC model, and (ii) ASTI-8 selects seed nodes in a batch with influence spread much higher than the requirements. These observations show that there is a tradeoff in the setting of the batch size. Increasing the batch size will speed up the algorithms but may result in more nodes selected.
Running Time vs. Threshold. Figure 7 shows the results of running time for different thresholds under the LT model. The conclusions we summarize for Figure 5 are generally applicable to Figure 7 as well. The major differences lie in two aspects: (i) the running time under the LT model is shorter than that under the IC model under the same setting, as it takes less time to generate a random mRR-set under the LT model than under the IC model (as mentioned and analyzed in previous work (Arora et al., 2017; Tang et al., 2018b)), which is consistent with the results in Figure 6; and (ii) ASTI-4 outperforms ATEUC on Epinions and ASTI-8 outperforms ATEUC on both Epinions and Youtube for all cases under the LT model. These results indicate that (i) the batched version of ASTI is more scalable than ATEUC, and (ii) when the batch size is well-calibrated, ASTI can beat ATEUC in both efficiency and effectiveness.
6.4. Discussions on Spread Distribution
As discussed previously, non-adaptive algorithms may find solutions with influence spread far away from the requirement (i.e., either under-qualified or over-qualified). Figure 8 reports the spread distribution of realizations achieved by the ASTI and ATEUC algorithms on the NetHEPT dataset under the IC and LT models, respectively. The solid (red) line in the figure represents the spread threshold () required. As shown, ATEUC fails to reach the threshold for and realizations under the IC and LT models, respectively, with corresponding percentages of and . In addition, for and realizations under the IC and LT models, respectively, the seed nodes selected by ATEUC produce influence spread much higher (over ) than the requirement. In contrast, ASTI meets the spread requirement for all the realizations under both the IC and LT models. Moreover, the spread produced by ASTI is generally kept close to the requirement. The spread exceeds the requirement by more than for only realizations under the LT model. These two over-qualified exceptions occur because the last seed node selected achieves a much higher spread than the remaining gap to reach , which rarely happens in practice. These observations indicate that non-adaptive algorithms are unreliable for seed minimization.
7. Conclusion
This paper studies the problem of adaptive seed minimization, and proposes algorithms that provide both strong theoretical guarantees and superior empirical effectiveness. Our approach is based on a novel ASTI framework instantiated by a truncated influence maximization algorithm TRIM, which has a provable approximation guarantee. The core of our TRIM algorithm is an elegant sampling method that generates random multi-root reverse reachable (mRR) sets for estimating the truncated influence spread. We also extend TRIM into its batched version TRIM-B to further improve the efficiency of seed selection. With extensive experiments on real data, we show that our solutions considerably outperform the state of the art for seed minimization under both the IC and LT diffusion models.
Acknowledgements.
This research is supported by the Singapore National Research Foundation under grant NRF-RSS2016-004, by the Singapore Ministry of Education Academic Research Fund Tier 2 under grant MOE2015-T2-2-069, by the National University of Singapore under an SUG, by the Singapore Ministry of Education Academic Research Fund Tier 1 under grant MOE2017-T1-002-024, and by a Discovery grant and a Discovery Accelerator Supplement grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).
Appendix A Concentration Bounds
We show some useful martingale concentration bounds, i.e., the Chernoff-like bounds (Tang et al., 2015) and their variants (Tang et al., 2018b).
Lemma A.1 (Tang et al., 2015).
Let be a martingale difference sequence such that for each . Let . If is identical for every , i.e., , then for any , we have
[TABLE]
Lemma A.2 (Tang et al., 2018b).
Let be a martingale difference sequence such that for each . Let . If is identical for every , i.e., , then for any , we have
[TABLE]
Appendix B Proofs
We first introduce the following lemma that is used to prove Theorem 3.1.
Lemma B.1 (Golovin and Krause, 2017).
If function satisfies all the following conditions:
- there exists such that for all ;
- is integer-valued;
- is self-certifying;
- is strong adaptive monotone;
- is strong adaptive submodular;
then an -approximate greedy policy achieves an approximation ratio of .
Proof of Theorem 3.1.
Obviously, for all and is an integer-valued function. Now, we need to prove that for any , , and
[TABLE]
where . Equation (20) expresses the self-certifying property, Equation (21) describes strong adaptive monotonicity, and Equations (22) and (23) capture strong adaptive submodularity.
Equation (20) obviously holds, i.e., if , we must have , and vice versa.
Equation (21) holds naturally as “selecting more nodes never hurts” the function .
Next, we prove Equation (22). Let be a realization of with probability according to the influence propagation. Let be the subset realizations of that are consistent with . That is, for every and every edge , the statuses of are the same in and such that both are either live or blocked. Then, for any ,
[TABLE]
In addition, for any , let be the set of nodes activated by in . Thus, is the spread of in under realization . As a consequence, the marginal truncated spread of in under is
[TABLE]
Similarly, for any , we have
[TABLE]
where the inequality is due to and . Therefore,
[TABLE]
Finally, we prove Equation (23). For any , we have
[TABLE]
Taking the expectation over completes the proof. ∎
Proof of Theorem 3.3.
We prove the elementary version of Equation (9), i.e., for any given realization ,
[TABLE]
where the expectation is only taken over the randomness of root size .
Let denote the number of nodes influenced by under . Let be the probability that none of the nodes sampled can be influenced by , which is given by
[TABLE]
Then, by the definition of , with probability , ; and with probability , . As a consequence, we have
[TABLE]
where the expectation on the right hand side is taken with respect to the randomness of . Let be the ratio of to , which is given by
[TABLE]
Now, we need to prove that . We consider the following two scenarios: (i) , and (ii) .
(i) : In this case, . Meanwhile,
[TABLE]
As for any , in the above equation,
[TABLE]
As by assumption, this implies that .
(ii) : In this case, $f(x)=\eta\big(1-\mathbb{E}[p(x)]\big)/x$. Take the derivative,
[TABLE]
Let . Take the derivative, when ,
[TABLE]
According to the definition of , we can get that
[TABLE]
Thus, decreases with , which indicates that . This implies that . As a consequence,
[TABLE]
where the last step above follows from the analysis for the case , by considering the special case .
Hence, the theorem is proved. ∎
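The two-case argument above can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the root-size rule (draw floor(n/η) roots, plus one more with probability equal to the fractional part of n/η, so that the expected root count is n/η) and the bounds being checked (1 − 1/e ≤ f(x) ≤ 1) are my reading of statements that did not survive in this section, and the names `miss_probability` and `ratio` are hypothetical. Roots are assumed to be sampled uniformly without replacement, so the probability that all k roots miss a set of x influenced nodes is hypergeometric.

```python
from fractions import Fraction
from math import comb


def miss_probability(n: int, x: int, k: int) -> Fraction:
    """Chance that k distinct, uniformly sampled roots all miss a fixed set of x nodes."""
    return Fraction(comb(n - x, k), comb(n, k))


def ratio(n: int, eta: int, x: int) -> float:
    """f(x): expected mRR-based estimate divided by the truncated spread min(x, eta)."""
    base = n // eta                  # ASSUMED root-size rule: floor(n / eta) roots ...
    frac = Fraction(n, eta) - base   # ... plus one extra root with this probability
    expected_miss = (1 - frac) * miss_probability(n, x, base)
    if frac > 0:
        expected_miss += frac * miss_probability(n, x, base + 1)
    return float(eta * (1 - expected_miss) / min(x, eta))


# Sweep every threshold eta and every spread size x on a toy graph size.
values = [ratio(50, eta, x) for eta in range(1, 51) for x in range(1, 51)]
lowest, highest = min(values), max(values)
```

In this sweep `lowest` stays above 1 − 1/e and `highest` does not exceed 1, which matches the two scenarios of the proof: the ratio equals 1 at x = 1 and decreases until x reaches η.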
Proof of Corollary 3.4.
Equation (10) follows directly from Theorem 3.3. By Equation (10),
[TABLE]
Hence, Equation (11) holds. ∎
Proof of Lemma 3.5.
We consider the special case of the adaptive seed minimization problem in which the probability for each edge . In this case, for any node , the set of nodes influenced by is the set of nodes that can be reached by in , denoted as the cover set . Thus, for each node , its cover set is deterministic. As a consequence, the adaptive seed minimization problem reduces to a set cover problem, i.e., finding as few nodes as possible to cover at least nodes. Feige (1998) has shown that no polynomial-time algorithm can approximate the optimal solution of set cover within a ratio of for any unless . Hence, Lemma 3.5 holds, noting that ASM generalizes set cover. ∎
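To make the reduction concrete, the following sketch builds the deterministic cover sets for a graph whose edges all have probability 1 and then picks nodes greedily until at least η nodes are covered. It is an illustration of the special case only, not the ASTI policy; the function names and the adjacency-dictionary input format are assumptions introduced here.

```python
from collections import deque


def reachable(adj, source):
    """Nodes reachable from source (including itself) by forward BFS."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_partial_cover(adj, eta):
    """Greedily choose nodes until their cover sets contain at least eta nodes."""
    cover = {v: reachable(adj, v) for v in adj}   # deterministic cover sets
    covered, seeds = set(), []
    while len(covered) < eta:
        best = max(cover, key=lambda v: len(cover[v] - covered))
        if not cover[best] - covered:
            raise ValueError("eta exceeds the number of nodes in the graph")
        seeds.append(best)
        covered |= cover[best]
    return seeds, covered


# Toy graph: every node appears as a key; all edges are live with probability 1.
toy = {0: [1, 2], 1: [2], 2: [], 3: [4], 4: [], 5: []}
seeds, covered = greedy_partial_cover(toy, eta=5)
```

On the toy graph, node 0 covers {0, 1, 2} and node 3 covers {3, 4}, so two seeds suffice to reach the threshold of five nodes.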
Proof of Lemma 3.6.
Let be the following event:
[TABLE]
Note that is the seed node returned by the policy which is a random variable. Let be the set of possible seed nodes selected (but not necessarily returned) by TRIM in the -th iteration in which each node has a probability such that , where . Let denote the set of random seed nodes returned at the -th iteration of TRIM, where . Therefore, the event does not happen only if there exists a node at iteration satisfying that does not happen.
If TRIM stops at the iteration , according to the setting of and by (Tang et al., 2015), we have
[TABLE]
If TRIM stops at the iteration , for any node , we define two events and as
[TABLE]
where is the expected coverage of in . Then, if is independent of , by Lemma A.2, we have
[TABLE]
By a union bound ensuring that all the nodes satisfy Equation (26), we have
[TABLE]
Meanwhile, is independent of naturally. Thus, together with the fact that , by Equation (27)
[TABLE]
As a consequence, when TRIM stops at , if the event does not happen, then at least one of the events and does not happen. Thus, the event does not happen for all with probability at most:
[TABLE]
Combining Equations (25) and (28) shows that the event holds with probability at least . Thus, together with the Equation (11) in Corollary 3.4, we have
[TABLE]
Hence, the lemma is proved. ∎
Proof of Lemma 3.8.
For any node , is not visited by a random mRR-set if and only if . The probability for not visiting under a realization is , where is the number of nodes that can be activated by in under . On the other hand, if a node is visited, all of its incoming edges will be examined. Let denote the number of ’s incoming edges. Then, the expected time complexity for generating a random mRR-set is
[TABLE]
where the expectation is over the randomness of both and . In addition, we already know that
[TABLE]
[TABLE]
Hence, the lemma is proved. ∎
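The cost argument above counts one unit of work per incoming edge of every visited node. A minimal sketch of mRR-set generation under the IC model that tracks this count is given below. It assumes the roots have already been chosen and that the graph is supplied as a dictionary of incoming edges; the function name `generate_mrr_set` and this input format are hypothetical, not taken from the paper.

```python
import random
from collections import deque


def generate_mrr_set(in_edges, roots, rng):
    """Reverse BFS from several roots under the IC model.

    in_edges[v] is a list of (u, p) pairs: edge u -> v is live with probability p.
    Returns the visited node set and the number of incoming edges examined.
    """
    visited = set(roots)
    queue = deque(visited)
    examined = 0
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, []):
            examined += 1  # every incoming edge of a visited node is examined
            if u not in visited and rng.random() < p:
                visited.add(u)
                queue.append(u)
    return visited, examined


# Edges 0 -> 1, 1 -> 2 and 3 -> 2, all live with probability 1.
in_edges = {1: [(0, 1.0)], 2: [(1, 1.0), (3, 1.0)]}
mrr, work = generate_mrr_set(in_edges, roots=[2], rng=random.Random(0))
```

With all probabilities equal to 1 the outcome is deterministic: the mRR-set contains all four nodes, and the work equals the total in-degree of the visited nodes.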
Proof of Lemma 3.9.
Let and be the root of
[TABLE]
where for any and . Let
[TABLE]
As , one can verify that $\theta^{\ast}=O\big(\frac{\eta_{i}\ln n_{i}}{\varepsilon^{2}\operatorname{OPT}_{i}}\big)$. (Without loss of generality, we assume . If , TRIM achieves a higher approximation of with $O\big(\frac{\eta_{i}\ln n_{i}}{\operatorname{OPT}_{i}}\big)$ mRR-sets.) Define the events as follows:
[TABLE]
Then, when a number of mRR-sets are generated, by Lemma A.1, it is easy to verify that any event () does not happen with probability at most
[TABLE]
By the union bound, the probability that all the events happen is at least .
If the events happen,
[TABLE]
Thus, we have
[TABLE]
In addition, let
[TABLE]
According to the definition of , we have
[TABLE]
Since , if event happens (i.e., ), then
[TABLE]
As a consequence, if event also happens, we have
[TABLE]
Similarly, let
[TABLE]
According to the definition of , we have
[TABLE]
Since , if event happens (i.e., ), then
[TABLE]
As a consequence, we have
[TABLE]
Putting (31), (32) and (33) together, we have
[TABLE]
Therefore, when a number of mRR-sets are generated, TRIM does not stop only if at least one of the events in does not happen, with probability at most .
Let be the first iteration that the number of mRR-sets generated by TRIM reaches such that and . From this iteration onward, the expected number of mRR-sets further generated is at most
[TABLE]
The first inequality is due to and , and the second inequality is due to . If the algorithm stops before the -th iteration, there are at most random samples generated. Therefore, the expected number of random samples generated is less than , which is $O\big(\frac{\eta_{i}\ln n_{i}}{\varepsilon^{2}\operatorname{OPT}_{i}}\big)$.
Hence, the lemma is proved. ∎
Proof of Lemma 3.10.
The time complexity of TRIM is dominated by the time for generating mRR-sets. By Wald’s equation (Wald, 1947), the expected total time used for generating mRR-sets equals the expected number of mRR-sets generated, times the expected time used for generating one mRR-set. Thus, according to Lemmas 3.8 and 3.9, the expected time complexity of TRIM is . ∎
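For reference, Wald’s equation as used here can be written as follows. The symbols are introduced only for illustration: $\theta$ denotes the (random) number of mRR-sets generated and $T_j$ the time spent on the $j$-th one. The identity assumes the $T_j$ are i.i.d. with finite mean and that $\theta$ is a stopping time with finite expectation.

```latex
\mathbb{E}\Big[\sum_{j=1}^{\theta} T_j\Big] \;=\; \mathbb{E}[\theta]\cdot\mathbb{E}[T_1]
```

The first factor is bounded by Lemma 3.9 and the second by Lemma 3.8.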
Proof of Lemma 4.1.
Let be the seed set returned by the batched policy with and be the corresponding optimal seed set in the -th round. Let be the following event:
[TABLE]
Let be the generalized definition of in Section 3.5. If is returned at the -th iteration, based on the setting of and by (Tang et al., 2015), we still have
[TABLE]
If TRIM-B stops at the iteration , for any node obtained by the greedy method with , we define two events and as
[TABLE]
where is the expected coverage of in .
Based on Lemma A.2, we have
[TABLE]
Similarly, by a union bound over all candidate size- node sets, we immediately have
[TABLE]
Let be the size- seed set that covers the largest number of mRR-sets in . Since is derived by the greedy method from , by the property of the greedy method, we have . Then can be taken as an upper bound of . Similarly, by Lemma A.2, we have the following:
[TABLE]
Following the analysis in Section 3.5, we obtain that event holds with probability at least where . By Corollary 3.4, the expected approximation ratio of TRIM-B is at least
[TABLE]
Hence, the lemma is proved. ∎
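The proof relies on the standard property of greedy maximum coverage: a greedily chosen set of b nodes covers at least a (1 − 1/e) fraction of the mRR-sets covered by the best size-b set. A minimal sketch of that greedy step is shown below; the function name and the representation of mRR-sets as Python sets of node ids are assumptions made for illustration, and the sketch omits the sampling and stopping logic of TRIM-B.

```python
def greedy_max_coverage(mrr_sets, b):
    """Pick up to b nodes greedily, maximizing the number of mRR-sets they hit."""
    covered = set()                              # indices of mRR-sets already hit
    seeds = []
    candidates = sorted(set().union(*mrr_sets))  # all nodes appearing in some mRR-set

    def gain(v):
        # Number of not-yet-covered mRR-sets that contain node v.
        return sum(1 for j, r in enumerate(mrr_sets) if j not in covered and v in r)

    for _ in range(b):
        remaining = [v for v in candidates if v not in seeds]
        if not remaining:
            break
        best = max(remaining, key=gain)
        seeds.append(best)
        covered.update(j for j, r in enumerate(mrr_sets) if best in r)
    return seeds, len(covered)


# Node 1 hits four of the seven sets; node 3 then hits two of the remaining three.
sample = [{1, 2}, {1}, {1, 3}, {1}, {3}, {3}, {4}]
batch, hit = greedy_max_coverage(sample, b=2)
```

Recomputing every gain from scratch keeps the sketch short; a practical implementation would maintain per-node coverage counts and update them incrementally.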
Appendix C Discussions on Influence Spread
Figure 9 reports the spread of the tested algorithms under the IC model (results under the LT model are similar). For the most part, all the algorithms achieve a comparable spread on the four datasets. The major differences lie in on Epinions and Youtube. As observed, ASTI-8 (resp. ATEUC) achieves the largest (resp. smallest) spread among all algorithms. This is because the batch size is relatively large with regard to the small threshold, owing to which the spread of the 8-size seed set selected by ASTI-8 significantly overshoots on Epinions and Youtube. Another interesting observation is that the spread achieved by ATEUC is slightly larger than that of each of the other five adaptive algorithms as the threshold becomes larger (not quite noticeable in the figure). This is because ATEUC selects considerably more seeds than the adaptive algorithms do, resulting in a larger spread at the cost of an excessive number of seeds. This is also supported by the results in Table 3.
Appendix D Discussions on Marginal Truncated Spread
To explore the property of the marginal truncated spread, we record the marginal spread of each seed node selected by adaptive algorithms under the realizations sampled. Figure 10 shows the result of each realization with on the corresponding datasets (or on the LiveJournal dataset) under the IC model. (The result under the LT model is similar.) In general, the marginal spread diminishes as the index of the seed node increases, which is consistent with the property of submodularity as expected. Note that the spread fluctuation is due to the randomness of the tested realizations, i.e., in some particular realizations, a seed node selected later may influence more nodes than a seed node selected earlier.
References
- Arora et al. (2017) Akhil Arora, Sainyam Galhotra, and Sayan Ranu. 2017. Debunking the Myths of Influence Maximization: An In-Depth Benchmarking Study. In Proc. ACM SIGMOD. 651–666.
- Asadpour et al. (2008) Arash Asadpour, Hamid Nazerzadeh, and Amin Saberi. 2008. Stochastic Submodular Maximization. In Proc. WINE. 477–489.
- Badanidiyuru et al. (2016) Ashwinkumar Badanidiyuru, Christos Papadimitriou, Aviad Rubinstein, Lior Seeman, and Yaron Singer. 2016. Locally Adaptive Optimization: Adaptive Seeding for Monotone Submodular Functions. In Proc. SODA. 414–429.
- Barbieri et al. (2012) Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2012. Topic-Aware Social Influence Propagation Models. In Proc. IEEE ICDM. 81–90.
- Borgs et al. (2014) Christian Borgs, Michael Brautbar, Jennifer Chayes, and Brendan Lucier. 2014. Maximizing Social Influence in Nearly Optimal Time. In Proc. SODA. 946–957.
- Chen (2009) Ning Chen. 2009. On the Approximability of Influence in Social Networks. SIAM Journal on Discrete Mathematics 23, 3 (2009), 1400–1415.
- Chen et al. (2010a) Wei Chen, Chi Wang, and Yajun Wang. 2010a. Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks. In Proc. ACM KDD. 1029–1038.
