The Secure Link Prediction Problem
Laltu Sardar, Sushmita Ruj

TL;DR
This paper introduces the first study of secure link prediction on encrypted graphs, proposing three algorithms, proving their security, and testing them on real datasets to protect sensitive network information.
Contribution
It presents the first algorithms for link prediction on encrypted graphs, with formal security proofs and practical implementations.
Findings
Algorithms successfully predict links on encrypted data
Security of the schemes is formally proven
Real-life dataset experiments demonstrate practicality
[TABLE: comparison of the three proposed schemes by leakage, storage, computation, and communication for the client, the CS, and the PS]
| Dataset Name | #Nodes | #Edges |
|---|---|---|
| bitcoin-alpha | 3,783 | 24,186 |
| ego-facebook | 4,039 | 88,234 |
| email-Enron | 36,692 | 183,831 |
| email-Eu-core | 1,005 | 25,571 |
| Wiki-Vote | 7,115 | 103,689 |
The Secure Link Prediction Problem
Laltu Sardar,
Cryptology and Security Research Unit,
Indian Statistical Institute, Kolkata, India
E-mail: [email protected]
Sushmita Ruj,
R C Bose Centre for Cryptology and Security
Indian Statistical Institute, Kolkata, India
E-mail: [email protected]
Abstract
Link Prediction is an important and well-studied problem for social networks. Given a snapshot of a graph, the link prediction problem predicts which new interactions between members are most likely to occur in the near future. As networks grow in size, data owners are forced to store the data in remote cloud servers which reveals sensitive information about the network. The graphs are therefore stored in encrypted form.
We study the link prediction problem on encrypted graphs. To the best of our knowledge, this secure link prediction problem has not been studied before. We use the number of common neighbors for prediction. We present three algorithms for the secure link prediction problem. We design prototypes of the schemes and formally prove their security. We execute our algorithms in real-life datasets.
Keywords— Link prediction, Homomorphic encryption, Garbled circuit, Secure computation, Cloud computing.
1 Introduction
Social networks have become an integral part of our lives. These networks can be represented as graphs with nodes being entities (members) of the network and edges representing the association between entities (members). As the size of these graphs increases, it becomes quite difficult for small enterprises and business units to store the graphs in-house. So, there is a desire to store such information in cloud servers.
In order to protect the privacy of individuals (as is now mandatory in the EU and other places), data is often anonymized before being stored in remote cloud servers. However, as pointed out by Backstrom et al. [3], anonymization does not imply privacy. By carefully studying the associations between members, a lot of information can be gleaned.
The data owner, therefore, has to store the data in encrypted form. Trivially, the data owner can upload all data in encrypted form to the cloud. Whenever a query is made, the data owner then has to download all the data, perform the necessary computations, and re-upload the re-encrypted data. This is very inefficient and defeats the purpose of a cloud service. Thus, we need to keep the data stored in the cloud in encrypted form in such a way that we can compute efficiently on the encrypted data.
Some basic queries for a graph are neighbor query (given a vertex return the set of vertices adjacent to it), vertex degree query (given a vertex, return the number of adjacent vertices), adjacency query (given two vertices return if there is an edge between them) etc. It is important that when an encrypted graph supports some other queries, like shortest distance queries, it should not stop supporting these basic queries.
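As a plaintext point of reference, the three basic queries can be sketched on an adjacency matrix as follows. This is only an illustration on unencrypted data; the function names and the toy graph are assumptions made for the example and do not come from the paper.

```python
from typing import List

Matrix = List[List[int]]


def neighbors(adj: Matrix, v: int) -> List[int]:
    """Neighbor query: vertices adjacent to v."""
    return [u for u, bit in enumerate(adj[v]) if bit == 1]


def degree(adj: Matrix, v: int) -> int:
    """Vertex degree query: number of vertices adjacent to v."""
    return sum(adj[v])


def adjacent(adj: Matrix, u: int, v: int) -> bool:
    """Adjacency query: is there an edge between u and v?"""
    return adj[u][v] == 1


# Hypothetical toy graph: edges 0-1, 1-2, 1-3.
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

if __name__ == "__main__":
    print(neighbors(ADJ, 1), degree(ADJ, 1), adjacent(ADJ, 0, 2))
```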
Liben-Nowell and Kleinberg [10] first defined the link prediction problem for social networks. The link prediction problem asks, given a snapshot of a graph, which new interactions between members are most likely to occur in the near future. For example, given a node $v$ at an instant, the link prediction problem tries to find the node with which $v$ is most likely to connect at a later instant. Different metrics are used to measure the likelihood of the formation of new links; the resulting values are called scores [10]. Liben-Nowell and Kleinberg [10] considered several metrics, including common neighbors, Jaccard’s coefficient, Adamic/Adar, preferential attachment, and Katz$_\beta$. For example, if two nodes with no edge between them have a large number of common neighbors, they are more likely to be connected in the future. In this paper, for simplicity, we consider the common neighbors metric to predict the emergence of a link.
Although there is a large body of literature on link prediction, to the best of our knowledge the secure version of the problem has not been studied to date. The Secure Link Prediction (SLP) problem is to run link prediction over secure, i.e., encrypted, data.
Our Contribution We introduce the notion of secure link prediction and present three constructions. In particular, we ask and answer the question, “Given a snapshot of a graph $G=(V,E)$ ($V$ is the set of vertices and $E\subseteq V\times V$) at a given instant and a vertex $v\in V$, which is the most likely vertex $u$ such that $u$ is a neighbor of $v$ at a later instant and $vu\notin E$?” The score metric we consider is the number of common neighbors of the two vertices $v$ and $u$. This can be used to answer the question, “Given a snapshot of a graph $G$ at a given instant and a vertex $v$, which are the $k$ most likely neighbors of $v$ at a later instant such that none of these vertices were neighbors of $v$ in $G$?”
Note that the data owner outsources an encrypted copy of the graph to the cloud and sends an encrypted vertex as a query. The cloud runs the secure link prediction algorithm and returns an encrypted result, from which the client can obtain the most likely neighbor of $v$. The cloud knows neither the graph $G$ nor the queried vertex $v$.
It is to be noted that the client has much less computational and storage capacity than the servers. We propose three schemes (SLP-I, SLP-II and SLP-III), in all of which the client takes the help of a proxy server, which makes it efficient to obtain query results. At a high level:
1. SLP-I is the most efficient, with almost no computation at the client side, and leaks only the scores to the proxy server.
2. SLP-II has a little more communication at the client side compared to SLP-I but leaks the scores of only a subset of vertices to the proxy server.
3. SLP-III is a very efficient scheme with almost no computation and communication at the client side and leaks almost nothing to the proxy. This is achieved with an extra computational and communication cost between the cloud and the proxy.
In all three schemes, the client leaks nothing to the cloud except the number of vertices in the graph.
We have designed the schemes so that they support link prediction queries as well as basic queries. Each of the previous schemes on encrypted graphs is designed to support a specific query (for example, shortest distance query or focused subgraph query). In contrast, our schemes are more general and support not only link prediction queries but also basic queries, including neighbor, vertex degree, and adjacency queries.
All our schemes are shown to be adaptively secure in the real-ideal paradigm.
Further, we have analyzed the performance of the schemes in terms of storage requirement, computation cost and communication cost, and estimated the execution time of the schemes assuming benchmark implementations of some underlying cryptographic primitives. We have implemented prototypes for two of the schemes and measured their performance on different real-life datasets to study their feasibility.
From the experiment, we see that they take s and s to encrypt whereas s and s process query for a graph with vertices.
Organization The rest of the paper is organized as follows. Related work is discussed in Section 2. Preliminaries and cryptographic tools are discussed in Section 3. The link prediction problem and its security are described in Section 4. Section 5 describes our proposed scheme SLP-I. Two improvements of SLP-I, namely SLP-II and SLP-III, are discussed in Section 6 and Section 7 respectively. In Section 8, a comparative study of the complexities of our proposed schemes is given. In Section 9, details of our implementation and experimental results are shown. A variant of the link prediction problem is introduced in Section 10. Finally, a summary of the paper and future research directions are given in Section 11.
2 Related Work
Graph algorithms are well studied when the graph is not encrypted. Since the need to outsource graph data in encrypted form is growing quickly and encryption makes it difficult to run those algorithms, further study is required to enable them. There are only a few works that deal with queries on outsourced encrypted graphs.
Chase and Kamara [6] introduced the notion of graph encryption while presenting structured encryption as a generalization of searchable symmetric encryption (SSE) proposed by Song et al. [20]. They presented schemes for neighbor queries, adjacency queries and focused subgraph queries on labeled graph-structured data. In all of their proposed schemes, the graph was considered as an adjacency matrix and each entry was encrypted separately using symmetric key encryption. The main idea of their scheme is that, given a vertex and the corresponding key, the scheme returns the adjacent vertices. However, complex queries require complex operations (like addition, subtraction, division, etc.) on the adjacency matrix, which makes the scheme unsuitable for them.
A parallel secure computation framework, GraphSC, has been designed and implemented by Nayak et al. [16]. This framework computes functions like histogram, PageRank, matrix factorization, etc. To run these algorithms, GraphSC introduced parallel programming paradigms to secure computation. The parallel and secure execution enables the algorithms to scale even to large datasets. However, they adopt Path-ORAM [21] based techniques, which are inefficient if the client has little computation power or does not have a very large amount of RAM.
Sketch-based approximate shortest distance queries over encrypted graph have been studied by Meng et al. [14]. In the pre-processing stage, the client computes the sketches for every vertex that is useful for efficient shortest distance query. Instead of encrypting the graph directly, they encrypted the pre-processed data. Thus, in their scheme, there is no chance of getting information about the original graph.
Shen et al. [19] introduced and studied cloud-based approximate constrained shortest distance queries in encrypted graphs which finds the shortest distance with a constraint such that the total cost does not exceed a given threshold.
Exact distance has been computed on dynamic encrypted graphs in [22]. Similar to our paper, this paper uses a proxy to reduce client-side computation and information leakage to the cloud. In the scheme, adjacency lists are stored in an inverted index. However, in a single query, the scheme leaks all the nodes reachable from the queried vertex which is a lot of information about the graph. For example, if the graph is complete, it reveals the whole graph.
A graph encryption scheme that supports top-$k$ nearest keyword search queries has been proposed by Liu et al. [12]. They built an encrypted index for searching using order-preserving encryption. Together with lightweight symmetric key encryption schemes, homomorphic encryption is used to compute on encrypted data.
Besides, Zheng et al. [24] proposed privacy-preserving link prediction in decentralized social networks. Their construction splits the link score into private and public parts and applies sparse logistic regression to find links based on the content of the users. However, the graph data was not considered to be encrypted in these privacy-preserving link prediction schemes.
In this paper, we outsource the graph in encrypted form. In most of the previous works, the schemes are designed to perform a single specific query such as neighbor query [6], shortest distance query [14, 19, 22], or focused subgraph query [6]. So, either it is hard to get information about the source graph [14, 19], as they do not support basic queries, or a lot of information is leaked by a single query [22]. One trivial approach is to take different schemes and use all of them to support all types of required queries. In this paper, our target is to let the client obtain as much information about the graph as possible whenever required, while supporting the link prediction query and leaking as little information as possible. To the best of our knowledge, the secure link prediction problem has not been studied before. We study the issues that arise for link prediction on encrypted outsourced data and give three possible solutions that overcome them.
3 Preliminaries
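Let $G=(V,E)$ be a graph and $A$ be its $N\times N$ adjacency matrix, where $N$ is the number of vertices. Let $\lambda$ be the security parameter. The set of positive integers is denoted by $\mathbb{N}$. By $x\xleftarrow{\$}X$, we mean choosing a random element $x$ from the set $X$. The function $id:\{0,1\}^{*}\rightarrow\{0,1\}^{\log N}$ gives the identifier corresponding to a vertex. A function $negl:\mathbb{N}\rightarrow\mathbb{R}$ is said to be negligible in $n$ if $\forall c\in\mathbb{N}$, $\exists N_{c}\in\mathbb{N}$ such that $\forall n>N_{c},\ negl(n)<n^{-c}$.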
A probabilistic polynomial-time (PPT) permutation is said to be a Pseudo Random Permutation (PRP) if it is indistinguishable from a random permutation by any PPT adversary. We consider two PRPs, $F_{k_{perm}}$ and $\pi_{s}$, where $k_{perm}$ and $s$ are their keys (or seeds) respectively.
3.1 Bilinear Maps
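Let $\mathbb{G}$ and $\mathbb{G}_{1}$ be two (multiplicative) cyclic groups of order $n$ and $g$ be a generator of $\mathbb{G}$. A map $e:\mathbb{G}\times\mathbb{G}\rightarrow\mathbb{G}_{1}$ is said to be an admissible non-degenerate bilinear map if
1. $\forall u,v\in\mathbb{G}$ and $\forall a,b\in\mathbb{Z}$, we have $e(u^{a},v^{b})=e(u,v)^{ab}$,
2. $e(g,g)\neq 1$, and
3. $e$ can be computed efficiently.
Our algorithms use the bilinear-map-based BGN encryption scheme [4]. So, we first discuss this.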
3.2 BGN Encryption Scheme
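Boneh et al. [4] proposed a homomorphic encryption scheme (henceforth referred to as the BGN encryption scheme) that allows an arbitrary number of additions and one multiplication. The scheme consists of three algorithms: key generation, encryption, and decryption.
Key generation: This takes a security parameter $\lambda$ as input and outputs a public-private key pair $(pk,sk)$ (see Algo. 1). Here, $pk=(n,\mathbb{G},\mathbb{G}_{1},e,g,h)$ and $sk=q_{1}$. In $pk$, $e$ is a bilinear map from $\mathbb{G}\times\mathbb{G}$ to $\mathbb{G}_{1}$ where both $\mathbb{G}$ and $\mathbb{G}_{1}$ are groups of order $n$. Note that, given $\lambda$, $\mathcal{G}(\lambda)$ returns $(q_{1},q_{2},\mathbb{G},\mathbb{G}_{1},e)$ (see [4]) where $q_{1}$ and $q_{2}$ are two large primes, and $\mathbb{G}$ and $\mathbb{G}_{1}$ are groups of order $n=q_{1}q_{2}$.
Encryption: An integer $m$ is encrypted in $\mathbb{G}$ using Algo. 2. Let $m_{1}$ and $m_{2}$ be two integers that are encrypted in $\mathbb{G}$ as $c_{1}$ and $c_{2}$. Then, the bilinear map value $e(c_{1},c_{2})$, which belongs to $\mathbb{G}_{1}$, gives the encryption of $m_{1}m_{2}$. Note that arbitrary addition of plaintexts is also possible in the group $\mathbb{G}_{1}$. If $g$ is a generator of the group $\mathbb{G}$, $e(g,g)$ acts as a generator of the group $\mathbb{G}_{1}$. Thus, the encryption of an integer is possible in $\mathbb{G}_{1}$ in a similar manner (see Algo. 4).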
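The homomorphic properties used here can be written out as a short derivation. The notation below follows the original BGN paper [4] and is an assumption with respect to this paper's own algorithms: $n=q_{1}q_{2}$, $h=g^{\alpha q_{2}}$ for some integer $\alpha$, and ciphertexts $C_{i}=g^{m_{i}}h^{r_{i}}$ with random $r_{i}$.

```latex
\begin{align*}
C_1 \cdot C_2 &= g^{m_1+m_2}\, h^{r_1+r_2}
  && \text{(an encryption of } m_1+m_2 \text{ in } \mathbb{G}\text{)}\\
e(C_1, C_2) &= e\!\left(g^{m_1}h^{r_1},\, g^{m_2}h^{r_2}\right)\\
  &= e(g,g)^{m_1 m_2}\; e(g,h)^{m_1 r_2 + m_2 r_1}\; e(h,h)^{r_1 r_2}\\
  &= e(g,g)^{m_1 m_2}\; e(g,h)^{m_1 r_2 + m_2 r_1 + \alpha q_2 r_1 r_2}
  && \text{(an encryption of } m_1 m_2 \text{ in } \mathbb{G}_1\text{)}
\end{align*}
```

The last line has the same shape as a fresh ciphertext, with $e(g,g)$ in the role of $g$ and $e(g,h)$ in the role of $h$, which is why further additions remain possible in $\mathbb{G}_{1}$ but a second multiplication is not.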
Decryption: At the time of encryption, each entry is randomized. The secret key eliminates the randomization. Then, it is enough to find the discrete logarithm of the result. Algo. 3 and Algo. 5 describe decryption in $\mathbb{G}$ and $\mathbb{G}_{1}$ respectively. In the decryption algorithms, the discrete logarithm can be computed in expected time proportional to the square root of the size of the message space using Pollard’s lambda method [13]. However, it can be done in constant time using some extra storage [4].
Let BGN be the encryption scheme described above. Then, it is a tuple of five algorithms (key generation, encryption in $\mathbb{G}$, decryption in $\mathbb{G}$, encryption in $\mathbb{G}_{1}$, and decryption in $\mathbb{G}_{1}$) as described in Algo. 1, 2, 3, 4 and 5 respectively.
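The remark about constant-time decryption with extra storage can be illustrated with a lookup table of precomputed powers. The sketch below works in a toy multiplicative group modulo a small prime rather than in a pairing group; the modulus, base, and bound are assumptions chosen only so that the example runs, and the function names are hypothetical.

```python
from typing import Dict, Optional


def build_dlog_table(base: int, modulus: int, max_exponent: int) -> Dict[int, int]:
    """Precompute base**k mod modulus for k = 0..max_exponent (the extra storage)."""
    table: Dict[int, int] = {}
    value = 1
    for exponent in range(max_exponent + 1):
        # Keep the smallest exponent if a value ever repeats.
        table.setdefault(value, exponent)
        value = (value * base) % modulus
    return table


def dlog_lookup(table: Dict[int, int], target: int) -> Optional[int]:
    """Dictionary lookup: expected constant time, None if out of range."""
    return table.get(target)


# Toy parameters (assumptions): a small prime and a base of large order.
P = 1019
G = 2
MAX_SCORE = 50  # scores are bounded by the number of vertices in practice

TABLE = build_dlog_table(G, P, MAX_SCORE)

if __name__ == "__main__":
    print(dlog_lookup(TABLE, pow(G, 37, P)))
```

Since a common-neighbor score never exceeds the number of vertices, a table of that size would suffice for this use; for larger message spaces, Pollard's lambda method trades the storage for computation.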
3.3 Garbled Circuit (GC)
Let us consider two parties, with inputs $x$ and $y$ respectively, who want to compute a function $f(x,y)$. Then, a garbled circuit [23, 11] allows them to compute $f(x,y)$ in such a way that neither party gets any ‘meaningful information’ about the input of the other party, and no one other than the two parties is able to compute $f(x,y)$.
Kolesnikov et al. [8] introduced an optimization of garbled circuits that allows XOR gates to be computed without communication or cryptographic operations [22]. Kolesnikov et al. [7] presented efficient GC constructions for several basic functions using the garbled circuit construction of [8]. In this paper, we use garbled circuit blocks for the subtraction, comparison and multiplexer functions from [8].
4 The Secure Link Prediction (SLP) Problem
Given $v\in V$, let $N_{v}$ denote the set of vertices adjacent to $v$. Let $score(v,u)$ be a measure of how likely the vertex $v$ is to be connected to another vertex $u$ in the near future, where $u\notin N_{v}$. A variant of the link prediction problem states that, given $v\in V$, it returns a vertex $u$ ($u\notin N_{v}$, $u\neq v$) such that $score(v,u)$ is the maximum, i.e.,
$$score(v,u)=\max\{score(v,u^{\prime}) : u^{\prime}\in V\setminus(N_{v}\cup\{v\})\}.$$
Thus, given a vertex $v$, we find the vertex it is most likely to connect with. There are various metrics to measure the score, like the number of common neighbors, Jaccard’s coefficient, the Adamic/Adar metric, etc. In this paper, we consider $score(v,u)$ as the number of common nodes between $v$ and $u$, i.e., $score(v,u)=|N_{v}\cap N_{u}|$. Let $A$ be the adjacency matrix of the graph $G$. If $A_{v}$ and $A_{u}$ are the rows corresponding to the vertices $v$ and $u$ respectively, then the score is the inner product of the rows, i.e., $score(v,u)=\sum_{k=1}^{N}A_{v}[k]\cdot A_{u}[k]$. In this paper, we use the BGN encryption scheme to securely compute this inner product.
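On plaintext data, the common-neighbor score and the resulting prediction can be sketched as below. This is a minimal illustration of the inner-product formulation only, with no encryption involved; the function names and the toy graph are assumptions made for the example.

```python
from typing import List, Optional

Matrix = List[List[int]]


def common_neighbor_scores(adj: Matrix, v: int) -> List[int]:
    """score(v, u) for every u, as the inner product of adjacency rows."""
    row_v = adj[v]
    return [sum(a * b for a, b in zip(row_v, adj[u])) for u in range(len(adj))]


def predict_link(adj: Matrix, v: int) -> Optional[int]:
    """Return a non-adjacent vertex u != v with maximum score, or None."""
    scores = common_neighbor_scores(adj, v)
    candidates = [u for u in range(len(adj)) if u != v and adj[v][u] == 0]
    if not candidates:
        return None
    return max(candidates, key=lambda u: scores[u])


# Hypothetical toy graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
ADJ = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

if __name__ == "__main__":
    print(common_neighbor_scores(ADJ, 0), predict_link(ADJ, 0))
```

Note that the inner product of a row with itself equals the degree of the vertex, so the queried vertex has to be excluded explicitly from the candidates.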
4.1 System Overview
Here, we describe the system model considered for the link prediction problem and goals which we want to achieve.
System Model: In our model (see Fig. 1), there is a client, a cloud server, and a proxy server. Each of them communicates with others to execute the protocol.
The client is the data owner and is considered to be trusted. It outsources the graph in encrypted form to the cloud server and generates link prediction queries. Given a vertex $v$, it queries for the vertex which is most likely to be connected to $v$ in the future.
The cloud server (CS) holds the encrypted graph and computes over the encrypted data when the client issues a query. We assume that the cloud server is honest-but-curious. It is curious to learn and analyze the encrypted data and queries. Nevertheless, it is honest and follows the protocol.
The proxy server (PS) helps the cloud server and the client to find the most likely vertex securely. It reduces the computational overhead of the client by performing decryptions. Like the cloud server, the proxy server is assumed to be honest-but-curious.
All channels connecting the client, the cloud and the proxy servers are assumed to be secure. An adversary can eavesdrop on the channels but cannot tamper with messages sent along them. We further assume that the cloud and the proxy servers do not collude.
This system model aims to outsource as much computation as possible without leaking information about the data, assuming the client has very low computation power (like mobile devices). This kind of model for outsourcing computation was previously used by Wang et al. [22] for secure comparison. The assumption that the proxy and cloud servers do not collude is standard.
Design Goals: In this paper, under the assumption of the above system model, we aim at providing a solution for the secure link prediction problem. In our design, we want to achieve the following objectives.
1. Confidentiality: The cloud and proxy servers should not get any information about the graph structure, i.e., the servers should not be able to construct a graph which is isomorphic to the source graph.
2. Efficiency: In our model, the client is weak with respect to storage and computation. Since the cloud server has a large amount of storage and computation power, the client outsources the data to it.
Moreover, the client should be able to efficiently perform neighbor, vertex degree, and adjacency queries. These are the basic queries that every graph should support. The client should leak as little information as possible.
4.2 Secure Link Prediction Scheme
In this section, we present the definition of a link prediction scheme for a graph and its security against adaptive chosen-query attacks.
Definition 1.
A secure link prediction (SLP) scheme for a graph $G$ is a tuple of five algorithms, for key generation, graph encryption, trapdoor generation, link prediction query, and maximum-score vertex selection, as follows.
- Key generation is a client-side PPT algorithm that takes $\lambda$ as a security parameter and outputs a public key and a secret key.
- Graph encryption is a client-side PPT algorithm that takes a public key, a secret key and a graph $G$ as inputs and outputs a structure that stores the encrypted adjacency matrix of $G$.
- Trapdoor generation is a client-side PPT algorithm that takes a secret key and a vertex $v$ as inputs and outputs a query trapdoor $\tau_{v}$.
- Link prediction query is a PPT algorithm run by the cloud server that takes a query trapdoor $\tau_{v}$ and the structure as inputs and outputs a list of encrypted scores with all vertices.
- Maximum-score vertex selection is a PPT algorithm run by the proxy server that takes the keys and the list of encrypted scores as inputs and outputs the identifier of the vertex most likely to connect with the queried vertex.
Correctness: An SLP scheme is said to be correct if, for all keys generated using the key generation algorithm and all sequences of queries on the encrypted graph, each query outputs a correct vertex identifier except with negligible probability.
Adaptive security: An SLP scheme should have two properties:
1. Given the encrypted structure, the cloud server should not learn any information about the graph $G$, and
2. From a sequence of query trapdoors, the servers should learn nothing about the corresponding queried vertices.
The security of an SLP scheme is defined in the real-ideal paradigm. In the real scenario, the challenger $\mathcal{C}$ generates keys. The adversary $\mathcal{A}$ generates a graph, which it sends to $\mathcal{C}$. $\mathcal{C}$ encrypts the graph with its secret key and sends it to $\mathcal{A}$. Later, $q$ times, $\mathcal{A}$ chooses a query vertex based on previous results (i.e., adaptively) and receives the trapdoor for the current query. Finally, $\mathcal{A}$ outputs a guess bit $b$. In the ideal scenario, on receiving the graph, the simulator $\mathcal{S}$ generates a simulated encrypted matrix. For each adaptive query of $\mathcal{A}$, $\mathcal{S}$ returns a simulated token. Finally, $\mathcal{A}$ outputs a guess bit $b^{\prime}$. The security definition (Definition 2) ensures that $\mathcal{A}$ cannot distinguish $\mathcal{C}$ from $\mathcal{S}$.
We have assumed that the communication channels between the client and the servers are secure. Since the CS and the PS do not collude, they do not share their collected information. So, the simulator can treat the CS and the PS separately.
In our scheme, the proxy server does not have the encrypted data or the trapdoors. During a query operation, it gets a set of scrambled scores of the queried vertex with other vertices. So, we can consider only the cloud server as the adversary (see [5]). Let us define security as follows.
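Definition 2 (Adaptive semantic security).
Let SLP be a secure link prediction scheme consisting of the five algorithms of Definition 1. Let $\mathcal{A}$ be a stateful adversary, $\mathcal{C}$ be a challenger, $\mathcal{S}$ be a stateful simulator and $\mathcal{L}$ be a stateful leakage algorithm. Let us consider two games, $\mathbf{Real}_{\mathcal{A}}(\lambda)$ (see Algo. 6) and $\mathbf{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda)$ (see Algo. 7).
The SLP scheme is said to be adaptively semantically $\mathcal{L}$-secure against chosen-query attacks if, for all PPT adversaries $\mathcal{A}$ making a number of queries polynomial in $\lambda$, there exists a PPT simulator $\mathcal{S}$ such that
$$\left|\Pr[\mathbf{Real}_{\mathcal{A}}(\lambda)=1]-\Pr[\mathbf{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda)=1]\right|\leq negl(\lambda).$$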
4.3 Overview of our proposed schemes
A graph can be encrypted in several ways, e.g., as an adjacency matrix, an adjacency list, or an edge list. Each of them has advantages and disadvantages depending on the application. In our schemes, we define the score as the number of common neighbors, which can be calculated simply by computing the inner product of the rows corresponding to the two vertices. The basic idea is that, given a vertex, to predict the vertex it is most likely to connect with, we compute its scores with all other vertices and sort the vertices by score. However, calculating the inner products and sorting them on the cloud server are expensive operations, and no single scheme provides all of this functionality over encrypted data. So, we use the BGN homomorphic encryption scheme, which enables us to compute inner products on encrypted data. Choosing BGN allows the client to issue not only link prediction queries but also neighbor, vertex degree, and adjacency queries.
Besides the score computation, decrypting the scores and sorting them in encrypted form is non-trivial, keeping in mind that the client has low computation power. So, we propose three schemes that perform score computation as well as sorting on encrypted data with the help of an honest-but-curious proxy server which does not collude with the cloud server. The three schemes show a trade-off between computation cost, communication cost and leakage in computing the vertex most likely to connect with the queried one.
5 Our Proposed Protocol for SLP
In this section, we propose an efficient scheme SLP-I and analyze its security. The scheme is divided into three phases: key generation, data encryption, and query. The client first generates the required secret and public keys. Then it encrypts the adjacency matrix of the graph in a structure and uploads it to the CS. To query for a vertex, the client generates a query trapdoor and sends it to the CS. The CS computes the encrypted scores (i.e., the inner products of the row corresponding to the queried vertex with the rows of the other vertices, on the encrypted graph). The PS decrypts the scores, finds the vertex with the highest score and sends the result to the client.
Key Generation: In this phase, given a security parameter $\lambda$, the client chooses a bilinear map. Then, the permutation key $k_{perm}$ is chosen at random for the PRP $F$. It executes the BGN key generation (Algo. 1) to get the public and private keys. After generating them, a part of the secret key is shared with the PS. This part of the key helps the PS to compute secure comparisons. Key generation is described in Algo. 8.
Data Encryption: In this phase, the client encrypts the adjacency matrix with its private key and uploads the encrypted matrix to the CS (see Algo. 9). Each entry in the adjacency matrix of $G$ is encrypted using Algo. 2. Each row of the resulting encrypted matrix is then stored in the structure. The PRP $F_{k_{perm}}$ gives the position in the structure corresponding to each vertex. Finally, the structure is sent to the CS.
Query: In the query phase, the client sends a query trapdoor to the CS. The CS finds the encrypted scores with respect to the other vertices and sends them to the PS. The PS decrypts them and sends the identifier of the vertex with the highest score to the client.
To query for a vertex $v$, the client first chooses a secret key $s\xleftarrow{\$}\{0,1\}^{\lambda}$ for the PRP $\pi_{s}$ that is not known to the PS (see Algo. 10). Then it finds the position $i^{\prime}=F_{k_{perm}}(id(v))$ and sends $\tau_{v}=(i^{\prime},s)$ as the query trapdoor to the CS.
On receiving $\tau_{v}$, the CS computes the encrypted scores (see Algo. 11) and takes the encrypted row entries corresponding to the queried vertex. Using $\pi_{s}$, the CS shuffles the order of the encrypted scores and of the row entries. Finally, the CS sends the shuffled encrypted scores and the scrambled queried-row entries to the PS.
Since the PS has the shared part of the secret key, it can decrypt all the scores and row entries. It decrypts a row entry first and then decrypts the corresponding score only if the decrypted row entry is 0. Then, it takes an index such that the score at that index is the maximum among the decrypted scores and sends it to the client (see Algo. 12). Finally, the client finds the resulting vertex identifier by inverting the permutations applied to the indices.
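The message flow of the query phase can be sketched on plaintext data as follows. This is an illustrative simulation only: encryption and the storage permutation are omitted, the PRP is replaced by a key-seeded shuffle that is not cryptographically secure, and marking the queried vertex itself as excluded is an assumption of this sketch. All function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[int]]


def keyed_permutation(n: int, key: int) -> List[int]:
    """Stand-in for a PRP on range(n): a key-seeded shuffle (NOT secure)."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm


def cs_answer(adj: Matrix, v: int, shuffle_key: int) -> Tuple[List[int], List[int]]:
    """CS side: compute all scores for v, then shuffle scores and row entries."""
    n = len(adj)
    scores = [sum(a * b for a, b in zip(adj[v], adj[u])) for u in range(n)]
    row = list(adj[v])
    row[v] = 1  # assumption of this sketch: exclude the queried vertex itself
    perm = keyed_permutation(n, shuffle_key)
    # Position j of each shuffled list holds the value for vertex perm[j].
    return [scores[u] for u in perm], [row[u] for u in perm]


def ps_pick(shuffled_scores: List[int], shuffled_row: List[int]) -> Optional[int]:
    """PS side: among positions whose row entry is 0, return the max-score one."""
    candidates = [j for j, bit in enumerate(shuffled_row) if bit == 0]
    if not candidates:
        return None
    return max(candidates, key=lambda j: shuffled_scores[j])


def client_recover(n: int, shuffle_key: int, j: int) -> int:
    """Client side: map the shuffled position back to a vertex index."""
    return keyed_permutation(n, shuffle_key)[j]


def slp1_query(adj: Matrix, v: int, shuffle_key: int) -> Optional[int]:
    scores, row = cs_answer(adj, v, shuffle_key)
    j = ps_pick(scores, row)
    return None if j is None else client_recover(len(adj), shuffle_key, j)


# Hypothetical toy graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
ADJ = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

if __name__ == "__main__":
    print(slp1_query(ADJ, 0, shuffle_key=12345))
```

Because the PS only sees shuffled positions, it learns the multiset of scores of non-adjacent vertices but not which vertex each score belongs to, which matches the leakage described for this scheme.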
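Correctness: For any two rows, let $c$ be the encryption of their score, i.e., the product over all columns of the pairings of the corresponding encrypted entries of the two rows. Raising $c$ to the secret key $q_{1}$ removes the randomization, since $e(g,h)^{q_{1}}=1$, and we get $c^{q_{1}}=\left(e(g,g)^{q_{1}}\right)^{score}$.
Thus, the discrete logarithm of $c^{q_{1}}$ to the base $e(g,g)^{q_{1}}$ gives the score. If the powers of $e(g,g)^{q_{1}}$ are pre-computed, the score can be found in constant time. Otherwise, Pollard’s lambda method [13] can be used to find the discrete logarithm to this base.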
5.1 Security Analysis
In the security definition, a small amount of leakage has been allowed. The adversary knows the algorithms and possesses the encrypted data and queried trapdoors. Only is unknown to it. The leakage function is a pair (associated with and respectively) where and .
Theorem 1.
If BGN is semantically secure and $F$ is a PRP, then SLP-I is $\mathcal{L}$-secure against adaptive chosen-query attacks.
Proof.
The proof is based on the simulation-based security of Definition 2. Given the leakage, the simulator $\mathcal{S}$ generates a randomized structure which simulates the structure produced by the challenger $\mathcal{C}$. Given the leakage of a query, $\mathcal{S}$ returns simulated trapdoors, maintaining system consistency with the future queries of the adversary $\mathcal{A}$. To prove the theorem, it is enough to show that the structures and trapdoors generated by $\mathcal{S}$ and $\mathcal{C}$ are indistinguishable to $\mathcal{A}$.
- (Simulating the structure) $\mathcal{S}$ first generates its own keys. Given the leakage, $\mathcal{S}$ takes an empty structure of the same size as the real one. Finally, it fills the structure with BGN ciphertexts that do not depend on the actual graph.
- (Simulating the query trapdoor) $\mathcal{S}$ first takes an empty dictionary. Given the leakage of a query, $\mathcal{S}$ checks whether the query is already present in the dictionary. If not, it takes a random $\log N$-bit string, stores it in the dictionary for this query, and returns it. If the query has appeared before, it returns the stored string.
Semantic security of BGN guarantees that the simulated and real structures are indistinguishable. Since $F$ is a PRP, the simulated and real trapdoors are indistinguishable. This completes the proof. ∎
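The dictionary-based trapdoor simulation in the proof can be sketched as follows. This is a minimal illustration, assuming the query leakage can be modelled as an opaque handle that is equal for repeated queries and that simulated positions are $\log N$-bit values; the class and parameter names are hypothetical.

```python
import secrets
from typing import Dict, Hashable


class TrapdoorSimulator:
    """Returns a consistent random position for each distinct query handle."""

    def __init__(self, num_vertices: int) -> None:
        # Number of bits needed to index num_vertices positions.
        self.position_bits = max(1, (num_vertices - 1).bit_length())
        self.seen: Dict[Hashable, int] = {}

    def simulate(self, query_handle: Hashable) -> int:
        # A repeated query must receive the same simulated value; otherwise
        # the adversary could tell the simulation from the real PRP output.
        if query_handle not in self.seen:
            self.seen[query_handle] = secrets.randbits(self.position_bits)
        return self.seen[query_handle]


if __name__ == "__main__":
    sim = TrapdoorSimulator(num_vertices=16)
    first = sim.simulate("query-A")
    again = sim.simulate("query-A")
    print(first == again)
```

Following the text, fresh values are drawn independently, so two distinct queries may collide; a simulator that mirrors a permutation exactly would instead sample without replacement.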
6 SLP-II with less leakage
Though the SLP-I scheme is efficient, it has a few disadvantages. Firstly, in SLP-I, the numbers of common nodes between the queried vertex and all other vertices are leaked to the PS, which gives it partial knowledge of the graph. Since the PS is semi-honest, we want to leak as little information as possible. In this section, we propose another scheme, SLP-II, that hides most of the scores from the PS, which results in reduced leakage.
Secondly, there is a high communication cost while processing a link prediction query. Our proposed SLP-II scheme improves on this with a reduced communication cost from the CS to the PS. We achieve these improvements by using extra storage of the size of the matrix and extra bandwidth.
6.1 Proposed Protocol
In SLP-II, after computing the scores, the CS increases the scores of the adjacent vertices by random amounts beyond the maximum possible score, i.e., the degree of the queried vertex. For example, for the encrypted score of an adjacent vertex, a random number greater than or equal to the degree is added to it. Since the lower bound of this random number is known to the client, it can eliminate the scores of adjacent vertices. The PS only decrypts the scores and sends the sorted list to the client. Since the degree is hidden from the PS and known to the client, the client can eliminate the vertices with a score larger than the degree. The algorithms are as follows.
Key Generation: Same as Algo. 8.
Data Encryption: In -, data encryption is similar to Algo. 9. Together with , another matrix is generated by encrypting a matrix B (see Algo. 13). The matrix where , () if and are connected, else . Now, , where notations are usual. Finally, The matrices and are uploaded to the CS together in structures and respectively. Rows of and corresponding to the vertex are stored in and respectively. Note that, entries of are in the group whereas that of are in .
Query: As in the previous scheme, the client sends query trapdoor to the CS for a vertex . Let be the set of encrypted scores computed in step 11 of Algo. 14. In addition, for each , is updated as . Then is sent to the PS. Instead of sending to the PS, is sent to the client, which results the encryption of the degree of the vertex . - query is described in Algo. 14.
The PS decrypts as and sorts them. Then, the PS sends , , , where ’s are in sorted order and ’s are their indices in (see Algo.15).
The client takes the first index such that . The client gets by decrypting . Finally, the client can find the resulting vertex identifier as .
Correctness: For all , the decrypted entry (line 15, Algo. 15) is equal to where is the actual score. Since and is zero when and are connected, we can see that becomes greater than when and are connected. So, the client can eliminate these entries from the list.
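The elimination argument above can be illustrated with a plaintext model. The sketch below is not the encrypted protocol: the function names are hypothetical, the plaintext branch on adjacency stands in for the homomorphic update performed by the CS, the random offset is drawn strictly above the degree (an assumption made here so that the client's filter is unambiguous), and the queried vertex itself is skipped by the client (also an assumption).

```python
import random

def common_neighbor_scores(adj, v):
    """Plaintext common-neighbour counts between v and every vertex."""
    n = len(adj)
    return [sum(adj[v][k] & adj[j][k] for k in range(n)) for j in range(n)]

def cs_mask_scores(adj, v, rng):
    """CS side (plaintext stand-in): lift the scores of v's neighbours above deg(v)."""
    degree = sum(adj[v])
    masked = []
    for j, score in enumerate(common_neighbor_scores(adj, v)):
        if adj[v][j]:
            # Assumption: the offset is strictly larger than the degree.
            score += rng.randint(degree + 1, 2 * degree + 1)
        masked.append(score)
    return masked

def ps_sort(masked):
    """PS side: sort the decrypted values in descending order, keeping indices."""
    return sorted(((s, j) for j, s in enumerate(masked)), reverse=True)

def client_pick(sorted_scores, degree, v):
    """Client side: first entry (other than v) whose value does not exceed deg(v)."""
    for score, j in sorted_scores:
        if j != v and score <= degree:
            return j
    return None
```

Because a common-neighbour count can never exceed the degree of the queried vertex, every value above the degree must belong to an adjacent vertex, which is what lets the client discard those entries without learning anything else from the PS.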
6.2 Security Analysis
- does not leak any more information to the CS than -. The leakage is the same as in -.
Theorem 2.
If is semantically secure and is a PRP, then - is -secure against adaptive chosen-query attacks.
Proof.
As we have seen in the proof of Theorem 1, the simulator is required to simulate the , and . To simulate the structure , given , takes an empty structure of size . Finally, it takes , . The rest of the proof is similar to that of Theorem 1. ∎
7 scheme using garbled circuit (-)
In -, the PS is still able to get the scores of many vertices, and there is a considerable communication cost from the PS to the client. In this section, we propose -, in which the PS does not get any scores. Besides, the proxy needs to send only the result to the client, which reduces the communication overhead for the client.
7.1 Protocol Description
In -, after generating the keys, the client encrypts the adjacency matrix of the graph and uploads it to the CS. At the same time, it shares a part of its secret key with the PS. In the query phase, the CS computes the encrypted scores on receiving the query trapdoor from the client. However, it masks each score with a random number of its own choosing before sending them to the PS. The PS decrypts the masked scores and evaluates a garbled circuit, constructed by the CS (as described in Section 7.2), to find the vertex with the maximum score. Finally, the PS returns the index corresponding to the evaluated identifier of the vertex with the maximum score.
Key Generation: Same as Algo. 8.
Data Encryption: Same as Algo. 9.
Query: To query for a vertex , the client generates a query trapdoor (see Algo. 10) and sends it to the CS. On receiving , the CS computes the encrypted scores . It then considers the row corresponding to the queried vertex. Then, with random and , it computes, and , for all . If the encrypted scores are sent directly, the PS can decrypt the scores directly as it has the partial secret key . That is why the CS chooses random s and s to mask them.
To find the vertex with highest score, the CS builds a garbled circuit (see Fig. 2) as described in Section 7.2. The CS sends and together with a garbled circuit . The CS-side algorithm is described in Algo. 16.
The PS receives and . , let and be the decryption of and respectively (see Algo. 17). Then, the PS evaluates . During evaluation, the PS gives all s and s and corresponding indices s as input where . The CS gives s and s where , (see Section 7.2).
From , the PS gets an index which is sent to the client. Finally, the client finds the resulting vertex identifier as .
7.2 Maximum Garbled Circuit (MGC)
We want minimum information to be leaked to both servers. Without knowledge of the values, it is hard to find the maximum value because doing so is an iterative comparison process and requires several rounds of communication if we use only secure comparison. However, building a maximum garbled circuit allows the cloud and proxy servers to find the maximum without either of them learning the values.
Kolesnikov et al. [7] first presented a garbled circuit that computes the minimum of a set of distances. In their scheme, one party holds a set of points and the second party holds a single point. They used homomorphic encryption to compute the distances from the single point to the set of points and sort them using the garbled circuit. However, the original values of the points were known to the parties holding them. In this paper, we introduce a novel maximum garbled circuit () by which one party computes the maximum of a set of numbers, without knowing their values, with the help of another party and without leaking them to it. Given a set of scores outputs only the identity of the vertex with the maximum score.
Computing vertex with max score: In -, the CS computes a garbled circuit (an example is shown in Fig. 2) for each query to find the identifier of the vertex with the maximum score. Before computing , in -, the PS gets and (Algo. 17). The CS keeps and which are used as input in . During construction, it arranges the indices in such a way that outputs only the index of the resulting maximum score.
is required to find the index corresponding to the vertex with the maximum score. The circuit is constructed layer by layer. The idea is to compare a pair of scores at a time in each layer and pass the result to the next layer until the resulting vertex is found. If , has layers starting from [math] to . In the 0th layer, there are blocks, and the rest of the blocks are blocks. The blocks form the first layer and compute the scores securely without revealing them. Thus, each block corresponds to some vertex. computes the maximum score and the corresponding index without revealing them. An example of a that computes the maximum, assuming and using blocks and blocks, is shown in Fig. 2. for any is constructed similarly.
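The layer-by-layer comparison can be pictured in the clear as a pairwise tournament. The following is only a plaintext sketch of that control flow, not a garbled circuit; the function name and the tie-breaking rule (the lower index wins) are assumptions made for illustration.

```python
def tournament_argmax(scores):
    """Return the index of the largest score using pairwise, layer-by-layer comparison."""
    if not scores:
        raise ValueError("scores must be non-empty")
    layer = list(enumerate(scores))  # (index, score) pairs entering the first layer
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            left, right = layer[i], layer[i + 1]
            # Assumption: on a tie the left (lower-index) candidate advances.
            next_layer.append(left if left[1] >= right[1] else right)
        if len(layer) % 2 == 1:
            next_layer.append(layer[-1])  # an unpaired candidate advances unchallenged
        layer = next_layer
    return layer[0][0]
```

Each pass halves the number of candidates, so the number of layers grows logarithmically with the number of scores, and only the winning index leaves the final layer.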
Max blocks: There are four types of blocks used to compute the maximum: , , and (see Fig. 3). The blocks differ in order to handle extreme cases. These blocks use and blocks (see Section 3.3).
NSS blocks: Each block has four inputs , , and . The inputs and come from the CS while and come from the PS. It first subtracts from using block to get the score . Then, using block, it finds the flag bit that tells whether the vertex is adjacent to the queried vertex. block (see Fig. 4(b)) is used in block, as shown in Fig. 4(a), to set the score to zero if the vertex is adjacent and to keep the score unchanged otherwise.
Elimination of scores for adjacent vertices: It can be seen from encryption that , where is the actual score corresponding to th row and randomizes the score. Each bit is taken to indicate whether is odd or even. On the other hand, each bit indicates whether the decrypted is odd or even. Inequality of and indicates that the vertex corresponding to th row is connected with the queried vertex. In that case, we consider the score .
The block , in Fig. 4(c), outputs if they are equal, else outputs [math]. Since gives the score, block (see Section 3.3) is used in to compute the scores, where the PS gives and the CS gives . It can be seen that subtracts only one bit, which is very efficient.
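To make the data flow of such a block concrete, here is a plaintext model of the three steps described above: unmask the score by subtraction, derive the adjacency flag from two parity bits, and zero the score when the flag is set. It is a sketch under assumptions: the masks are taken to be additive, the helper `make_inputs` and all parameter names are hypothetical, and nothing here is garbled or encrypted.

```python
def nss_block(r_cs, parity_cs, masked_ps, parity_ps):
    """Plaintext model of the block: unmask the score, zero it if the vertex is adjacent."""
    score = masked_ps - r_cs           # subtraction step: remove the CS's mask
    adjacent = parity_cs ^ parity_ps   # flag is 1 exactly when the parity bits differ
    return score * (1 - adjacent)      # keep the score only when the flag is 0

def make_inputs(score, adjacency_bit, r_score, r_entry):
    """Hypothetical helper producing the four inputs under additive masking."""
    masked_score = score + r_score          # value the PS would obtain after decryption
    masked_entry = adjacency_bit + r_entry  # masked adjacency entry seen by the PS
    return r_score, r_entry & 1, masked_score, masked_entry & 1
```

Under this reading, the parity of the mask and the parity of the masked entry disagree exactly when the adjacency bit is 1, so a single-bit comparison is enough to suppress the score of an adjacent vertex.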
7.3 Security Analysis
In -, though the PS has almost no leakage, the CS has a little more leakage than in -. This extra leakage occurs when it interacts with the PS through the OT protocol to provide the encodings corresponding to the inputs of the PS. Since OT is secure and does not leak any meaningful information, we can ignore this leakage. In -, the leakage is the same as in -.
Theorem 3.
If is semantically secure and is a PRP, then - is -secure against adaptive chosen-query attacks.
Proof.
The proof is the same as that of Theorem 1. ∎
7.4 Basic Queries
All three schemes support basic queries, which include the neighbor query, the vertex degree query and the adjacency query.
Neighbor query: Given a vertex, a neighbor query returns the set of vertices adjacent to it. Note that, since we have the encrypted adjacency matrix of the graph, it is enough for the client to get the decrypted row corresponding to the queried vertex.
To query the neighbors of a vertex , the client generates as in Algo. 10 and sends it to the CS. The CS permutes the entries of row and sends the permuted row to the PS. The PS decrypts it and sends the decrypted vector to the client. The client can compute the inverse permutation for the positions whose entries are 1. Here, the CS learns only the queried vertex and the PS learns the degree of the vertex.
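The permute-then-invert step of the neighbor query can be sketched in the clear as follows. This is an illustration only: the permutation is passed around as an explicit list, standing in for the keyed permutation derived from the trapdoor, and both function names are hypothetical.

```python
def permute_row(row, perm):
    """CS side: position i of the output holds the entry row[perm[i]]."""
    return [row[p] for p in perm]

def recover_neighbors(permuted_row, perm):
    """Client side: map every position holding a 1 back to its original vertex index."""
    return sorted(perm[i] for i, bit in enumerate(permuted_row) if bit == 1)
```

The PS sees how many entries are 1, which is the degree, but not which vertices they correspond to, since it does not know the permutation.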
Vertex degree query: To query degree of a vertex , similarly, the client sends to the CS. The CS computes encrypted degree as and sends to the proxy. The proxy decrypts and sends the result to the client. is not needed as permuting the elements of some row is not required.
Here, the degree is leaked to the PS, which can be prevented by randomizing the result. The CS can randomize the encrypted degree and send the randomization secret to the client. The client can then obtain the degree by subtracting the randomization from the result returned by the PS.
Alternatively, this leakage can be avoided without randomizing the encrypted degree if the client performs the decryption itself.
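The randomization of the degree amounts to additive blinding. A minimal plaintext sketch follows, assuming the mask is added modulo some public bound; the modulus, the parameter names and both function names are hypothetical, and in the actual schemes the addition would be carried out on ciphertexts.

```python
def cs_mask_degree(degree, rho, modulus):
    """CS side (plaintext stand-in): blind the degree with the secret offset rho."""
    return (degree + rho) % modulus

def client_unmask_degree(masked, rho, modulus):
    """Client side: remove the offset received from the CS."""
    return (masked - rho) % modulus
```

The PS only ever sees the blinded value, while the client, who receives the offset from the CS, recovers the true degree with one subtraction.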
Adjacency Query: Given two vertices, an adjacency query (edge query) tells whether there is an edge between them. If the client wants to perform an adjacency query for the pair of vertices and , the client sends (as generated in Algo. 10) to the CS. The CS returns . The client can either get the randomized result from the PS or decrypt by itself.
8 Performance Analysis
In this section, we discuss the efficiency of our proposed schemes. The efficiency is measured in terms of computation and communication complexities together with storage requirements and allowed leakage. A summary is given in Table 1. Since there is no prior work on secure link prediction, we have not compared the complexities of our schemes with those of any similar encrypted computations.
8.1 Complexity analysis
Let the graph be and . Let the encryption output a -bit string for every encryption. We describe the complexities below.
Leakage Comparison: As we see in Table 1, each scheme leaks the same amount of information to the CS, namely the number of vertices of the graph and the query trapdoors. However, none of the schemes leaks information about the edges in the graph to the CS. In -, since the PS has the power to decrypt the scores, it gets to know . However, - reveals only a subset of and - manages to hide all scores from the PS. - cannot hide scores from the PS, which results in maximum leakage to the PS.
Storage Requirement: One of the major goals of a secure link prediction scheme is that the client should require very little storage. All our designed schemes have a very low storage requirement for the client. The client has to store only a key, which is of bits. For all schemes, the PS stores only a part of the secret key, which is of bits.
In -, the CS is required to store bits for the structure , whereas the PS is required to store only the secret key. While the leakage is reduced in -, the CS storage is doubled. However, - requires the same amount of storage as -.
Computation Complexity: In all schemes, the client computes encryptions to encrypt , while - additionally computes encryptions to encrypt . To compute each of encrypted scores, the CS requires bilinear map () computations and multiplications.
Additionally, - randomizes the encrypted entries corresponding to the row that has been queried. This requires exponentiations and multiplications. - randomizes the encrypted scores, which requires multiplications, and computes the encrypted degree of the queried vertex, which requires multiplications. Apart from the computation of the encrypted scores, in -, the CS computes a garbled circuit .
In all schemes, the PS decrypts scores. Each decryption requires multiplications on average. To find the vertex with the maximum score, in - and -, the PS compares numbers. The encrypted entries are decrypted by the PS in - and -. In addition, the PS evaluates the garbled circuit in -.
Communication Complexity: To upload the encrypted matrices, - and - require bits and - requires bits of communication. To query, the client sends only the trapdoor, which is of size bits (approx.).
The CS sends entries to the PS, in case of - and -. For -, the CS sends only entries. Each of these entries is of bits. In addition, - sends the garbled circuit . PS to CS communication happens only when the PS evaluates . For - and -, the PS sends only which is of bits to the client. However, the PS sends bits to the client.
Complexity for GC Computation: It can be observed that -bit , -bit , -bit , -bit and -bit blocks consist of ( XOR-gates and AND-gates), ( XOR-gates and AND-gate), ( AND-gates), ( XOR-gates and AND-gates) and ( XOR-gates and AND-gates) respectively. Thus, -bit and -bit blocks consist of ( XOR-gates and AND-gates) and ( XOR-gates and AND-gates) respectively.
In our designed garbled circuit , there are blocks and blocks. Thus, requires XOR-gates and AND-gates. However, the PS receives bits through OT for the first layer.
Thus, is the size of XOR-gates and AND-gates, whereas and are computational cost to construct and evaluate.
9 Experimental Evaluation
In this section, the experimental evaluations of our designed schemes, - and -, are presented. In our experiment, we have used a single machine for both the client and the server. All data is assumed to reside in main memory. The machine has an Intel Core i7-4770 CPU with 8 cores operating at 3.40 GHz. It is equipped with 8 GB of RAM and runs the Ubuntu 16.04 LTS 64-bit operating system. The open source PBC [17] library has been used in our implementation to support . The code is in the repository [18].
9.1 Datasets
For our experiment, we have used real-world datasets. We have taken the datasets from the SNAP datasets [9]. The collection consists of various kinds of real-world network data which includes social networks, citation networks, collaboration networks, web graphs etc.
For our experiment, we have considered the following undirected graph datasets: bitcoin-alpha, ego-Facebook, Email-Enron, email-Eu-core and Wiki-Vote. The numbers of nodes and edges of the graphs are shown in Table 2.
Instead of the full graphs, their subgraphs have been considered. For each subgraph, a fixed number of vertices, taken from the beginning of the dataset, and the edges joining them have been chosen. For example, for 1000, vertices with identifier have been taken for the subgraph.
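The subgraph extraction can be sketched as an induced-subgraph filter over an edge list. The sketch assumes integer vertex identifiers and reads "the first vertices" as those with identifier below the chosen size; the exact bound used in the experiments is not recoverable from the text, and the function name is hypothetical.

```python
def induced_prefix_subgraph(edges, n):
    """Keep only the edges whose two endpoints both have an identifier below n."""
    return [(u, v) for u, v in edges if u < n and v < n]
```

For instance, with the edge list `[(0, 1), (1, 5), (2, 3), (4, 9)]` and `n = 4`, only the edges `(0, 1)` and `(2, 3)` remain, since the other two touch a vertex outside the chosen prefix.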
9.2 Experiment Results
In our experiment, five datasets have been taken. The experiment has been done for each dataset on extracted subgraphs with 50 to 1000 vertices, in increments of 50. The number of edges in the subgraphs is shown in Fig. 5. For the pairing, prime pairs of 128, 256 and 512 bits are taken.
In our proposed schemes, the most expensive operation for the client is encrypting the matrix (). For the cloud and the proxy, score computing () and finding maximum vertex () are the most expensive operations respectively. Hence, throughout this section, we have discussed mainly these three operations.
As we have seen, encrypting each entry of the adjacency matrix is the main operation of encryption in the proposed protocols. Hence, the number of edges does not affect the encryption time for either - or -: irrespective of the SLP scheme, the number of operations is independent of the number of edges.
Similarly, the time required by the cloud to compute the scores is independent of the number of edges and depends on the number of entries in the adjacency matrix, i.e., . The time taken for each of the operations is shown in Fig. 6. In the figure, we have compared the time for both - and - using 128-bit primes.
However, the time taken by the proxy to decrypt the scores depends on the number of vertices. In -, the proxy has to decrypt entries in as well as scores in , whereas in -, it decrypts only scores in . So the proxy takes more time in - than in -. This can be observed in Fig. 6(c).
For a query, in -, the proxy decrypts scores only for the vertices that are not adjacent to the queried vertex. So, only in this case, the computational time depends on the number of edges in the graph. As the edge density of a graph increases, the computational time tends to decrease. In Fig. 7 we have compared the computational time taken by the proxy in - for different datasets.
In the above figures, we have considered only 128-bit primes. It can be observed from the experiment that the computational time depends on the security parameter. As we increase the size of the primes, the computational time grows exponentially. We have compared the change in computational time for the client, cloud and proxy for both - and - (see Fig. 8 and Fig. 9 respectively). In practice, however, the security parameter is fixed, and keeping it as low as acceptable improves the performance.
9.3 Estimation of computational cost in -
In the previous section, we have shown the experimental results for - and -. In this section, we estimate the computational cost for -. The encryption algorithm of - is the same as that of -, so both require the same amount of time for encryption on the same dataset. To estimate the query time, we have considered a random graph with vertices.
Query Time: In -, the cloud computes the encrypted scores and the proxy decrypts the scores as well as the random numbers. The number of decryptions in each group is the same as in -. However, - requires an extra garbled circuit computation. For this, OT for 128-bit security of ECC is required, which takes ms = s approx. ([2, 15]). In addition to that, the PS evaluates the GC with XOR-gates and AND-gates. Assuming that the encryption used in each GC circuit is AES (128-bit), GC evaluation requires 2 AES decryptions and the CS requires 8 encryptions. As we see in [1], AES requires 0.57 cycles per byte. Thus, for evaluation on a single-core processor, the PS requires (2*(1286000*256/8)*0.57) cycles = 46913280 cycles that takes s. Similarly, the CS requires 0.078s to construct the GC.
The estimated costs are measured with respect to a single-core 2.5 GHz processor. However, in practice, the CS provides a large number of multi-core processors. Since all the computations can be performed in parallel, the query cost can be reduced dramatically. Each of the above-mentioned costs can be improved to s with processors and cost is .
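The cycle count in the estimate above can be reproduced with a few lines of arithmetic. In the sketch below the constants are taken from the formula in the text (1286000, 256 bits, 0.57 cycles per byte, 2.5 GHz); the variable names, and the reading of 1286000 as the number of gates that need AES work, are assumptions.

```python
GATE_COUNT = 1_286_000    # figure from the text's formula (interpretation assumed)
LABEL_BITS = 256          # bits processed per AES operation in the formula
CYCLES_PER_BYTE = 0.57    # AES throughput quoted from [1]
CLOCK_HZ = 2.5e9          # single-core clock assumed in the text

def aes_cost(ops_per_gate):
    """Return (cycles, seconds) for the given number of AES operations per gate."""
    bytes_per_op = GATE_COUNT * LABEL_BITS // 8
    cycles = ops_per_gate * bytes_per_op * CYCLES_PER_BYTE
    return cycles, cycles / CLOCK_HZ

eval_cycles, eval_seconds = aes_cost(2)     # PS evaluation: 2 operations
build_cycles, build_seconds = aes_cost(8)   # CS construction: 8 operations
```

With these inputs the evaluation comes to about 46.9 million cycles, or roughly 0.019 s at 2.5 GHz. Applying the same formula with 8 operations gives roughly 0.075 s for construction, slightly below the 0.078 s quoted above, so the authors presumably used constants that this formula alone does not capture.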
10 Introduction to
Let us define another variant of the secure link prediction problem, . Instead of returning the vertex with the highest score, an returns the indices of the top-scored vertices.
Let a graph be given. Then, the top- Link Prediction Problem states that, given a vertex , it returns a set of vertices such that is among the top- elements in . A top- link prediction scheme is said to be secure, i.e., a secure top- link prediction scheme (), if the servers do not get any meaningful information about from its encryption or from a sequence of queries.
Our proposed schemes, - and -, can be extended to support queries. In -, the only change is that, instead of returning only the index of the vertex with the highest score, the proxy has to return the indices of the top- highest scores to the client.
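On the proxy side, the top-k variant only changes the selection step. A plaintext sketch of that step is given below, assuming the proxy already holds the decrypted scores; the function name and the tie-breaking rule (lower index first) are assumptions.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest scores, highest first, lower index on ties."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return order[:k]
```

Setting `k = 1` recovers the original single-result query, so the same routine could serve both variants.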
11 Conclusion
In this paper, we have introduced the secure link prediction problem and discussed its security. We have presented three constructions of SLP. The first proposed scheme, -, has the least computational time but the maximum leakage to the proxy. The second one, -, reduces the leakage by randomizing scores. However, it suffers from a high communication cost from the proxy to the client. The third scheme, -, has minimum leakage to the proxy. Though the garbled circuit helps to reduce leakage, it increases the communication and computational cost of the cloud and the proxy servers.
Performance analysis shows that they are practical. We have implemented prototypes of the first two schemes and measured their performance through experiments on different real-life datasets. We have also estimated the cost for -. In the future, we want to build a library that supports multiple queries, including the neighbor query, edge query, degree query and link prediction query.
It is to be noted that the cost of computation without privacy and security is far lower. The performance has degraded because we have added security; security comes at the cost of performance.
Throughout the paper, we have considered unweighted graphs. As future work, the schemes can be extended to weighted graphs. Moreover, we have initiated the study of the secure link prediction problem and considered only common neighbors as the score metric. As future work, we will consider other metrics such as Jaccard’s coefficient, Adamic/Adar, preferential attachment and Katzβ, and compare the efficiency of each.
Acknowledgments
We thank Gagandeep Singh and Sameep Mehta of IBM India research for their initial valuable comments on this work.
- [1] https://www.cryptopp.com/benchmarks.html.
- [2] G. Asharov, Y. Lindell, T. Schneider and M. Zohner, More efficient oblivious transfer and extensions for faster secure computation, in 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS’13, Berlin, Germany, November 4-8, 2013, 2013, 535–548.
- [3] L. Backstrom, C. Dwork and J. M. Kleinberg, Wherefore art thou R3579X?: anonymized social networks, hidden patterns, and structural steganography, Commun. ACM, 54 (2011), 133–141.
- [4] D. Boneh, E. Goh and K. Nissim, Evaluating 2-DNF formulas on ciphertexts, in Theory of Cryptography, Second Theory of Cryptography Conference, TCC 2005, Cambridge, MA, USA, February 10-12, 2005, Proceedings, 2005, 325–341.
- [5] C. Bösch, A. Peter, B. Leenders, H. W. Lim, Q. Tang, H. Wang, P. H. Hartel and W. Jonker, Distributed searchable symmetric encryption, in 2014 Twelfth Annual International Conference on Privacy, Security and Trust, Toronto, ON, Canada, July 23-24, 2014, 2014, 330–337.
- [6] M. Chase and S. Kamara, Structured encryption and controlled disclosure, in Advances in Cryptology - ASIACRYPT 2010 - 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, December 5-9, 2010. Proceedings, 2010, 577–594.
- [7] V. Kolesnikov, A. Sadeghi and T. Schneider, Improved garbled circuit building blocks and applications to auctions and computing minima, in Cryptology and Network Security, 8th International Conference, CANS 2009, Kanazawa, Japan, December 12-14, 2009. Proceedings, 2009, 1–20.
- [8] V. Kolesnikov and T. Schneider, Improved garbled circuit: Free XOR gates and applications, in Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part II - Track B: Logic, Semantics, and Theory of Programming & Track C: Security and Cryptography Foundations, 2008, 486–498.
