Achieving Data Truthfulness and Privacy Preservation in Data Markets

Chaoyue Niu; Zhenzhe Zheng; Fan Wu; Xiaofeng Gao; and Guihai Chen

arXiv:1812.03280·cs.DB·December 11, 2018


TL;DR

This paper introduces TPDM, a system that ensures data truthfulness and privacy preservation in data markets through cryptographic techniques, enabling efficient verification and confidentiality in large-scale data trading.

Contribution

The paper presents TPDM, a novel framework combining data truthfulness verification with privacy preservation using homomorphic encryption and identity-based signatures.

Findings

01. TPDM effectively verifies data truthfulness and privacy in large-scale markets.

02. It maintains low computational and communication overheads.

03. Performance evaluated on real-world datasets shows promising results.



Tables

Table I: Time Overheads of Key Operations.

Curve	$\mathcal{R}(\mathbb{G}_{1})$	Pairing	MapToPoint	Exponentiation
SS512	512 bits	0.999ms	3.203ms	1.179ms
MNT159	160 bits	3.102ms	0.029ms	0.413ms

Table II: Computation Overhead of Identity-Based Signature Scheme.

Setting	Pseudo Identity Generation (Preparation)	Secret Key Generation (Preparation)	Signing (Operation)
SS512	4.698ms (39.40%)	6.023ms (50.53%)	1.201ms (10.07%)
MNT159	1.958ms (57.33%)	1.028ms (30.10%)	0.429ms (12.57%)

Equations

$\hat{e}(X^{a}, Z^{b}) = \hat{e}(X, Z)^{ab}.$

$\hat{e}(XY, Z) = \hat{e}(X, Z) \cdot \hat{e}(Y, Z).$

$P_{0} = {g_{1}}^{s_{1}},\ P_{1} = {g_{2}}^{s_{1}},\ \text{and}\ P_{2} = {g_{2}}^{s_{2}}$

$\{\hat{e}, \mathbb{G}_{1}, \mathbb{G}_{2}, \mathbb{G}_{T}, q, g_{1}, g_{2}, P_{0}, P_{1}, P_{2}, \mathcal{PK}, E(\cdot)\}$

$\emph{PID}_{i}$

$SK_{i}$

$D_{i} = E(U_{i}^{k})\big|_{k \in \mathbb{K} \subseteq \mathbb{Z}^{+}},$

$\sigma_{i} = SK_{i}^{1} \cdot (SK_{i}^{2})^{h(D_{i})},$

$\hat{e}\left(\prod_{i=1}^{n} \sigma_{i}, g_{2}\right) =$

$D_{0} = E(\omega_{i} V^{\bar{k}_{i}})\big|_{\bar{k}_{i} \in \bar{\mathbb{K}} \subseteq \mathbb{Z}^{+},\ i \in [1, |\bar{\mathbb{K}}|]},$

$\gamma = f(V, U_{c_{1}}, U_{c_{2}}, \dots, U_{c_{m}})$

$R = E(\gamma) = F(D_{0}, D_{c_{1}}, D_{c_{2}}, \dots, D_{c_{m}}).$

$\hat{e}(\sigma, g_{2}) =$

$\emph{PID}_{i}^{2} \odot (\emph{PID}_{i}^{1})^{s_{1}} = \emph{RID}_{i} \odot {P_{0}}^{r} \odot {g_{1}}^{s_{1} \cdot r} = \emph{RID}_{i}.$

$\epsilon_{1} = \Pr[\sigma_{i_{1}} \text{ and } \sigma_{i_{2}} \text{ are valid}].$

$SK_{i}^{1} \cdot (SK_{i}^{2})^{h(M_{i_{1}})} \cdot SK_{i}^{1} \cdot (SK_{i}^{2})^{h(M_{i_{2}})} =$

$\epsilon_{2} = \Pr[B \text{ succeeds}] = \frac{2\epsilon_{1}}{n(n-1)}.$

$\hat{e}\left(\prod_{i=1}^{n} \sigma_{i}, g_{2}\right) =$

$\epsilon_{3} = \Pr[\sigma_{i}^{*} \neq \sigma_{i},\ \sigma_{i}^{*} \text{ passes verification, and } \sigma_{i} \text{ fails}],$

$\begin{cases} \hat{e}(\sigma_{i}^{*}, g_{2}) = \hat{e}(\emph{PID}_{i}^{1}, P_{1})\, \hat{e}(H(\emph{PID}_{i}^{2})^{h(D_{i})}, P_{2}), \\ \hat{e}(\sigma_{i}, g_{2}) \neq \hat{e}(\emph{PID}_{i}^{1}, P_{1})\, \hat{e}(H(\emph{PID}_{i}^{2})^{h(D_{i})}, P_{2}), \end{cases}$

$D_{i} = (E(u_{ij}), E(u_{ij}^{2}))\big|_{j \in [1, \beta]}.$

$\sigma_{i} = SK_{i}^{1} \cdot (SK_{i}^{2})^{h(D_{i})},$

$D_{0} = (E(v_{j}^{2}),\ E(v_{j})^{-2} = E(-2 v_{j}))\big|_{j \in [1, \beta]}.$

$(C_{ij}^{1}, C_{ij}^{2}, C_{ij}^{3})$

$(C_{0j}^{1}, C_{0j}^{2}, C_{0j}^{3})$

$R_{ij}$

$= E(v_{j}^{2} + u_{ij}(-2 v_{j}) + u_{ij}^{2})$


Full text

Achieving Data Truthfulness and Privacy Preservation in Data Markets

Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and Guihai Chen

The authors are with the Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200000, China.

E-mail: {rvincency, zhengzhenzhe220}@gmail.com; {fwu, gao-xf, gchen}@cs.sjtu.edu.cn

Manuscript received 11 Aug. 2017; revised 3 Dec. 2017; accepted 19 Mar. 2018. Date of publication XXXX 2018; date of current version 13 May 2018.

Recommended for acceptance by J. Chen.

For information on obtaining reprints of this article, please send e-mail to [email protected], and reference the Digital Object Identifier below.

Digital Object Identifier no. 10.1109/TKDE.2018.2822727

Abstract

As a significant business paradigm, many online information platforms have emerged to satisfy society’s needs for person-specific data, where a service provider collects raw data from data contributors, and then offers value-added data services to data consumers. However, in the data trading layer, the data consumers face a pressing problem, i.e., how to verify whether the service provider has truthfully collected and processed data? Furthermore, the data contributors are usually unwilling to reveal their sensitive personal data and real identities to the data consumers. In this paper, we propose TPDM, which efficiently integrates data Truthfulness and Privacy preservation in Data Markets. TPDM is structured internally in an Encrypt-then-Sign fashion, using partially homomorphic encryption and identity-based signature. It simultaneously facilitates batch verification, data processing, and outcome verification, while maintaining identity preservation and data confidentiality. We also instantiate TPDM with a profile matching service and a distribution fitting service, and extensively evaluate their performances on Yahoo! Music ratings dataset and 2009 RECS dataset, respectively. Our analysis and evaluation results reveal that TPDM achieves several desirable properties, while incurring low computation and communication overheads when supporting large-scale data markets.

Index Terms:

Data markets, data truthfulness, privacy preservation

1 Introduction

In the era of big data, society has developed an insatiable appetite for sharing personal data. Realizing the potential of personal data’s economic value in decision making and user experience enhancement, several open information platforms have emerged to enable person-specific data to be exchanged on the Internet [1, 2, 3, 4, 5]. For example, Gnip, which is Twitter’s enterprise API platform, collects social media data from Twitter users, mines deep insights into customized audiences, and provides data analysis solutions to more than $95\%$ of the Fortune 500 [1].

However, there exists a critical security problem in these market-based platforms, i.e., it is difficult to guarantee truthfulness in terms of data collection and data processing, especially when the privacy of the data contributors needs to be preserved. Let’s examine the role of a pollster in the presidential election as follows. As a reliable source of intelligence, the Gallup Poll [6] uses impeccable data to assist presidential candidates in identifying and monitoring economic and behavioral indicators. In this scenario, simultaneously ensuring data truthfulness and preserving privacy requires the Gallup Poll to convince the presidential candidates that those indicators are derived from live interviews without leaking any interviewee’s real identity (e.g., social security number) or the content of her interview. If the raw data sets for drawing these indicators are mixed with even a small number of bogus or synthetic samples, they will exert a bad influence on the final election result.

Ensuring data truthfulness and protecting the privacy of data contributors are both important to the long-term healthy development of data markets. On one hand, the ultimate goal of the service provider in a data market is to maximize her profit. Therefore, in order to minimize the expenditure for data acquisition, an opportunistic way for the service provider is to mingle some bogus or synthetic data into the raw data set. Moreover, to reduce operation cost, a cunning service provider may provide data services based on a subset of the whole raw data set, or even return a fake result without processing the data from designated sources. If such speculative and illegal behaviors cannot be identified and prohibited, they will cause heavy losses to data consumers, and thus destabilize the data market. On the other hand, while unleashing the power of personal data, it is the bottom line of every business to respect the privacy of data contributors. The debacle that followed AOL’s public release of “anonymized” search records of its customers highlights the potential risk to individuals in sharing personal data with private companies [7]. Besides, according to the survey report of 2016 TRUSTe/NCSA Consumer Privacy Infographic - US Edition [8], $89\%$ of consumers say they avoid companies that do not respect privacy. Therefore, the content of raw data should not be disclosed to the data consumers to guarantee data confidentiality, even if the real identities of the data contributors are hidden.

To integrate data truthfulness and privacy preservation in a practical data market, there are four major challenges. The first and the thorniest design challenge is that verifying the truthfulness of data collection and preserving privacy seem to be contradictory objectives. Ensuring the truthfulness of data collection allows the data consumers to verify the validity of data contributors’ identities and the content of raw data, whereas privacy preservation tends to prevent them from learning these confidential contents. Specifically, the property of non-repudiation in classical digital signature schemes implies that the signature is unforgeable, and any third party is able to verify the authenticity of a data submitter using her public key and the corresponding digital certificate, i.e., the truthfulness of data collection in our model. However, the verification in digital signature schemes requires the knowledge of raw data, and can easily leak a data contributor’s real identity [9]. Regarding a message authentication code (MAC), each pair of a data contributor and a data consumer needs to agree on a shared secret key, which is impractical in data markets.

Another challenge comes from data processing, which makes verifying the truthfulness of data collection even harder. Nowadays, more and more data markets provide data services rather than directly offering raw data. The following reasons account for this trend: 1) The data contributors have severe privacy concerns [8], which the service-based trading mode alleviates by hiding the sensitive raw data; 2) For the service provider, comprehensive and insightful data services can bring in more profits [10]; 3) For the data consumers, data copyright infringement [11] is serious. However, such a data trading mode differs from most conventional data sharing scenarios, e.g., data publishing [12]. Besides, the data services, as the results of data processing, may no longer be semantically consistent with the raw data [13], which makes it hard for the data consumers to believe the truthfulness of data collection. In addition, the digital signatures on raw data become invalid for the data services, which discourages the data consumers from doing verification as mentioned above. Moreover, although data provenance [14] helps to determine the derivation histories of data processing results, it cannot guarantee the truthfulness of data collection. Knowledge provenance [15], an enhanced version of data provenance, tackles this deficiency, but it breaks the property of identity preservation.

The third challenge lies in how to guarantee the truthfulness of data processing, under the information asymmetry between the data consumers and the service provider due to data confidentiality. In particular, to ensure data confidentiality against the data consumers, the service provider can employ a conventional symmetric/asymmetric cryptosystem, and let the data contributors encrypt their raw data. Unfortunately, a hidden problem that arises is that the data consumers cannot verify the correctness and completeness of returned data services. Even worse, some greedy service providers may exploit this vulnerability to reduce operation cost during the execution of data processing, e.g., they might return an incomplete data service without processing the whole data set, or even return an outright fake result without processing the data from designated data sources.

Last but not least, the fourth design challenge is the efficiency requirement of data markets, especially for data acquisition, i.e., the service provider should be able to collect data from a large number of data contributors with low latency. Due to the timeliness of some kinds of person-specific data, the service provider has to periodically collect fresh raw data to meet the diverse demands of high-quality data services. For example, 25 billion data collection activities take place on Gnip every day [1]. Meanwhile, the service provider needs to verify data authentication and data integrity. One basic approach is to let each data contributor sign her raw data. However, classical digital signature schemes, which verify the received signatures one after another, may fail to satisfy the stringent time requirement of data markets. Furthermore, the maintenance of digital certificates under the traditional Public Key Infrastructure (PKI) also incurs significant communication overhead. Under such circumstances, verifying a large number of signatures sequentially certainly becomes the processing bottleneck at the service provider.

In this paper, by jointly considering the above four challenges, we propose TPDM, which achieves both data Truthfulness and Privacy preservation in Data Markets. TPDM first exploits partially homomorphic encryption to construct a ciphertext space, which enables the service provider to launch data services and the data consumers to verify the truthfulness of data processing, while maintaining data confidentiality. In contrast to classical digital signature schemes, which are operated over plaintexts, our new identity-based signature scheme is conducted in the ciphertext space. Furthermore, each data contributor’s signature is derived from her real identity, and is unforgeable against the service provider or other external attackers. This appealing property can convince the data consumers that the service provider has truthfully collected data. To reduce the latency caused by verifying a bulk of signatures, we propose a two-layer batch verification scheme, which is built on the bilinear property of admissible pairing. Finally, TPDM realizes identity preservation and revocability by carefully adopting ElGamal encryption and introducing a semi-honest registration center.

We summarize our key contributions as follows.

$\bullet$ To the best of our knowledge, TPDM is the first secure mechanism for data markets achieving both data truthfulness and privacy preservation.

$\bullet$ TPDM is structured internally in an Encrypt-then-Sign fashion, using partially homomorphic encryption and identity-based signature. It forces the service provider to truthfully collect and process real data. Besides, TPDM incorporates a two-layer batch verification scheme with an efficient outcome verification scheme, which can drastically reduce computation overhead.

$\bullet$ We instructively instantiate TPDM with two kinds of practical data services, namely profile matching and distribution fitting. Besides, we implement these two concrete data markets, and extensively evaluate their performances on Yahoo! Music ratings dataset and 2009 RECS dataset. Our analysis and evaluation results reveal that TPDM achieves good effectiveness and efficiency in large-scale data markets. Specifically, for profile matching, when supporting as many as 1 million data contributors in one session of data acquisition, the computation and communication overheads at the service provider are 0.930s and 0.235KB per matching with 10 attributes in each profile, respectively. In addition, the outcome verification’s overhead per matching is only $1.17\%$ of the original similarity evaluation’s cost.

The remainder of this paper is organized as follows. In Section 2, we introduce the system model, the adversary model, and technical preliminaries. We show the detailed design of TPDM in Section 3, and analyze its security in Section 4. In Section 5, we elaborate on the applications of TPDM to profile matching and distribution fitting. The evaluation results are presented in Section 6. We briefly review related work in Section 7. We conclude the paper in Section 8.

2 Preliminaries

In this section, we first describe a general system model for data markets. We then introduce the adversary model, and present corresponding security requirements on the design. We finally review the technical preliminaries.

2.1 System Model

As shown in Fig. 1, we consider a two-layer system model for data markets. The model has a data acquisition layer and a data trading layer. There are four major kinds of entities, including data contributors, a service provider, data consumers, and a registration center.

In the data acquisition layer, the service provider procures massive raw data from the data contributors, such as social network users, mobile smart devices, smart meters, and so on. In order to incentivize more data contributors to actively submit high-quality data, the service provider needs to reward those valid ones to compensate their data collection costs. For the sake of security, each registered data contributor is equipped with a tamper-proof device. The tamper-proof device can be implemented in the form of either specific hardware [16] or software [17]. It prevents any adversary from extracting the information stored in the device, including cryptographic keys, codes, and data.

We consider that the service provider is cloud based, and has abundant computing resources, network bandwidths, and storage space. Besides, she tends to offer semantically rich and value-added data services to data consumers rather than directly revealing sensitive raw data, e.g., social network analyses, probability distributions, personalized recommendations, and aggregate statistics.

The registration center maintains an online database of registrations, and assigns each registered data contributor an identity and a password to activate the tamper-proof device. Besides, she maintains an official website, called certificated bulletin board [18, 19], on which the legitimate system participants can publish essential information, e.g., whitelists, blacklists, resubmit-lists, and reward-lists of data contributors. Another duty of the registration center is to set up the parameters for a signature scheme and a cryptosystem. To avoid being a single point of failure or bottleneck, redundant registration centers, which have identical functionalities and databases, can be installed.

2.2 Adversary Model

In this section, we focus on attacks in practical data markets, and define corresponding security requirements.

First, we consider that a malicious data contributor or an external attacker may impersonate other legitimate data contributors to submit possibly bogus raw data. Besides, some malicious attackers may deliberately modify raw data during submission. Hence, the service provider needs to confirm that raw data are indeed sent unaltered by registered data contributors, i.e., to guarantee data authentication and data integrity in the data acquisition layer.

Second, the service provider in the data market might be greedy, and attempts to maximize her profit by launching the following two types of attacks:

• Partial data collection: To cut down the expenditure on data acquisition, the service provider may insert bogus data into the raw data set.

• No/Partial data processing: To reduce the operation cost, the service provider may try to return a fake result without processing the data from designated sources, or to provide data services based on a subset of the whole raw data set.

On one hand, to counter partial data collection attack, each data consumer should be enabled to verify whether raw data are really provided by registered data contributors, i.e., truthfulness of data collection in the data trading layer. On the other hand, the data consumer should have the capability to verify the correctness and completeness of a returned data service in order to combat no/partial data processing attack. We here use the term truthfulness of data processing in the data trading layer to represent the integrated requirement of correctness and completeness of data processing results.

Third, we assume that some honest-but-curious data contributors, the service provider, the data consumers, and external attackers, e.g., eavesdroppers, may glean sensitive information from raw data, and recognize real identities of data contributors for illegal purposes, e.g., an attacker can infer a data contributor’s home location from her GPS records. Hence, raw data of a data contributor should be kept secret from these system participants, i.e., data confidentiality. Besides, an outside observer cannot reveal a data contributor’s real identity by analysing data sets sent by her, i.e., identity preservation.

Fourth, a minority of data contributors may try to behave illegally, e.g., launching attacks as mentioned above, if there is no punishment. To prevent this threat, the registration center should have the ability to retrieve a data contributor’s real identity, and revoke it from further usage, when her signature is in dispute, i.e., traceability and revocability.

Last but not least, the semi-honest registration center may misbehave by trying to link a data contributor’s real identity with her raw data. Besides, if there is no detection or verification in the cryptosystem, she may deliberately corrupt the decrypted results. However, to guarantee full side information protection, the requirement on the registration center is that she cannot leak decrypted samples to irrelevant system participants. Moreover, she is required to perform an acknowledged number of decryptions in a specific data service [20], which should be publicly posted on the certificated bulletin board.

2.3 Admissible Pairing

In this section, we introduce admissible pairing, which is the basis of our design.

The multiplicative cyclic groups $\mathbb{G}_{1},\mathbb{G}_{2}$ , and $\mathbb{G}_{T}$ are of the same prime order $q$ . Let $g_{1}$ be a generator of $\mathbb{G}_{1}$ , and $g_{2}$ be a generator of $\mathbb{G}_{2}$ . An asymmetric bilinear map is a map $\hat{e}:\mathbb{G}_{1}\times\mathbb{G}_{2}\rightarrow\mathbb{G}_{T}$ with the following three properties:

• Bilinearity: $\forall X,Y\in\mathbb{G}_{1},\forall Z\in\mathbb{G}_{2},\forall a,b\in\mathbb{Z}_{q}^{*}$ ,

$\hat{e}(X^{a},Z^{b})=\hat{e}(X,Z)^{ab}.$

In addition,

$\hat{e}(XY,Z)=\hat{e}(X,Z)\cdot\hat{e}(Y,Z).$

• Non-degeneracy: $\hat{e}(g_{1},g_{2})\neq 1_{\mathbb{G}_{T}}.$

• Computability: Given $X\in\mathbb{G}_{1},Z\in\mathbb{G}_{2}$ , there exists an efficient algorithm to compute $\hat{e}(X,Z)$ .

We call such a bilinear map $\hat{e}$ an admissible pairing, which can be constructed based on elliptic curves with the modified Weil [21] or Tate pairing [22]. Each operation for computing $\hat{e}(X,Z)$ is called a pairing operation. A group that possesses such a map $\hat{e}$ is called a bilinear group, where the Decisional Diffie-Hellman (DDH) problem is easy, while the Computational Diffie-Hellman (CDH) problem is hard [21]. For example, given $(g_{1},{g_{1}}^{a},{g_{1}}^{b})$ for unknown $(a,b)$ , it is computationally intractable to compute ${g_{1}}^{ab}$ .
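The three properties can be checked numerically in a toy model. The sketch below is an illustration only, not a real pairing: each element of $\mathbb{G}_{1}$ and $\mathbb{G}_{2}$ is stored as its discrete logarithm, which makes the map trivial to compute but also makes CDH easy, so it offers no security. The constants `Q`, `P`, `GT_GEN` and all helper names are assumptions chosen for this example; a practical system would rely on a pairing library over elliptic curves.

```python
# Toy model of an admissible pairing, for illustration only (NOT secure).
# A group element g^x is stored as its exponent x mod Q, so discrete logs
# are public; only the algebraic identities are faithful to the text.
Q = 1019      # prime group order q (hypothetical toy value)
P = 2039      # prime with P = 2*Q + 1, so Z_P^* has a subgroup of order Q
GT_GEN = 4    # generator of that subgroup, standing in for e(g1, g2)


def g_mul(x: int, y: int) -> int:
    """Group operation in G1 or G2: g^x * g^y = g^(x + y)."""
    return (x + y) % Q


def g_pow(x: int, a: int) -> int:
    """Exponentiation in G1 or G2: (g^x)^a = g^(x * a)."""
    return (x * a) % Q


def pairing(x: int, z: int) -> int:
    """e(g1^x, g2^z) = e(g1, g2)^(x * z), returned as an element of Z_P^*."""
    return pow(GT_GEN, (x * z) % Q, P)


X, Y, Z = 123, 456, 789   # exponents of three sample group elements
a, b = 17, 35             # sample scalars

# Bilinearity: e(X^a, Z^b) = e(X, Z)^(ab)
lhs_1 = pairing(g_pow(X, a), g_pow(Z, b))
rhs_1 = pow(pairing(X, Z), a * b, P)

# Multiplicativity in the first argument: e(XY, Z) = e(X, Z) * e(Y, Z)
lhs_2 = pairing(g_mul(X, Y), Z)
rhs_2 = (pairing(X, Z) * pairing(Y, Z)) % P

# Non-degeneracy: e(g1, g2) is not the identity of GT
non_degenerate = pairing(1, 1) != 1
```

Both pairs `lhs_1`, `rhs_1` and `lhs_2`, `rhs_2` agree, which is exactly the bilinearity that the batch verification of signatures later relies on.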

3 Design of TPDM

In this section, we propose TPDM, which integrates data truthfulness and privacy preservation in data markets.

3.1 Design Rationales

Using the terminology from the signcryption scheme [23], TPDM is structured internally in an Encrypt-then-Sign fashion, using partially homomorphic encryption and identity-based signature. It forces the service provider to truthfully collect and process real data. The essence of TPDM is to first synchronize data processing and signature verification into the same ciphertext space, and then to tightly integrate data processing with outcome verification via the homomorphic properties. With the help of the architectural overview in Fig. 2, we illustrate the design rationales as follows.

Space Construction. The thorniest problem is how to enable the data consumer to verify the validity of signatures, while maintaining data confidentiality. If the signature scheme is applied to the plaintext space, the data consumer needs to know the content of raw data for verification. However, if we employ a conventional public key encryption scheme to construct the ciphertext space, the service provider has to decrypt and then process the data. Even worse, such a construction is vulnerable to the no/partial data processing attack, because the data consumer, only knowing the ciphertexts, fails to verify the correctness and completeness of the data service. Thus, the greedy service provider may reduce operation cost, by returning a fake result or manipulating the inputs of data processing. Therefore, we turn to the partially homomorphic cryptosystems for encryption, whose properties facilitate both data processing and outcome verification on the ciphertexts.

Batch Verification. After constructing the ciphertext space, we can let each data contributor digitally sign her encrypted raw data. Given the ciphertext and the signature, the service provider is able to verify data authentication and data integrity. Besides, we can treat the data consumer as a third party to verify the truthfulness of data collection. However, an immediate problem is that a sequential verification scheme may fail to meet the stringent time requirement of large-scale data markets. In addition, the maintenance of digital certificates also incurs significant communication overhead. To tackle these two problems, we propose an identity-based signature scheme, which supports two-layer batch verifications, while incurring small computation and communication overheads.

Breach Detection. Another problem in existing identity-based signature schemes is that the real identities are viewed as public parameters, and are not well protected. On the other hand, if all the real identities are hidden, none of the misbehaving data contributors can be identified. To meet these two seemingly contradictory requirements, we employ ElGamal encryption to generate pseudo identities for registered data contributors, and introduce a new third party, called registration center. Specifically, the registration center, who owns the private key, is the only authorized party to retrieve the real identities, and to revoke those malicious accounts from further usage.

3.2 Design Details

Following the guidelines given above, we now introduce TPDM in detail. TPDM consists of 5 phases: initialization, signing key generation, data submission, data processing and verifications, and tracing and revocation.

Phase I: Initialization

We assume that the registration center sets up the system parameters at the beginning of data trading as follows:

$\bullet$ The registration center chooses three multiplicative cyclic groups $\mathbb{G}_{1}$ , $\mathbb{G}_{2}$ , and $\mathbb{G}_{T}$ with the same prime order $q$ . Besides, $g_{1}$ is a generator of $\mathbb{G}_{1}$ , and $g_{2}$ is a generator of $\mathbb{G}_{2}$ . Moreover, these three cyclic groups compose an admissible pairing $\hat{e}:\mathbb{G}_{1}\times\mathbb{G}_{2}\rightarrow\mathbb{G}_{T}$ .

$\bullet$ The registration center randomly picks $s_{1},s_{2}\in\mathbb{Z}_{q}^{*}$ as her two master keys, and then computes

$P_{0}={g_{1}}^{s_{1}},\ P_{1}={g_{2}}^{s_{1}},\ \text{and}\ P_{2}={g_{2}}^{s_{2}}$

as public keys. The master keys $s_{1},s_{2}$ are preloaded into each registered data contributor’s tamper-proof device.

$\bullet$ The registration center sets up parameters for a partially homomorphic cryptosystem: a private key $\mathcal{SK}$ , a public key $\mathcal{PK}$ , an encryption scheme $E(\cdot)$ , and a decryption scheme $D(\cdot)$ .

$\bullet$ To activate the tamper-proof device, each registered data contributor $o_{i}$ is assigned with a “real” identity $\emph{RID}_{i}\in\mathbb{G}_{1}$ and a password $\emph{PW}_{i}$ . Here, $\emph{RID}_{i}$ uniquely identifies $o_{i}$ , while $\emph{PW}_{i}$ is required in the access control process.

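$\bullet$ The system parameters

$\{\hat{e},\mathbb{G}_{1},\mathbb{G}_{2},\mathbb{G}_{T},q,g_{1},g_{2},P_{0},P_{1},P_{2},\mathcal{PK},E(\cdot)\}$

are published on the certificated bulletin board.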

Phase II: Signing Key Generation

To achieve anonymous authentication in data markets, the tamper-proof device is utilized to generate a pair of pseudo identity $\emph{PID}_{i}$ and secret key $SK_{i}$ for each registered data contributor $o_{i}$ :

$\emph{PID}_{i}=\langle\emph{PID}_{i}^{1},\emph{PID}_{i}^{2}\rangle=\langle{g_{1}}^{r},\ \emph{RID}_{i}\odot{P_{0}}^{r}\rangle,$

$SK_{i}=\langle SK_{i}^{1},SK_{i}^{2}\rangle=\langle(\emph{PID}_{i}^{1})^{s_{1}},\ H(\emph{PID}_{i}^{2})^{s_{2}}\rangle.$

Here, $r$ is a per-session random nonce, $\odot$ represents the Exclusive-OR (XOR) operation, and $H(\cdot)$ is a MapToPoint hash function [21], i.e., $H(\cdot):\{0,1\}^{*}\rightarrow\mathbb{G}_{1}$ . Besides, $\emph{PID}_{i}$ is an ElGamal encryption [24] of the real identity $\emph{RID}_{i}$ over the elliptic curve, while $SK_{i}$ is generated accordingly by exploiting identity-based encryption (IBE) [21].
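As a rough illustration of the pseudo-identity step, the sketch below assumes the ElGamal form $\langle{g_{1}}^{r},\ \emph{RID}_{i}\odot{P_{0}}^{r}\rangle$ with $P_{0}={g_{1}}^{s_{1}}$, and shows how a holder of $s_{1}$ could undo the masking, since $({g_{1}}^{r})^{s_{1}}={P_{0}}^{r}$. It replaces the elliptic-curve group with a small subgroup of $\mathbb{Z}_{p}^{*}$, so the parameters `P`, `Q`, `G1_GEN` and the function names are hypothetical and far too small for real use.

```python
import secrets

P = 2039      # toy prime modulus with P = 2*Q + 1 (hypothetical, insecure size)
Q = 1019      # prime order of the subgroup standing in for G1
G1_GEN = 4    # generator g1 of that subgroup


def setup_master_key() -> tuple[int, int]:
    """Registration center: pick master key s1 and publish P0 = g1^s1."""
    s1 = secrets.randbelow(Q - 1) + 1
    return s1, pow(G1_GEN, s1, P)


def make_pseudo_identity(rid: int, p0: int) -> tuple[int, int]:
    """Tamper-proof device: PID = (g1^r, RID xor P0^r) with a fresh nonce r."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G1_GEN, r, P), rid ^ pow(p0, r, P)


def trace_real_identity(pid: tuple[int, int], s1: int) -> int:
    """Registration center: RID = PID2 xor (PID1)^s1, because (g1^r)^s1 = P0^r."""
    pid1, pid2 = pid
    return pid2 ^ pow(pid1, s1, P)


s1, p0 = setup_master_key()
rid = pow(G1_GEN, 77, P)                # a sample real identity in the group
pid_a = make_pseudo_identity(rid, p0)   # pseudo identity for one session
pid_b = make_pseudo_identity(rid, p0)   # a second session draws a new nonce
recovered = trace_real_identity(pid_a, s1)
```

Because the nonce `r` is drawn afresh per session, two pseudo identities of the same contributor generally look unrelated to an observer, while the registration center can still map either one back to `rid`.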

Phase III: Data Submission

For the submission of raw data, we need to jointly consider several security issues, including confidentiality, authentication, and integrity. To provide data confidentiality, we employ partially homomorphic encryption. Besides, to guarantee data authentication and data integrity, the encrypted raw data should be signed before submission, and also should be verified after reception.

$\blacktriangleright$ Data Encryption

Ahead of submission, each data contributor $o_{i}$ encrypts her raw data $U_{i}$ to different powers under the public key $\mathcal{PK}$ , and gets the ciphertext vector

[TABLE]

where $\mathbb{K}$ is a set of positive integers, and is determined by the requirements of data services, e.g., the location-based aggregate statistics [20] may require $\mathbb{K}=\{1\}$ , whereas in the fine-grained profile matching [25], $\mathbb{K}=\{1,2\}$ .

In general, evaluation on plaintexts is far more efficient than the time-consuming computation on ciphertexts. Therefore, we let each data contributor encrypt her raw data to different powers, which enables an optimization in data processing while incurring only a small overhead at each data contributor.

$\blacktriangleright$ Encrypted Data Signing

After encryption, each data contributor $o_{i}$ computes the signature $\sigma_{i}$ on the ciphertext vector $\vec{D}_{i}$ using her secret key:

[TABLE]

where “ $\cdot$ ” denotes the group operation in $\mathbb{G}_{1}$ , $h(\cdot)$ is a one-way hash function, e.g., SHA-1 [26], and $D_{i}$ is derived by concatenating all the elements of $\vec{D}_{i}$ together.

Eventually, the data contributor $o_{i}$ submits her tuple $\langle\emph{PID}_{i},\vec{D}_{i},\sigma_{i}\rangle$ to the service provider. Once receiving the tuple, the service provider is required to post the pseudo identity $\emph{PID}_{i}$ on the certificated bulletin board to guard against receiver repudiation. In addition, to prevent a registered data contributor from using the same pair of pseudo identity and secret key multiple times in different sessions of data acquisition (analogous to the replay attack scenario considered in [17]), one intuitive way is to let the service provider store the used pseudo identities for a later duplication check. Another feasible way is to encapsulate the signing phase into the tamper-proof device.

Phase IV: Data Processing and Verifications

In this phase, we consider two-layer batch verifications, i.e., verifications conducted by both the service provider and the data consumer. Between the two-layer batch verifications, we introduce data processing and signatures aggregation done by the service provider. At last, we present outcome verification conducted by the data consumer.

Before introducing the verifications, we first discuss the time period $\tau$ of data acquisition. In practice, $\tau$ is determined by the service provider, and is based on the timeliness of different data items. For example, stock data is streaming with a minimum update frequency of 1 minute on Investing [27], while smart meters collect the electrical usages every 15 minutes [28]. In what follows, we focus on one time period of data acquisition.

$\blacktriangleright$ First-Layer Batch Verification

We assume that the service provider receives a bundle of data tuples from $n$ distinct data contributors, denoted as $\{\langle\emph{PID}_{i},\vec{D}_{i},\sigma_{i}\rangle|i\in[1,n]\}$ , by the end of a time period. To prevent a malicious data contributor from impersonating other legitimate ones to submit possibly bogus data, the service provider needs to verify the validity of the signatures by checking whether

[TABLE]

Compared with single signature verification, this batch verification scheme can dramatically reduce the verification latency, especially when verifying a large number of signatures. Since the three pairing operations in Equation (6) dominate the overall computation cost, the batch verification time is almost constant as long as the time overhead of $n$ MapToPoint hashings and $n$ exponentiations is small enough to be omitted. However, in a practical data market, when the number of data contributors is very large, the expensive pairing operations no longer dominate the verification time. We will expand on this point in Section 6.1.
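The algebra behind the batch check can be illustrated with a deliberately insecure toy model. The sketch below represents each group element by its exponent, so multiplying elements becomes addition and the pairing becomes multiplication modulo a prime. The form of the check (one pairing on the aggregate signature, one on the aggregated first identity components, one on the aggregated hashed terms) is an assumption inferred from the description of three pairings, $n$ MapToPoint hashings, and $n$ exponentiations. The names `P1`, `P2`, `keygen`, `sign`, and `batch_verify` are hypothetical, SHA-256 stands in for both hash functions, and an opaque byte string stands in for the second component of the pseudo identity.

```python
import hashlib
import secrets

# Toy model: a group element g^x is stored as its exponent x mod Q, so the
# group operation is addition and the "pairing" is multiplication of exponents.
# This exposes every secret and only serves to check the algebra.
Q = 2**61 - 1  # a Mersenne prime standing in for the group order q


def h(msg: bytes) -> int:
    """One-way hash into Z_q (SHA-256 used as a stand-in)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % Q


def map_to_point(data: bytes) -> int:
    """Stand-in for MapToPoint H(.): exponent of an element of G1."""
    return h(b"H|" + data)


def pairing(x: int, y: int) -> int:
    """e(g1^x, g2^y) = gT^(x*y), tracked in the exponent."""
    return (x * y) % Q


s1 = secrets.randbelow(Q - 1) + 1  # master key s_1
s2 = secrets.randbelow(Q - 1) + 1  # master key s_2
G2 = 1                             # exponent of the generator g2
P1, P2 = s1, s2                    # exponents of g2^s1, g2^s2 (assumed public keys)


def keygen(pid2: bytes):
    """Return a toy pseudo identity and the matching secret key."""
    pid1 = secrets.randbelow(Q - 1) + 1       # g1^r
    sk1 = (pid1 * s1) % Q                     # (PID^1)^s1
    sk2 = (map_to_point(pid2) * s2) % Q       # H(PID^2)^s2
    return (pid1, pid2), (sk1, sk2)


def sign(sk, data: bytes) -> int:
    """sigma = SK^1 * (SK^2)^h(D), written additively in the exponent."""
    sk1, sk2 = sk
    return (sk1 + sk2 * h(data)) % Q


def batch_verify(tuples) -> bool:
    """Check all (pid, data, sigma) tuples with three toy pairings."""
    agg_sig = sum(sig for _, _, sig in tuples) % Q
    agg_pid1 = sum(pid[0] for pid, _, _ in tuples) % Q
    agg_hash = sum(map_to_point(pid[1]) * h(d) for pid, d, _ in tuples) % Q
    lhs = pairing(agg_sig, G2)
    rhs = (pairing(agg_pid1, P1) + pairing(agg_hash, P2)) % Q
    return lhs == rhs
```

In this model the number of pairings stays at three regardless of how many tuples are in the batch, while the hashing and exponentiation work (the two `sum` loops) grows linearly with $n$, which mirrors the cost argument above.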

$\blacktriangleright$ Data Processing and Signatures Aggregation

Instead of directly trading raw data for revenue, more and more service providers tend to trade value-added data services. Typical examples of data services include social network analyses, personalized recommendations, location-based services, and probability distribution fittings.

To facilitate generating a precise and customized strategy in targeted data services, e.g., personalized recommendation and location-based service, the data consumer also needs to provide her own ciphertext vector $\vec{D}_{0}$ and a threshold $\delta$ . Here, $\vec{D}_{0}$ is generated from the data consumer’s information $V$ as follows:

[TABLE]

where $\bar{k}_{i},\omega_{i}$ are parameters determined by a concrete data service. For example, the profile-matching service in Section 5.1 requires $\bar{k}_{i}\in\{1,2\}$ and $\omega_{i}\in\{-2,1\}$ .

Now, the service provider can process the collected data as required by the data consumer. We model such a data processing in the plaintext space as

[TABLE]

for generality. Accordingly, $f$ can be equivalently evaluated in the ciphertext space using

[TABLE]

The equivalent transformation from $f$ to $F$ is based on the properties of the partially homomorphic cryptosystem, e.g., homomorphic addition $\oplus$ and homomorphic multiplication $\otimes$ , which are arithmetic operations on the ciphertexts that are equivalent to the usual addition and multiplication on the plaintexts, respectively. Hence, only polynomial functions can be computed in a straightforward way. Nevertheless, most non-polynomial functions, e.g., sigmoid and rectified linear activation functions in machine learning, can be well approximated/handled by polynomials [29]. Besides, the function $f$ is determined by the data processing method, and the choice of a specific partially homomorphic cryptosystem should support the basic operation(s) in $f$ . For example, the primitive of aggregate statistics [20] is addition, so the Paillier scheme [30] can be the first choice; while the distance calculation [31] requires one more multiplication, thus, the BGN scheme [32] may be preferred. Furthermore, in Equation (9), $\vec{D}_{0}$ is the data consumer’s ciphertext vector, and $\vec{D}_{c_{i}}$ indicates that the data contributor $o_{c_{i}}$ is one of the $m$ valid data contributors. More precisely, $m$ is the size of whitelist on the certificated bulletin board, and its default value is $n$ . However, if either of the two-layer batch verifications fails, $m$ will be updated in the following tracing and revocation phase. For brevity in notations, we use $\mathbb{C}$ to denote the indexes of $m$ valid data contributors, i.e., $\mathbb{C}=\{c_{1},c_{2},\ldots,c_{m}\}$ .
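To make homomorphic addition concrete, the following is a minimal sketch of a textbook Paillier cryptosystem applied to an aggregate statistic with $\mathbb{K}=\{1,2\}$: each contributor encrypts $u$ and $u^{2}$, and the service provider combines ciphertexts without seeing any plaintext. The primes are far too small for real use, and the helper names (`encrypt`, `decrypt`, `h_add`, `aggregate`) are illustrative assumptions rather than part of TPDM.

```python
import math
import secrets

# Toy key material: two known primes, far too small for real deployments.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private key: lcm(p-1, q-1)
MU = pow(LAM, -1, N)                               # lambda^{-1} mod n


def encrypt(m: int) -> int:
    """Paillier encryption with generator g = n + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Paillier decryption: L(c^lambda mod n^2) * mu mod n."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def h_add(c1: int, c2: int) -> int:
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return (c1 * c2) % N2


def aggregate(ciphertexts) -> int:
    """Fold a list of ciphertexts into an encryption of the plaintext sum."""
    total = encrypt(0)
    for c in ciphertexts:
        total = h_add(total, c)
    return total


# Each contributor submits E(u) and E(u^2), i.e. K = {1, 2}.
raw = [3, 5, 8]
enc_first = [encrypt(u) for u in raw]
enc_second = [encrypt(u * u) for u in raw]

# The service provider works on ciphertexts only; just two values are decrypted.
sum_u = decrypt(aggregate(enc_first))
sum_u2 = decrypt(aggregate(enc_second))
mean = sum_u / len(raw)
variance = sum_u2 / len(raw) - mean**2
```

Because the second powers were encrypted by the contributors, an additive-only scheme suffices to recover the variance here; without them, the provider would need a homomorphic multiplication.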

Next, the service provider sends $R$ to the registration center for decryption. We note that the registration center may perform decryption only an acknowledged number of times, which should be publicly announced on the certificated bulletin board. For example, in the aggregate statistic over a valid dataset of size $m$ , the registration center just needs to do one decryption, and cannot do more than required. The reason is that, with additional decryptions, the service provider could still obtain the correct aggregate result by having all $m$ encrypted raw data decrypted, which would expose the raw data.

Upon getting the plaintext $\gamma$ , the service provider can compare it with $\delta$ , and obtain the comparison result $\vartheta$ . For convenience, the concrete-value result $\gamma$ and the comparison result $\vartheta$ are collectively called outcome. We note that the outcome may be in different formats, e.g., average speed in location-based aggregate statistic [20], shopping suggestion in private recommendation [33], and friending strategy in social networking [25]. We assume that the outcome involves $\phi$ candidate data contributors, and the subscripts of their pseudo identities are denoted as $\mathbb{I}=\left\{I_{1},I_{2},\cdots,I_{\phi}\right\}.$

After data processing, to further reduce communication overhead, the service provider can aggregate $\phi$ candidate signatures into one signature. In our scheme, the aggregate signature $\sigma=\prod_{i\in\mathbb{I}}\sigma_{i}.$ Then, the service provider sends the final tuple to the data consumer, including the data service outcome, the aggregate signature $\sigma$ , the index set $\mathbb{I}$ , and $\phi$ candidate ciphertexts $\{\vec{D}_{i}|i\in\mathbb{I}\}$ .

$\blacktriangleright$ Second-Layer Batch Verification

Similar to the first-layer batch verification, the data consumer can verify the legitimacy of the $\phi$ candidate data sources by checking whether

[TABLE]

Here, the pseudo identities on the right hand side of the above equation can be fetched from the certificated bulletin board according to the index set $\mathbb{I}$ .

$\blacktriangleright$ Outcome Verification

The homomorphic properties also enable the data consumer to verify the truthfulness of data processing. Under the condition that the data consumer knows her plaintext $V$ , all the cross terms involving $\vec{D}_{0}$ in Equation (9) can be evaluated through multiplication by a constant $V$ . Hence, part of the most time-consuming homomorphic multiplications in the original data processing are no longer needed in outcome verification. Besides, to verify correctness alone, the data consumer just needs to evaluate on the $\phi$ candidate ciphertexts. Of course, she reserves the right to require the service provider to send her the other $(m-\phi)$ valid ones, on which the completeness can be verified.

In fact, if $\phi$ or $m-\phi$ is too large, the data consumer can take the strategy of random sampling for verification, where the $m$ valid pseudo identities on the certificated bulletin board can be used for the sampling indexes. Random sampling is a tradeoff between security and efficiency, and we shall illustrate its feasibility in Section 5 and Section 6.1.
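The security side of this tradeoff can be quantified with a standard combinatorial estimate. As a rough sketch, assume the data consumer samples items uniformly without replacement and that a fixed number of the $m$ results were not processed truthfully; the function below, whose name and parameters are hypothetical, returns the chance that the sample hits at least one of them. This is a generic calculation, not the analysis given in Section 5 or Section 6.1.

```python
from math import comb


def detection_probability(m: int, tampered: int, sample: int) -> float:
    """Probability that a uniform sample of `sample` items drawn without
    replacement from `m` items contains at least one of `tampered` bad items."""
    if sample > m - tampered:
        return 1.0  # the sample cannot avoid every tampered item
    return 1.0 - comb(m - tampered, sample) / comb(m, sample)
```

For instance, `detection_probability(1000, 50, 100)` is much larger than `detection_probability(1000, 50, 10)`, so a modest increase in sample size buys a large gain in detection when a noticeable fraction of results is tampered with.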

Phase V: Tracing and Revocation

The two-layer batch verifications only hold when all the signatures are valid, and fail even when there is a single invalid signature. In practice, a signature batch may contain invalid one(s) caused by accidental data corruption or possibly malicious activities launched by an external attacker. A traditional batch verifier would reject the entire batch, even if there is a single invalid signature, and thus waste the other valid data items. Therefore, tracing and/or recollecting invalid data items and their corresponding signatures are important in practice. If the second-layer batch verification fails, the data consumer can require the service provider to find out the invalid signature(s). Similarly, if the first-layer batch verification fails, the service provider has to find out the invalid one(s) by herself.

To extract invalid signatures, as shown in Algorithm 1, we propose an $\ell$ -depth-tracing algorithm. We consider that the batch contains $n$ signatures. In addition, the whitelist, the blacklist, and the resubmit-list of pseudo identities are global variables, and are initialized as empty sets. If a batch verification fails, the service provider first finds out the mid-point as $mid=\lfloor\frac{1+n}{2}\rfloor$ (Line 9). Then, she performs batch verification on the first half ( $head$ to $mid$ ) (Line 10) and the second half ( $mid+1$ to $tail$ ) (Line 11), respectively. If either of these two halves causes a failure, the service provider repeats the same process on it. Otherwise, she adds the pseudo identities from the valid half to the whitelist (Line 4-5). The recursive process terminates if the validity of all the signatures has been identified or a pre-defined limit of search depth is reached (Line 2). A special case is the single signature verification, in which the service provider can determine its validity (Line 6-7). After this algorithm, the service provider can form the resubmit-list of pseudo identities by excluding those in the other two lists.
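The recursive search described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: it uses 0-based indexes, takes the batch verifier as a callback named `batch_ok` (a hypothetical stand-in for the pairing-based check), and returns the three lists instead of keeping them as global variables.

```python
from typing import Callable, List, Sequence, Set, Tuple


def l_depth_tracing(
    pids: Sequence[str],
    batch_ok: Callable[[Sequence[str]], bool],
    limit: int,
) -> Tuple[Set[str], Set[str], List[str]]:
    """Split a failing batch recursively, up to `limit` levels deep."""
    whitelist: Set[str] = set()
    blacklist: Set[str] = set()

    def trace(head: int, tail: int, depth: int) -> None:
        batch = pids[head : tail + 1]
        if batch_ok(batch):
            whitelist.update(batch)          # whole sub-batch is valid
        elif head == tail:
            blacklist.add(pids[head])        # a single invalid signature
        elif depth < limit:
            mid = (head + tail) // 2
            trace(head, mid, depth + 1)      # first half
            trace(mid + 1, tail, depth + 1)  # second half
        # Otherwise the depth limit is reached and the batch stays unresolved.

    if pids:
        trace(0, len(pids) - 1, 0)

    # Anything neither whitelisted nor blacklisted must be resubmitted.
    resubmit = [p for p in pids if p not in whitelist and p not in blacklist]
    return whitelist, blacklist, resubmit
```

With eight signatures of which one is invalid, a depth limit of 3 isolates the invalid one exactly, whereas a limit of 1 leaves the whole failing half on the resubmit-list. The limit therefore trades extra batch verifications against the number of valid items that have to be collected again.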

According to the blacklist on the certificated bulletin board, the registration center can reveal the real identities of those invalid data contributors. Given the data contributor $o_{i}$ ’s pseudo identity $\emph{PID}_{i}$ , the registration center can use her master key $s_{1}$ to perform revealing by computing

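$$\emph{PID}_{i}^{2}\odot{\emph{PID}_{i}^{1}}^{s_{1}}=\emph{RID}_{i}\odot{g_{1}}^{s_{1}r}\odot{g_{1}}^{s_{1}r}=\emph{RID}_{i}.\tag{11}$$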

Upon getting a misbehaved data contributor’s real identity, the registration center can revoke it from further usage if necessary, e.g., deleting her account from the online registration database. Thus, the revoked data contributor can no longer activate the tamper-proof device, which indicates that she does not have the right to submit data any more.
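The masking and unmasking of the real identity can be illustrated with a toy version of the ElGamal-type construction. The sketch assumes the pseudo identity has the shape $\langle{g_{1}}^{r},\emph{RID}_{i}\odot{P_{0}}^{r}\rangle$ with $P_{0}={g_{1}}^{s_{1}}$, as used in the Game 2 setup of Section 4.1. It replaces the elliptic-curve group with integers modulo a prime, and the function names are hypothetical.

```python
import secrets

P = 2**61 - 1  # toy prime modulus; the scheme itself uses an elliptic-curve group
G = 3          # arbitrary base element for the toy group

s1 = secrets.randbelow(P - 2) + 1  # master key s_1 held by the registration center
P0 = pow(G, s1, P)                 # public key P_0 = g^{s_1}


def make_pseudo_identity(rid: int):
    """Mask the real identity with a fresh nonce (done inside the device)."""
    r = secrets.randbelow(P - 2) + 1  # per-session random nonce
    pid1 = pow(G, r, P)               # g^r
    pid2 = rid ^ pow(P0, r, P)        # RID xor P_0^r
    return pid1, pid2


def reveal_real_identity(pid, master_key: int) -> int:
    """Unmask with the master key: PID^2 xor (PID^1)^{s_1}."""
    pid1, pid2 = pid
    return pid2 ^ pow(pid1, master_key, P)
```

Since `pow(pid1, s1, P)` equals `pow(P0, r, P)`, the two XOR masks cancel and only the holder of `s1` recovers the real identity; a fresh nonce per session yields a different-looking pseudo identity each time.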

4 Security Analysis

In this section, we analyze the security of TPDM in terms of the desirable properties set out in Section 2.2.

4.1 Data Authentication and Data Integrity

Data authentication and data integrity are regarded as two basic security requirements in the data acquisition layer. The signature in TPDM $\sigma_{i}=SK_{i}^{1}\cdot{SK_{i}^{2}}^{h(D_{i})}$ is actually a one-time identity-based signature. We now prove that if the Computational Diffie-Hellman (CDH) problem in the bilinear group $\mathbb{G}_{1}$ is hard [21], an attacker cannot successfully forge a valid signature on behalf of any registered data contributor except with a negligible probability.

First, we consider Game 1 between a challenger and an attacker as follows:

Setup: The challenger starts by giving the attacker the system parameters $g_{1}$ and $P_{0}$ . The challenger also offers a pseudo identity $\emph{PID}_{i}=\langle\emph{PID}_{i}^{1},\emph{PID}_{i}^{2}\rangle$ to the attacker, which simulates the condition that the pseudo identities are posted on the certificated bulletin board in TPDM.
Query: We assume that the attacker does not know how to compute the MapToPoint hash function $H(\cdot)$ and the one-way hash function $h(\cdot)$ . However, she can ask the challenger for the value $H(\emph{PID}_{i}^{2})$ and the one-way hashes $h(\cdot)$ for up to $n$ different messages.
Challenge: The challenger asks the attacker to pick two random messages $M_{i_{1}}$ and $M_{i_{2}}$ , and to generate two corresponding signatures $\sigma_{i_{1}}$ and $\sigma_{i_{2}}$ on behalf of the data contributor $o_{i}$ .
Guess: The attacker returns $\langle M_{i_{1}},\sigma_{i_{1}}\rangle$ and $\langle M_{i_{2}},\sigma_{i_{2}}\rangle$ to the challenger. We denote the attacker’s advantage in winning Game 1 to be

[TABLE]

We further claim that our signature scheme is adaptively secure against existential forgery, if $\epsilon_{1}$ is negligible. We prove our claim using Game 2 by reduction [34].

Second, we assume that there exists a probabilistic polynomial-time algorithm $\mathcal{A}$ such that it has the same non-negligible advantage $\epsilon_{1}$ as the attacker in Game 1. Then, we will construct Game 2, in which an attacker $\mathcal{B}$ can make use of $\mathcal{A}$ to break the CDH assumption with non-negligible probability. In particular, $\mathcal{B}$ is given $(g_{1},{g_{1}}^{a},{g_{1}}^{b},{g_{1}}^{c},d)$ for unknown $(a,b,c)$ and known $d$ , and is asked to compute ${g_{1}}^{2ab}\cdot{g_{1}}^{cd}$ . We note that computing ${g_{1}}^{2ab}\cdot{g_{1}}^{cd}$ is as hard as computing ${g_{1}}^{ab}$ , which is the original CDH problem. We present the details of Game 2 as follows:

Setup: $\mathcal{B}$ makes up the parameters $g_{1}$ and $P_{0}={g_{1}}^{a}$ , where $a$ plays the role of the master key $s_{1}$ in TPDM. Besides, $\mathcal{B}$ also provides $\mathcal{A}$ with a pseudo identity $\emph{PID}_{i}=\langle\emph{PID}_{i}^{1},\emph{PID}_{i}^{2}\rangle=\langle{g_{1}}^{b},\emph{RID}_{i}\odot{g_{1}}^{ab}\rangle$ . Here, $b$ functions as the random nonce $r$ in TPDM.
Query: $\mathcal{A}$ then asks $\mathcal{B}$ for the value $H(\emph{PID}_{i}^{2})^{s_{2}}$ , and $\mathcal{B}$ replies with ${g_{1}}^{c}$ . We note that $H(\emph{PID}_{i}^{2})$ is the only MapToPoint hash operation needed to forge the data contributor $o_{i}$ ’s valid signatures. Besides, $\mathcal{A}$ picks $n$ random messages, and requests $\mathcal{B}$ for their one-way hash values $h(\cdot)$ . $\mathcal{B}$ answers these queries using a random oracle: $\mathcal{B}$ maintains a table to store all the answers. Upon receiving a message, if the message has been queried before, $\mathcal{B}$ answers with the stored value; otherwise, she answers with a random value, which is stored into the table for later usage. Except for the $x$ -th and $y$ -th queries (i.e., messages $M_{x}$ and $M_{y}$ ), $\mathcal{B}$ answers with the values $d_{1}$ and $d_{2}$ , respectively, where $d_{1}+d_{2}=d$ .
Challenge: When the query phase is over, $\mathcal{B}$ asks $\mathcal{A}$ to choose two random messages $M_{i_{1}}$ and $M_{i_{2}}$ , and to sign them on behalf of the data contributor $o_{i}$ .
Guess: $\mathcal{A}$ returns two signatures $\sigma_{i_{1}}$ and $\sigma_{i_{2}}$ on the messages $M_{i_{1}}$ and $M_{i_{2}}$ to $\mathcal{B}$ . We note that $M_{i_{1}}$ and $M_{i_{2}}$ must be within the $n$ queried messages; otherwise, $\mathcal{A}$ does not know $h(M_{i_{1}})$ and $h(M_{i_{2}})$ . Furthermore, if $M_{i_{1}}=M_{x}$ and $M_{i_{2}}=M_{y}$ or $M_{i_{1}}=M_{y}$ and $M_{i_{2}}=M_{x}$ , $\mathcal{B}$ then computes $\sigma_{i_{1}}\cdot\sigma_{i_{2}}$ , which is equivalent to:

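$$\sigma_{i_{1}}\cdot\sigma_{i_{2}}={SK_{i}^{1}}^{2}\cdot{SK_{i}^{2}}^{d_{1}+d_{2}}={g_{1}}^{2ab}\cdot{g_{1}}^{cd}.\tag{13}$$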

After obtaining $\sigma_{i_{1}}\cdot\sigma_{i_{2}}$ , $\mathcal{B}$ solves the given CDH instance successfully. We note that $\mathcal{A}$ ’s advantage in breaking TPDM is $\epsilon_{1}$ , and the probability that $\mathcal{A}$ picks $M_{x}$ and $M_{y}$ is $\frac{2}{n(n-1)}$ . Thus, the probability of $\mathcal{B}$ ’s success is:

$$\epsilon_{2}=\frac{2\epsilon_{1}}{n(n-1)}.\tag{14}$$

Since $\epsilon_{1}$ is non-negligible, $\mathcal{B}$ can solve the CDH problem with the non-negligible probability $\epsilon_{2}$ , which contradicts the assumption that the CDH problem is hard. This completes our proof. Therefore, our signature scheme is adaptively secure under the random oracle model.
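The table-based answering strategy that $\mathcal{B}$ uses in the Query step of Game 2 is a lazily sampled, programmable random oracle. A minimal sketch is given below, assuming hash values live in $\mathbb{Z}_{q}$ and that $d_{1}+d_{2}=d$ is taken modulo $q$. The class name and its parameters are hypothetical.

```python
import secrets


class ProgrammableOracle:
    """Answers hash queries at random, except for two programmed positions."""

    def __init__(self, q: int, x: int, y: int, d: int) -> None:
        self.q = q
        self.x, self.y = x, y           # 1-based positions of the special queries
        self.d1 = secrets.randbelow(q)  # programmed answer for the x-th query
        self.d2 = (d - self.d1) % q     # chosen so that d1 + d2 = d (mod q)
        self.table = {}                 # message -> stored answer
        self.fresh = 0                  # number of distinct messages seen

    def query(self, msg: bytes) -> int:
        if msg in self.table:           # repeated query: same answer as before
            return self.table[msg]
        self.fresh += 1
        if self.fresh == self.x:
            value = self.d1
        elif self.fresh == self.y:
            value = self.d2
        else:
            value = secrets.randbelow(self.q)
        self.table[msg] = value
        return value
```

The reduction succeeds only when the forger happens to sign exactly the two programmed messages, which is where the factor $\frac{2}{n(n-1)}$ in the success probability comes from.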

Last but not least, the first-layer batch verification scheme in TPDM is correct if and only if Equation (6) holds. By capitalizing on the bilinear property of admissible pairing, the left hand side of Equation (6) expands as:

[TABLE]

which is the right hand side as required.

In conclusion, our novel identity-based signature scheme is provably secure, and the properties of data authentication and data integrity are achieved.

4.2 Truthfulness of Data Collection

To guarantee the truthfulness of data collection, we need to combat the partial data collection attack defined in Section 2.2. We note that it is just a special case of Game 1 in Section 4.1, where the service provider is the attacker. Hence, it is infeasible for the service provider to forge valid signatures on behalf of any registered data contributor. Such an appealing property prevents the service provider from injecting spurious data undetectably, and forces her to truthfully collect real data.

Similar to data authentication and data integrity, the data consumer can verify the truthfulness of data collection by performing the second-layer batch verification with Equation (10). Proof of correctness is similar to Equation (15) for the first-layer batch verification, where we can just replace the aggregate signature $\sigma$ with $\prod_{i\in\mathbb{I}}\sigma_{i}$ .

4.3 Truthfulness of Data Processing

We now analyze the truthfulness of data processing from two aspects, i.e., correctness and completeness.

Correctness. TPDM ensures the truthfulness of data collection, which is the premise of a correct data service. Then, given a truthfully collected dataset, the data consumer can evaluate the $\phi$ candidate data sources, which is consistent with the original data processing due to the homomorphic properties of the partially homomorphic cryptosystem.

Completeness. In fact, our design provides the property of completeness by guaranteeing the correctness of $n$ , $m$ , and $\phi$ , which are the numbers of total, valid, and candidate data contributors, respectively:

First, the service provider cannot deliberately omit a data contributor’s real data. The reason is that if the data contributor has submitted her encrypted raw data, without finding her pseudo identity on the certificated bulletin board, she would obtain no reward for data contribution. Therefore, she has incentives to report data missing to the registration center, which in turn ensures the correctness of $n$ .

Second, we consider that the service provider compromises the number of valid data contributors $m$ in two ways: one is to put a valid data contributor’s pseudo identity into the blacklist; the other is to put an invalid pseudo identity into the whitelist. We discuss these two cases separately: 1) In the first case, the valid data contributor would not only receive no reward, but may also be revoked from the online registration database. Hence, she has strong incentives to resort to the registration center for arbitration. Besides, we claim that the service provider can win the arbitration only with negligible probability. We give the detailed proof via Game 3 between a challenger and an attacker:

Setup: The challenger first gives the attacker $m$ valid data tuples, denoted as $\{\langle\emph{PID}_{i},\vec{D}_{i},\sigma_{i}\rangle|i\in\mathbb{C}\}$ . This simulates the data submissions from $m$ valid data contributors.
Challenge: The challenger asks the attacker to pick a random data contributor $o_{i}$ within the $m$ given ones, and then requests the attacker to generate a signature $\sigma_{i}^{*}$ on the ciphertext vector $\vec{D}_{i}$ .
Guess: The attacker returns $\sigma_{i}^{*}$ to the challenger. The attacker wins Game 3, if $\sigma_{i}^{*}\neq\sigma_{i}$ , $\sigma_{i}^{*}$ passes the challenger’s verification, and $\sigma_{i}$ fails in the verification.

Next, we demonstrate that the attacker’s winning probability in Game 3, denoted as

[TABLE]

is negligible. On one hand, the verification scheme in TPDM is publicly verifiable, which indicates that the challenger can verify the legitimacies of both $\sigma_{i}^{*}$ and $\sigma_{i}$ through checking whether

[TABLE]

hold at the same time. We note that the above two equations conform to the formula of single signature verification, i.e., $n=1$ in Equation (6). However, the second one contradicts our assumption that $o_{i}$ is a valid data contributor. On the other hand, $\sigma_{i}^{*}$ passes the challenger’s verification, while $\sigma_{i}^{*}$ is not equal to $\sigma_{i}$ , which implies that $\sigma_{i}^{*}$ is a valid signature forged by the attacker. As shown in Game 1, the probability of successfully forging a valid signature $\epsilon_{1}$ is negligible, and thus the attacker’s winning probability in Game 3 $\epsilon_{3}$ is negligible as well. This completes our proof of thwarting the first case; 2) The second case is essentially the tracing and revocation phase in Section 3.2, where the batch of signatures contains invalid ones. Therefore, this case cannot pass two-layer batch verifications. Besides, the greedy service provider has no incentives to reward those invalid data contributors, which could in turn destabilize the data market. Jointly considering the above two cases, our scheme TPDM can guarantee the correctness of $m$ .

Third, as stated in outcome verification, the data consumer reserves the right to verify over all $m$ valid data items, and the service provider cannot just process a subset without being found. Thus, the correctness of $\phi$ is assured.

In a nutshell, TPDM can guarantee the truthfulness of data processing in the data trading layer.

4.4 Data Confidentiality

Considering the potential economic value and the sensitive information contained in raw data, data confidentiality is a necessity in data markets. Since partially homomorphic encryption provides semantic security (e.g., [30, 24, 32]), by definition, except the registration center, any probabilistic polynomial-time adversary cannot reveal the contents of raw data. Moreover, although the registration center holds the private key, she cannot learn the sensitive raw data either, since neither the service provider nor the data consumer directly forwards the original ciphertexts of the data contributors for decryption. Therefore, data confidentiality is achieved against all these system participants.

4.5 Identity Preservation

To protect a data contributor’s unique identifier in data markets, her real identity is converted into a random pseudo identity. We note that the two parts of a pseudo identity are actually two items of an ElGamal-type ciphertext, which is semantically secure under the chosen plaintext attack [24]. Furthermore, the linkability between a data contributor’s signatures does not exist, because the pseudo identities for different signing instances are indistinguishable. Hence, identity preservation can be ensured.

4.6 Semi-honest Registration Center

The registration center in TPDM performs two main tasks: one is to maintain the online database of legal registrations; the other is to set up the partially homomorphic cryptosystem.

First, as we have clarified in Section 4.4, TPDM guarantees data confidentiality against the registration center. Thus, although she maintains the database of real identities, she cannot link them with corresponding raw data. Second, partially homomorphic encryption schemes (e.g., [30, 24, 32]) normally provide a proof of decryption, which indicates that the registration center cannot corrupt the decrypted results undetectably. Hence, she virtually has no effect on data processing and outcome verification. At last, we will further show the feasibility of distributing multiple registration centers in our evaluation part.

5 Two Practical Data Markets

In this section, from a practical standpoint, we consider two practical data markets, which provide fine-grained profile matching and multivariate Gaussian distribution fitting, respectively. The major difference between these two data markets is whether the data consumer has inputs.

5.1 Fine-grained Profile Matching

We first elaborate on a classic data service in social networking, i.e., fine-grained profile matching. Unlike the directly interactive scenario in [25], our centralized data market breaks the limit of neighborhood finding. In particular, a data consumer’s friending strategy can be derived from a large scale of data contributions. For convenience, we shall not differentiate “profile” from “raw data” in the profile-matching scenario considered here.

During the initial phase of profile matching, the service provider, e.g., Twitter or OkCupid, defines a public attribute vector consisting of $\beta$ attributes $\mathbf{A}=(A_{1},A_{2},\cdots,A_{\beta})$ , where $A_{i}$ corresponds to a personal interest, such as movie, sports, cooking, and so on. Then, to create a fine-grained personal profile, a data contributor $o_{i}$ , e.g., a Twitter or OkCupid user, selects an integer $u_{ij}\in[0,\theta]$ to indicate her level of interest in $A_{j}\in\mathbf{A}$ , and thus forms her profile vector $\vec{U}_{i}=(u_{i1},u_{i2},\cdots,u_{i\beta}).$ Subsequently, the data contributor $o_{i}$ submits $\vec{U}_{i}$ to the service provider for matching process.

To facilitate profile matching, the data consumer also needs to provide her profile vector $\vec{V}=(v_{1},v_{2},\cdots,v_{\beta})$ and an acceptable similarity threshold $\delta$ , where $\delta$ is a non-negative integer. Without loss of generality, we assume that the service provider employs Euclidean distance $f(\cdot)$ to measure the similarity between the data contributor $o_{i}$ and the data consumer, where $f(\vec{U}_{i},\vec{V})=\sqrt{\sum_{j=1}^{\beta}{(u_{ij}-v_{j})}^{2}}$ . We note that if $f(\vec{U}_{i},\vec{V})<\delta,$ then the data contributor $o_{i}$ is a matching target to the data consumer. In what follows, to simplify construction, we convert the matching metric $f(\vec{U}_{i},\vec{V})<\delta$ to its squared form $\sum_{j=1}^{\beta}{(u_{ij}-v_{j})}^{2}<\delta^{2}.$

5.1.1 Recap of Adversary Model

Before introducing our concrete construction, we first give a brief review of the adversary model and corresponding security requirements in the context of profile matching.

As shown in Fig. 3, Alice and Bob are registered data contributors, and Charlie is a data consumer. Here, the partial data collection attack means that to reduce data acquisition cost, the service provider may insert unregistered/fake David’s profile. Besides, the partial data processing attack indicates that to reduce operation cost, the service provider may just evaluate the similarity between Charlie and Alice, while generating a random result for Bob. Moreover, the no data processing attack implies that the service provider just returns two random matching results without processing the profiles of both Alice and Bob.

Our joint security requirements of privacy preservation and data truthfulness mainly include two aspects: 1) Without leaking the real identities and the profiles of Alice and Bob, the service provider needs to prove the legitimacies of Alice and Bob to Charlie; 2) Without revealing Alice’s and Bob’s profiles, Charlie can verify the correctness and completeness of returned matching results.

5.1.2 BGN-Based Construction

Given the profile-matching scenario considered here, we utilize a partially homomorphic encryption scheme based on bilinear maps, called Boneh-Goh-Nissim (BGN) cryptosystem [32]. This is because we only require the oblivious evaluation of quadratic polynomials, i.e., $\sum_{j=1}^{\beta}{(u_{ij}-v_{j})}^{2}$ . In particular, the BGN scheme supports any number of homomorphic additions after a single homomorphic multiplication. Now, we briefly introduce how to adapt TPDM to this practical data market. Due to the limitation of space, we focus on the major phases, including data submission, data processing, and outcome verification.

Data Submission: When a data contributor $o_{i}$ intends to submit her profile $\vec{U}_{i}$ , she employs the BGN scheme to do encryption, and gets the ciphertext vector:

[TABLE]

Afterwards, the data contributor $o_{i}$ computes the signature $\sigma_{i}$ on $\vec{D}_{i}$ using her secret key $SK_{i}$ :

[TABLE]

where $D_{i}=E(u_{i1})\parallel\ldots\parallel E(u_{i\beta})\parallel E({u_{i1}}^{2})\parallel\ldots\parallel E({u_{i\beta}}^{2}),$ and “ $\parallel$ ” is a message concatenation operation.

By the end of a time period, $n$ distinct data contributors submit their tuples $\{\langle\emph{PID}_{i},\vec{D}_{i},\sigma_{i}\rangle|i\in[1,n]\}$ to the service provider, on which the first-layer batch verification can be conducted using Equation (6).

Data Processing: To facilitate generating a personalized friending strategy, the data consumer also needs to provide her encrypted profile vector $\vec{D}_{0}$ and a threshold $\delta$ , where

[TABLE]

Now, the service provider can directly do matching on the encrypted profiles. For brevity in expression, we assume that $o_{i}$ is one of the $m$ valid data contributors, i.e., $i\in\mathbb{C}$ . Besides, to obliviously evaluate the similarity $f(\vec{U}_{i},\vec{V})$ , the service provider first preprocesses $\vec{D}_{i}$ and $\vec{D}_{0}$ by adding $E(1)$ to the first and the last places of two vectors, respectively, and obtains new vectors $\vec{C}_{i}=(C_{ij}^{1},C_{ij}^{2},C_{ij}^{3})|_{j\in[1,\beta]}$ and $\vec{C}_{0}=(C_{0j}^{1},C_{0j}^{2},C_{0j}^{3})|_{j\in[1,\beta]}$ , where

[TABLE]

After preprocessing, the service provider can compute the “dot product” of Equation (21) and Equation (22), by first applying homomorphic multiplication $\otimes$ and then homomorphic addition $\oplus$ , and gets $R_{ij}$ , where

[TABLE]

Next, the service provider applies homomorphic additions $\oplus$ to $R_{ij}$ with $\forall j\in[1,\beta]$ , and gets

[TABLE]

We note that $R_{i}$ is actually an encryption of $f(\vec{U}_{i},\vec{V})^{2}$ , which indicates the similarity between the data contributor $o_{i}$ and the data consumer.
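The reason one multiplication per term followed by additions is enough can be checked on plaintexts. The sketch below mirrors the preprocessing and "dot product" steps without any encryption: every product corresponds to one homomorphic multiplication $\otimes$ and every sum to a homomorphic addition $\oplus$. The ordering of the entries inside each triple is an assumption chosen so that the terms pair up as $1\cdot v_{j}^{2}$, $u_{ij}\cdot(-2v_{j})$, and $u_{ij}^{2}\cdot 1$; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[int, int, int]


def preprocess_contributor(u: Sequence[int]) -> List[Triple]:
    """Plaintext view of C_i: (1, u_j, u_j^2) for each attribute j."""
    return [(1, x, x * x) for x in u]


def preprocess_consumer(v: Sequence[int]) -> List[Triple]:
    """Plaintext view of C_0: (v_j^2, -2 v_j, 1) for each attribute j."""
    return [(y * y, -2 * y, 1) for y in v]


def squared_distance(c_i: List[Triple], c_0: List[Triple]) -> int:
    """Sum over j of the per-attribute dot products, i.e. sum of (u_j - v_j)^2."""
    return sum(a * b for t_i, t_0 in zip(c_i, c_0) for a, b in zip(t_i, t_0))


def is_match(u: Sequence[int], v: Sequence[int], delta: int) -> bool:
    """Compare the squared distance with delta^2, as in the squared metric."""
    r_i = squared_distance(preprocess_contributor(u), preprocess_consumer(v))
    return r_i < delta * delta
```

For the profiles `[3, 0, 5]` and `[1, 4, 5]` the result is $4+16+0=20$, so the contributor matches for $\delta=5$ but not for $\delta=4$. Each term needs at most one multiplication, which is exactly what the BGN scheme permits before additions.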

Then, the service provider sends $R_{i}$ to the registration center for decryption. We note that the registration center needs to do just one decryption for each data contributor, i.e., supposing the size of the whitelist on the certificated bulletin board is $m$ , she performs only $m$ decryptions in total. The registration center must not do more decryptions than required; otherwise, the service provider could still obtain a correct and complete matching strategy by revealing the profiles of all the valid data contributors and the data consumer. Such an attack, however, requires at least $(m+1)\beta$ decryptions, which exceeds the limit of $m$ . Furthermore, to speed up BGN decryption in outcome verification, the registration center should retain the decrypted plaintexts in storage for a preset validity period.

When obtaining $f(\vec{U}_{i},\vec{V})^{2}$ , the service provider compares it with $\delta^{2}$ , and thus can determine whether the data contributor $o_{i}$ matches the data consumer. We assume that $\phi$ data contributors are matched, and the subscripts of their pseudo identities are denoted as $\mathbb{I}=\{I_{1},I_{2},\cdots,I_{\phi}\}$ .

After data processing, the service provider aggregates the signatures of $\phi$ matched data contributors into one signature. Then, she sends the aggregate signature, the indexes of $\phi$ matched data contributors, and their encrypted profile vectors to the data consumer, on which the second-layer batch verification can be performed with Equation (10). Besides, to prevent the service provider from changing or re-evaluating the $(m-\phi)$ valid but unmatched data contributors in the completeness verification later, their similarities, i.e., $\{f(\vec{U}_{i},\vec{V})^{2}|i\in\mathbb{C},i\notin\mathbb{I}\}$ , should also be forwarded. We note that the pseudo identities of $\phi$ matched data contributors can be viewed as the friending strategy, i.e., outcome in the general model, since the data consumer can resort to the registration center, as a relay, for handshaking with those matched data contributors.

Outcome Verification: During the validity period preset by the registration center, the data consumer can verify the truthfulness of data processing via homomorphic properties. For correctness, the data consumer just needs to evaluate the $\phi$ matched profiles. Of course, for completeness, the data consumer reserves the right to do verification over the other $(m-\phi)$ unmatched ones. We note that the data consumer, knowing her own profile vector $\vec{V}$ , can compute Equation (5.1.2) more efficiently through

[TABLE]

Thus, the most time-consuming homomorphic multiplications can be avoided in outcome verification. Moreover, we note that the registration center does not need to do decryption as in data processing, since she can just search a smaller-size table of plaintexts in the storage. If no stored plaintext matches, the outcome verification fails, and the service provider will be questioned by the data consumer.
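To illustrate why knowing $\vec{V}$ removes the need for homomorphic multiplications, the sketch below recomputes the squared distance using only homomorphic additions and exponentiation by a known constant. It makes several loud assumptions: a toy Paillier cryptosystem with tiny primes is swapped in for BGN (this path needs only additive homomorphism), $f$ is taken to be the Euclidean distance, and the per-attribute formula $E(u^{2})\oplus E(u)^{-2v}\oplus E(v^{2})$ is one plausible form rather than the paper's exact equation. All function names are hypothetical, and the parameters are not secure.

```python
import math
import random

# Toy Paillier parameters (tiny primes, illustration only, NOT secure).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)  # valid because the generator is g = N + 1 (Python 3.8+)


def encrypt(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    # (1 + N)^m = 1 + m*N (mod N^2), blinded by r^N.
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N


def h_add(c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts.
    return c1 * c2 % N2


def h_scale(c, k):
    # Raising a ciphertext to k multiplies the plaintext by the constant k.
    return pow(c, k % N, N2)


def verify_similarity(enc_u, enc_u_sq, v):
    acc = encrypt(0)
    for c_u, c_u2, v_j in zip(enc_u, enc_u_sq, v):
        term = h_add(c_u2, h_scale(c_u, -2 * v_j))  # E(u^2 - 2*u*v_j)
        term = h_add(term, encrypt(v_j * v_j))      # E(u^2 - 2*u*v_j + v_j^2)
        acc = h_add(acc, term)
    return acc


u = [3, 7, 0, 10]                       # contributor's private profile
v = [5, 7, 2, 6]                        # consumer's own profile (known to her)
enc_u = [encrypt(x) for x in u]         # what the contributor submitted
enc_u_sq = [encrypt(x * x) for x in u]
print(decrypt(verify_similarity(enc_u, enc_u_sq, v)))  # 24
```

Per attribute, this sketch uses three homomorphic additions and one exponentiation, which is in line with the operation counts reported for outcome verification in Section 6.1.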

To further reduce verification cost, the data consumer can take the stratified sampling strategy in practice, e.g., in our evaluation on a real-world ratings dataset, for correctness, she may check all the $\phi$ matched data contributors, accounting for $4.49\%$ of the total 10000 samples, while only checking $0.27\%$ of the unmatched ones for completeness. In particular, regarding completeness verification, the data consumer can randomly choose part of the $(m-\phi)$ valid but unmatched data contributors, and then request the service provider to send her their aggregate signature and encrypted profile vectors for the second-layer batch verification and the outcome verification. Here, we assume that the greedy service provider cheats by skipping the evaluation of each data contributor in the original data processing with a probability $p$ . Then, the probability of successfully detecting an attempt to return an incorrect/incomplete result, $\epsilon$ , approaches 1 exponentially fast in the number of checks $c$ , i.e., $\epsilon=1-(1-p)^{c}$ . For example, when $p=20\%$ and $c=10$ , the success rate $\epsilon$ is already about $89.3\%$ . In fact, a concrete sampling strategy should depend on practical $\phi$ , $m$ , and $p$ .
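The detection formula is easy to evaluate numerically. The following minimal sketch assumes that the $c$ checks are independent, as the formula $\epsilon=1-(1-p)^{c}$ implies; the helper names are hypothetical.

```python
import math


def detection_probability(p, c):
    """epsilon = 1 - (1 - p)^c for cheating rate p and c independent checks."""
    return 1.0 - (1.0 - p) ** c


def checks_needed(p, target):
    """Smallest c whose detection probability reaches target (0 < p, target < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


print(round(detection_probability(0.20, 10), 4))  # 0.8926
print(checks_needed(0.20, 0.90))                  # 11
print(round(detection_probability(0.20, 26), 4))  # 0.997
```

With $p=20\%$ , ten checks fall just short of a $90\%$ detection rate and eleven exceed it, while the 26 checks used in the evaluation push the rate above $99\%$ .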

5.2 Multivariate Gaussian Distribution Fitting

We further consider a different data market, where the service provider captures the underlying probability distribution over the collected dataset, and offers such a distribution as a data service to the data consumer [35, 36]. This data service is called probability distribution fitting. For example, a data analyst, as the data consumer, may want to learn the distribution of residential energy consumptions.

By the central limit theorem, we assume that the multivariate Gaussian distribution can closely approximate the raw data, which is a widely used assumption in statistical learning algorithms [37]. For convenience, we continue to use the notations in profile matching, i.e., the attribute vector $\mathbf{A}$ now represents a vector of $\beta$ random variables. In particular, $\mathbf{A}\sim\mathcal{N}(\vec{\mu},\mathbf{\Sigma})$ , where $\vec{\mu}$ is a $\beta$ -dimensional mean vector, and $\mathbf{\Sigma}$ is a $\beta\times\beta$ covariance matrix. Besides, the covariance matrix can be computed by:

[TABLE]

Here, $\mathbb{E}[\cdot]$ denotes taking expectation. We below focus on the key designs different from profile matching.

For data submission, the ciphertext vector of the data contributor $o_{i}$ is changed into:

[TABLE]

where its first element is to facilitate computing the mean vector $\vec{\mu}$ , while its second element is to help the service provider in evaluating the matrix $\mathbb{E}[\mathbf{A}\mathbf{A}^{T}]$ more efficiently.

For data processing, the service provider first employs homomorphic additions $\oplus$ to obliviously compute the mean vector $\vec{\mu}$ , where the ciphertext of its $j$ -th element multiplied by the number of valid data contributors $m$ , i.e., of $m\mu_{j}$ , is:

[TABLE]

Additionally, to derive the covariance matrix $\mathbf{\Sigma}$ , it suffices for the service provider to get $\mathbb{E}[\mathbf{A}\mathbf{A}^{T}]$ . Here, the service provider can avoid the time-consuming homomorphic multiplications. For example, the $j$ -th row, $k$ -th column entry of $\mathbb{E}[\mathbf{A}\mathbf{A}^{T}]$ , denoted by $\mathbb{E}[\mathbf{A}\mathbf{A}^{T}]_{jk}$ , can be computed through:

[TABLE]

However, supposing that the data contributor $o_{i}$ excluded $\{E(u_{ij}\times u_{ik})|j\in[1,\beta],k\in[j,\beta]\}$ from her ciphertext vector $\vec{D}_{i}$ , the service provider would need to perform $\frac{\beta(\beta+1)}{2}$ expensive homomorphic multiplications for the data contributor $o_{i}$ instead, because $E(u_{ij}\times u_{ik})$ in Equation (29) needs to be derived by means of $E(u_{ij})\otimes E(u_{ik})$ .
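The reason additions suffice can be seen on plaintexts. The sketch below is a hedged illustration, not the paper's implementation: each contributor's vector holds her $\beta$ values followed by the products $u_{ij}u_{ik}$ for $j\leq k$ , the aggregator only sums vectors element-wise, and the covariance is recovered with the standard identity $\mathbf{\Sigma}=\mathbb{E}[\mathbf{A}\mathbf{A}^{T}]-\vec{\mu}\vec{\mu}^{T}$ . Plain numbers stand in for ciphertexts, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def contributor_vector(u):
    """Plaintext stand-in for D_i: beta values, then u_j*u_k for all j <= k."""
    beta = len(u)
    pairs = combinations_with_replacement(range(beta), 2)
    return list(u) + [u[j] * u[k] for j, k in pairs]


def fit_gaussian(profiles):
    """Return (mean, covariance) using only element-wise sums of the vectors."""
    m, beta = len(profiles), len(profiles[0])
    pairs = list(combinations_with_replacement(range(beta), 2))
    sums = [0] * (beta + len(pairs))
    for u in profiles:
        for idx, value in enumerate(contributor_vector(u)):
            sums[idx] += value  # the only aggregation step is addition
    mean = [s / m for s in sums[:beta]]
    cov = [[0.0] * beta for _ in range(beta)]
    for (j, k), s in zip(pairs, sums[beta:]):
        # E[A A^T]_jk minus mu_j * mu_k; the matrix is symmetric.
        cov[j][k] = cov[k][j] = s / m - mean[j] * mean[k]
    return mean, cov


profiles = [[1, 2], [3, 4], [5, 9]]
mean, cov = fit_gaussian(profiles)
print(mean)                                   # [3.0, 5.0]
print(len(contributor_vector(list(range(8)))))  # 44
```

For $\beta=8$ the vector has $\frac{\beta(\beta+3)}{2}=44$ entries, so each contributor pays for the extra encryptions once, and the service provider never multiplies ciphertexts.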

For outcome verification, the data consumer can take the stratified random sampling strategy from two aspects: 1) She can randomly check parts of the mean vector $\vec{\mu}$ and the covariance matrix $\mathbf{\Sigma}$ ; 2) She can reevaluate a random subset of $m$ valid data items, and compare the new distribution with the returned distribution. If their distance is within a threshold, the data consumer would accept this outcome; otherwise, she rejects. We note that in the first case, the valid data contributors may need to re-sign those involved ciphertexts for the second-layer batch verification.

6 Evaluation Results

In this section, we show the evaluation results of TPDM in terms of computation overhead and communication overhead. We also demonstrate the feasibility of the registration center and the $\ell$ -depth-tracing algorithm.

Datasets: We use two real-world datasets, called R1-Yahoo! Music User Ratings of Musical Artists Version 1.0 [38] and 2009 Residential Energy Consumption Survey (RECS) dataset [39], for the profile matching service and the distribution fitting service, respectively.

First, the Yahoo! dataset represents a snapshot of Yahoo! Music community’s preference for various musical artists. It contains 11,557,943 ratings of 98,211 artists given by 1,948,882 anonymous users, and was gathered over the course of one month prior to March 2004. For profile matching, we choose $\beta$ common artists as the attributes, append each user’s corresponding ratings ranging from 0 to 10, and thus form her fine-grained profile. Second, the RECS dataset, which was released by U.S. Energy Information Administration (EIA) in January 2013, provides detailed information about diverse energy usages in U.S. homes. This dataset was collected from 12,083 randomly selected households between July 2009 and December 2012. For distribution fitting, we view $\beta$ types of energy consumptions, e.g., electricity, natural gas, space heating, and water heating, as $\beta$ random variables, and intend to obtain the multivariate Gaussian distribution.

Evaluation Settings: We implemented TPDM using the latest Pairing-Based Cryptography (PBC) library [40]. The elliptic curves utilized in our identity-based signature scheme include a supersingular curve with a base field size of 512 bits and an embedding degree of 2, and an MNT curve with a base field size of 159 bits and an embedding degree of 6. In addition, the group order $q$ is 160 bits long, and all hashings are implemented with SHA-1, since its digest size closely matches the order of $\mathbb{G}_{1}$ . The BGN cryptosystem is realized using Type A1 pairing, in which the group order is a product of two 512-bit primes. The running environment is a standard 64-bit Ubuntu 14.04 Linux operating system on a desktop with an Intel(R) Core(TM) i5 3.10 GHz processor.

Overheads of Key Operations: Table I presents the curve choices along with the computation time of key operations, where SS512 and MNT159 are abbreviated from the settings of the supersingular curve and the MNT curve in the identity-based signature scheme, respectively. $\mathcal{R}(\cdot)$ denotes the number of bits needed to optimally represent a group element. Besides, each reported computation time is the average over 10000 runs.

6.1 Computation Overhead

We show the computation overheads of four important components in TPDM, namely profile matching, distribution fitting, identity-based signature, and batch verification.

Profile Matching: In Fig. 4a, we plot the computation overheads of profile encryption, similarity evaluation, and outcome verification per data contributor, when the number of attributes $\beta$ increases from 5 to 40 with a step of 5. From Fig. 4a, we can see that the computation overheads of these three phases increase linearly with $\beta$ . This is because the profile encryption requires $2\beta$ BGN encryptions, the similarity evaluation consists of $3\beta$ homomorphic multiplications and additions, and the outcome verification is composed of $3\beta$ homomorphic additions and $\beta$ exponentiations, all of which are proportional to $\beta$ . In addition, the outcome verification is lightweight: its overhead per data contributor is only $1.17\%$ of the similarity evaluation’s cost. Moreover, when $\beta=10$ , one decryption overhead at the registration center is 1.648ms in the original data processing, while in outcome verification, it is in tens of microseconds.

We further show the feasibility of the stratified sampling strategy in outcome verification. We analyze the matching ratio based on the Yahoo! Music ratings dataset. Given $\beta=10$ , when a data consumer sets her threshold $\delta=12$ , she is matched with $4.49\%$ of the 10000 data contributors on average, who are selected randomly from the dataset. The relatively small matching ratio means that even if all matched data contributors are verified for correctness, it only incurs an overhead of 4.859s at the data consumer, which is roughly $0.05\%$ of the data processing workload at the service provider. Next, we simulate the partial data processing attack by randomly corrupting $20\%$ of unmatched data contributors, i.e., replacing their similarities with random values. Then, the data consumer can detect this type of attack using 26 random checks on average for completeness, which incurs an additional overhead of 0.281s.

Distribution Fitting: Fig. 4b plots the computation overhead of the distribution fitting service, where the number of random variables $\beta$ increases from 1 to 8, and the number of valid data contributors $m$ is fixed at 10000. Besides, for outcome verification, the data consumer checks all the elements in the mean vector, but checks only the diagonal elements in the covariance matrix. From Fig. 4b, we can see that the computation overheads of the first two phases increase quadratically with $\beta$ , whereas the computation overhead of the last phase increases linearly with $\beta$ . The reason is that the data encryption phase consists of $\frac{\beta(\beta+3)}{2}$ BGN encryptions for each data contributor, and the distribution evaluation phase mainly comprises $\frac{m\beta(\beta+3)}{2}$ homomorphic additions. In contrast, the outcome verification phase mainly requires $2m\beta$ homomorphic additions. Furthermore, when $\beta=8$ , these three phases consume 0.402s, 140.395s, and 51.200s, respectively.
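As a rough sanity check on these operation counts, the expressions can be evaluated at the reported setting $\beta=8$ and $m=10000$ . This is only arithmetic on the figures quoted above, under the assumption that homomorphic additions dominate both the distribution evaluation and the outcome verification phases:

$$\frac{\beta(\beta+3)}{2}=\frac{8\times 11}{2}=44,\qquad \frac{m\beta(\beta+3)}{2}=440000,\qquad 2m\beta=160000.$$

The ratio of addition counts, $440000/160000=2.75$ , is close to the ratio of the reported running times, $140.395/51.200\approx 2.74$ .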

Taken together, the above evaluation results show that TPDM performs well in both kinds of data markets, which supports its generality.

Identity-Based Signature: We now investigate the computation overhead of the identity-based signature scheme, including preparation and operation phases. In this set of simulations, we set the number of data contributors to be $10000$ . Table II lists the average time overhead per data contributor. From Table II, we can see that the time cost of the preparation phase dominates the total overhead in both SS512 and MNT159. This is because the pseudo identity generation employs ElGamal encryption, and the secret key generation is composed of one MapToPoint hash operation and two exponentiations. In contrast, the operation phase mainly consists of one exponentiation.

The above results demonstrate that the identity-based signature scheme in TPDM is efficient enough, and can be applied to the data contributors with mobile devices.

Batch Verification: To examine the efficiency of batch verification, we increase the number of data contributors exponentially from 1 to 1 million. The performance of the corresponding single signature verification is provided as a baseline. Fig. 4c depicts the evaluation results using SS512 and MNT159, where verification time per signature (abbreviated as VTPS) is computed by dividing the total verification time by the number of data contributors. In particular, such a performance measure in an average sense can be found in [41]. From Fig. 4c, we can see that when the scale of data acquisition or data trading is small, e.g., when the number of data contributors is 10, TPDM saves $48.22\%$ and $87.94\%$ of VTPS in SS512 and MNT159, respectively. When the scale becomes larger, TPDM’s advantage over the baseline is more remarkable. This is owing to the fact that TPDM amortizes the overhead of 3 time-consuming pairing operations among all the data contributors.

We now compare the batch verification efficiency of two settings. Although the baseline of MNT159 has $41.44\%$ higher verification latency than that of SS512, MNT159’s implementation is more efficient when the number of data contributors is larger than 10, e.g., when supporting as many as 1 million data contributors, MNT159 reduces VTPS by $89.93\%$ compared with SS512. We explain the reason by analyzing the asymptotic value of VTPS:

[TABLE]

Here, we let $T_{par}$ , $T_{mtp}$ , and $T_{exp}$ denote the time overheads of a pairing operation, a MapToPoint hashing, and an exponentiation in Table I, respectively. From Equation (30), we can draw that if the time overheads of additional operations, e.g., $T_{mtp}$ and $T_{exp}$ , are approaching or even greater than that of pairing operation (e.g., in SS512), their effect cannot be neglected. Besides, the expensive additional operations will cancel part of the advantage gained by batch verification. Even so, the batch verification scheme can still sharply reduce per-signature verification cost.
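The amortization effect can be sketched with a simple cost model. The code below is an illustration under stated assumptions, not a reproduction of Equation (30): it assumes that a batch costs 3 pairings in total plus a fixed per-signature cost for hashing and exponentiation, and both timing constants are hypothetical placeholders rather than values from Table I.

```python
def batch_vtps(n, t_par, t_per_sig, pairings=3):
    """Amortized verification time per signature for a batch of n signatures."""
    return pairings * t_par / n + t_per_sig


T_PAR = 4.0      # hypothetical pairing time in ms (not taken from Table I)
T_PER_SIG = 0.5  # hypothetical per-signature hashing/exponentiation time in ms

for n in (1, 10, 1000, 10 ** 6):
    print(n, batch_vtps(n, T_PAR, T_PER_SIG))
```

As $n$ grows, the pairing term vanishes and VTPS approaches the per-signature cost, so a curve with cheap pairings but expensive hashing and exponentiation gains less from batching.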

These evaluation results reveal that TPDM can indeed help to reduce the computation overheads of the service provider and the data consumer by introducing two-layer batch verifications, especially in large-scale data markets.

6.2 Communication Overhead

In this section, we show the communication overheads of the profile matching service and the distribution fitting service separately.

Fig. 7 plots the communication overhead of profile matching, where the identity-based signature scheme is implemented in MNT159, the number of attributes $\beta$ is fixed at 10, and the threshold $\delta$ is set to $12$ . Here, the communication overhead counts only the amount of data sent. Besides, we only consider the correctness verification. In fact, when the number of valid data contributors $m$ is $10^{4}$ , if we check 26 unmatched ones for completeness, it incurs additional communication overheads of 80.03KB at the service provider, and 3.35KB at the data consumer. Moreover, our statistics on the dataset show a linear correlation between the numbers of matched data contributors $\phi$ and valid ones $m$ , where the matching ratio is $4.24\%$ on average.

The first observation from Fig. 7 is that the communication overheads of the service provider and the data consumer grow linearly with the number of valid data contributors $m$ , while the communication overhead of each data contributor remains unchanged. The reason is that each data contributor just needs to do one profile submission, and thus its cost is independent of $m$ . However, the service provider primarily needs to send $m$ encrypted similarities for decryption, and to forward the indexes and the ciphertexts of $\phi$ matched data contributors for verifications. Regarding the data consumer, her communication overhead mainly comes from one data submission and the delivery of $\phi$ encrypted similarities for decryption. These imply that the communication overheads of the service provider and the data consumer are both linear in $m$ . Here, we note that both axes in Fig. 7 are log-scaled, and thus the communication overhead of the data consumer, containing a constant of one data submission overhead, seems non-linear. In particular, when $m\leq 100$ , one data submission overhead dominates the total communication overhead, and this interval looks like a horizontal line; whereas when $m\geq 1000$ , the communication overhead of delivering $\phi$ encrypted similarities dominates, and that interval appears linear.

The second key observation from Fig. 7 is that when the number of valid data contributors $m=10$ , all the three system participants spend roughly the same network bandwidth. The cause lies in that the small matching ratio implies a small number of matched data contributors involved in the correctness verification. Specifically, when $m=10$ , the average number of matched data contributors is only about $0.4<1$ , and the communication overheads of each data contributor, the service provider, and the data consumer are 2.60KB, 2.37KB, and 2.59KB, respectively.

We plot the communication overhead of multivariate Gaussian distribution fitting in Fig. 7, where the number of random variables $\beta$ is set to be 8. From Fig. 7, we can see that the communication overhead of the service provider increases linearly with the number of valid data contributors $m$ . This is because the service provider mainly needs to send $2\beta m$ BGN-type ciphertexts for verifications, which is linear in $m$ . By comparison, like that of each data contributor, the data consumer’s bandwidth overhead stays the same, since she needs to deliver $2\beta$ BGN-type ciphertexts for decryption, which is independent of $m$ .

We finally note that the transmission of BGN-type ciphertexts dominates the total communication overheads in both data services, while the network overhead incurred by sending the pseudo identities and the aggregate signature is comparatively low. Therefore, we do not plot the cases for SS512, which are similar to the MNT159 plots for the two data services. In particular, compared with MNT159, SS512 adds 132 bytes and 176 bytes at each data contributor in profile matching and distribution fitting, respectively. Besides, SS512 adds 44 bytes at the service provider in two data services, but incurs no extra bandwidth overhead at the data consumer.

6.3 Feasibility of Registration Center

In this section, we consider the feasibility of the registration center from the perspectives of computation, communication, and storage overheads. We implement the identity-based signature scheme with MNT159. In addition, for profile matching, the number of attributes is fixed at 10, and the number of valid data contributors $m$ is set to be 10000. Accordingly, the number of matched ones $\phi$ is 449 at $\delta=12$ . For distribution fitting, we fix the number of random variables $\beta$ at 8, and set the number of valid data contributors to be 10000.

First, the primary responsibility of the registration center is to initialize the system parameters for the identity-based signature scheme and the BGN cryptosystem. Besides, she is required to perform $(m+\phi)$ and $\frac{(\beta+7)\beta}{2}$ decryptions in total in the profile matching service and the distribution fitting service, respectively. The total computation overheads are 16.692s and 3.065s in two data services, respectively, which are only $0.18\%$ and $2.11\%$ of the service provider’s overall workloads. Furthermore, the one-time setup overhead can be amortized over several data services. Second, the main communication overheads of the registration center in two data services are incurred by returning decrypted results, which occupy the network bandwidth of 15.31KB and 0.23KB, respectively. Third, the storage overhead of the registration center mostly comes from maintaining the online database of registrations and the real-time certificated bulletin board, and caching the intermediate plaintexts. Together, these take up roughly 600.59KB and 586.11KB storage space in the profile matching service and the distribution fitting service, respectively.
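One way to read the decryption count for distribution fitting (an interpretation, not stated explicitly in the text) is as the $\frac{\beta(\beta+3)}{2}$ aggregated ciphertexts from data processing plus the $2\beta$ ciphertexts that the data consumer submits during outcome verification:

$$\frac{\beta(\beta+3)}{2}+2\beta=\frac{\beta(\beta+7)}{2}.$$

With the settings above, this gives $\frac{8\times 15}{2}=60$ decryptions for distribution fitting, against $m+\phi=10000+449=10449$ for profile matching.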

In conclusion, our design of registration center has a light load, and can be implemented in a distributed manner, where each registration center can be responsible for one or a few data services. In particular, consistent hashing [42] can be employed to facilitate the information synchronization among multiple registration centers, e.g., guaranteeing a certain number of decryptions for each data service. Besides, using the standard techniques from [43], the original partially homomorphic cryptosystems can be extended to their threshold multi-authority versions, which implies the improved robustness of TPDM by distributing several registration centers in data markets.

6.4 Feasibility of Tracing Algorithm

To evaluate the feasibility of $\ell$ -depth-tracing algorithm when the batch verification fails, we generate a collection of 1024 valid signatures, and then randomly corrupt an $\alpha$ -fraction of the batch by replacing them with random elements from the cyclic group $\mathbb{G}_{1}$ . We repeat this evaluation with various values of $\alpha$ ranging from 0 to $20\%$ , and compare verification time per signature (VTPS) in batch verification with that in single signature verification. Here, the overall batch verification latency includes the time cost spent in identifying invalid signatures. Fig. 7 presents the evaluation results using the efficient MNT159.

As shown in Fig. 7, batch verification is preferable to single signature verification when the ratio of invalid signatures is up to $16\%$ . The worst case of batch verification happens when the invalid signatures are distributed uniformly. In case the invalid signatures are clustered together, the performance of batch verification should be better. Furthermore, as shown in the initialization phase of Algorithm 1, the service provider can preset a practical tracing depth, and let those unidentified data contributors do resubmissions.
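The idea of depth-limited tracing can be sketched as a binary split over a failed batch. The code below is a generic divide-and-conquer illustration, not a transcription of Algorithm 1: the function name, the `(id, is_valid)` record layout, and the mock verifier `batch_ok` are all assumptions made for the example, with `batch_ok` standing in for the real batch verification equation.

```python
def trace_invalid(items, batch_ok, max_depth):
    """Split a failed batch recursively, down to at most max_depth levels.

    Returns (invalid, unresolved): records proven invalid, and records left
    in failing sub-batches that were not split further (to be resubmitted).
    """
    invalid, unresolved = [], []

    def recurse(batch, depth):
        if not batch or batch_ok(batch):
            return                      # whole sub-batch verifies: nothing to do
        if len(batch) == 1:
            invalid.append(batch[0])    # a failing singleton is pinpointed
            return
        if depth >= max_depth:
            unresolved.extend(batch)    # stop splitting, ask for resubmission
            return
        mid = len(batch) // 2
        recurse(batch[:mid], depth + 1)
        recurse(batch[mid:], depth + 1)

    recurse(list(items), 0)
    return invalid, unresolved


# Mock data: 16 signature records, two of them corrupted.
records = [(i, i not in (3, 12)) for i in range(16)]
batch_ok = lambda batch: all(valid for _, valid in batch)

bad, pending = trace_invalid(records, batch_ok, max_depth=4)
print([i for i, _ in bad], [i for i, _ in pending])      # [3, 12] []

bad, pending = trace_invalid(records, batch_ok, max_depth=2)
print([i for i, _ in bad], [i for i, _ in pending])
```

With a depth of 4, both corrupted records are isolated; with a depth of 2, tracing stops early and the two failing groups of four are returned for resubmission. Clustered invalid signatures fail fewer sub-batches, which is why they cost less to trace than uniformly spread ones.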

7 Related Work

In this section, we briefly review related work.

7.1 Data Market Design

In recent years, data market design has gained increasing interest from the database community. The seminal paper [10] by Balazinska et al. discusses the implications of the emerging digital data markets, and lists the research opportunities in this direction. Koutris et al. [44] presented a flexible data trading format, i.e., query-based data pricing. Later, Lin and Kifer [45] designed an arbitrage-free pricing function for arbitrary query formats. For personal data sharing, Li et al. [46] proposed a theory of pricing private data based on differential privacy. Upadhyaya et al. [11] developed a middleware system, called DataLawyer, to formally specify data use policies, and to automatically enforce these pre-defined terms during data usage. Jung et al. [47] focused on the dataset resale issue at the dishonest data consumers.

However, the original intention of above works is pricing data or monitoring data usage rather than integrating data truthfulness with privacy preservation in data markets, which is the consideration of this work. (Footnote 1: The early version of this paper [48] mainly focused on the profile matching service.)

7.2 Signcryption

Data authentication and data confidentiality are two basic requirements in secure communication. To efficiently guarantee these two properties simultaneously, Zheng et al. [49] first introduced the terminology signcryption, which integrates digital signature with public key encryption. To reduce communication overhead, a number of identity-based signcryption schemes [50, 51] were proposed. In most signcryption schemes (e.g., [50, 49]), a third party requires the knowledge of plaintext to verify the message’s origin. In contrast to these works, Encrypt-then-Sign paradigm in [23] and the identity-based signcryption scheme in [51] support public ciphertext authenticity, which convinces a third party (e.g., the data consumer in our model) of data sources’ reliability without revealing the content of raw data.

Unfortunately, when existing signcryption schemes are directly applied to data markets, they only provide the truthfulness of data collection, but fail to support outcome verification. Besides, these schemes do not facilitate identity preservation and batch verification [52], which are necessary in practical data collection and data trading environments.

7.3 Practical Computation on Encrypted Data

To get a tradeoff between functionality and performance, partially homomorphic encryption (PHE) schemes were exploited to enable practical computation on encrypted data. Unlike those prohibitively slow fully homomorphic encryption (FHE) schemes [53, 54] that support arbitrary operations, PHE schemes focus on specific function(s), and achieve better performance in practice. A celebrated example is the Paillier cryptosystem [30], which preserves the group homomorphism of addition and allows multiplication by a constant. Thus, it can be utilized in data aggregation [20] and interactive personalized recommendation [25, 33]. Another example is ElGamal encryption [24], which supports homomorphic multiplication and is widely employed in voting [55]. Moreover, the BGN scheme [32] facilitates one extra multiplication followed by multiple additions, which in turn allows the oblivious evaluation of quadratic multivariate polynomials, e.g., shortest distance query [31] and optimal meeting location decision [56].

These schemes enable the service provider and the data consumer to efficiently perform data processing and outcome verification over encrypted data, respectively. Thus, they can remedy the potential defects in the conventional signcryption schemes. Additionally, we note that the outcome verification in data markets differs from the verifiable computation in outsourcing scenarios [57], since before data processing, the data consumer, as a client, does not hold a local copy of the collected dataset.

8 Conclusion and Future Work

In this paper, we have proposed the first efficient secure scheme TPDM for data markets, which simultaneously guarantees data truthfulness and privacy preservation. In TPDM, the data contributors have to truthfully submit their own data, but cannot impersonate others. Besides, the service provider is enforced to truthfully collect and process data. Furthermore, both the personally identifiable information and the sensitive raw data of data contributors are well protected. In addition, we have instantiated TPDM with two different data services, and extensively evaluated their performance on two real-world datasets. Evaluation results have demonstrated the scalability of TPDM in the context of a large user base, in terms of both computation and communication overheads. Finally, we have shown the feasibility of introducing the semi-honest registration center with detailed theoretical analysis and substantial evaluations.

As for further work in data markets, it would be interesting to consider diverse data services with more complex mathematical formulas, e.g., Machine Learning as a Service (MLaaS) [58, 59, 60, 61, 29]. For a specific data service, it is well-motivated to uncover some novel security problems, such as privacy preservation and verifiability.

Acknowledgments

This work was supported in part by the State Key Development Program for Basic Research of China (973 project 2014CB340303), in part by China NSF grant 61672348, 61672353, 61422208, and 61472252, in part by Shanghai Science and Technology fund (15220721300, 17510740200), in part by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, and in part by the CCF Tencent Open Research Fund (RAGR20170114). The opinions, findings, conclusions, and recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies or the government. F. Wu is the corresponding author.

Bibliography


[1] “Gnip,” https://gnip.com/.
[2] “DataSift,” http://datasift.com/.
[3] “Datacoup,” https://datacoup.com/.
[4] “Citizenme,” https://www.citizenme.com/.
[5] E. Ramirez, J. Brill, M. K. Ohlhausen, J. D. Wright, and T. McSweeny, “Data brokers: A call for transparency and accountability,” Federal Trade Commission, Tech. Rep., May 2014. [Online]. Available: https://www.ftc.gov/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014
[6] “Gallup Poll,” http://www.gallup.com/.
[7] M. Barbaro, T. Zeller, and S. Hansell, “A face is exposed for AOL searcher no. 4417749,” New York Times, Aug. 2006.
[8] “2016 TRUSTe/NCSA Consumer Privacy Infographic - US Edition,” https://www.truste.com/resources/privacy-research/ncsa-consumer-privacy-index-us/.