Probabilistic Top-k Dominating Query Monitoring over Multiple Uncertain IoT Data Streams in Edge Computing Environments

Chuan-Chi Lai; Tien-Chun Wang; Chuan-Ming Liu; Li-Chun Wang

arXiv:1906.00219·cs.DC·January 27, 2026

TL;DR

This paper presents a parallel probabilistic top-k dominating query method for uncertain IoT data streams in edge computing, significantly improving processing speed, reducing communication costs, and maintaining high accuracy.

Contribution

It introduces a novel parallel approach for top-k dominating queries over uncertain IoT streams, optimizing for speed, cost, and accuracy in edge environments.

Findings

01 Improves computation time by nearly 60%

02 Reduces communication cost by about 20%

03 Maintains high accuracy in most scenarios

Tables

Table 1. TABLE I: Comparisons of Related Works and the Proposed Method

	Characteristics
Methods	Data Type	Continuous Query	Distributed Computing	Real-time
BIG [15]	Certain	×	×	×
TDTS [16]	Certain	×	×	×
SFA [17]	Certain	×	✓	×
MRBIG [18]	Certain	×	✓	×
TPTD [19]	Uncertain	×	×	×
PTOPK [20]	Uncertain	×	×	×
PEA [21]	Uncertain	✓	×	✓
PTDMUS	Uncertain	✓	✓	✓

Table 2. TABLE II: An example of a two-dimensional uncertain data set

Object	Instance	Object	Instance
$u_{1}$	$u_{1}^{1} [0.4, 28, 7]$	$u_{2}$	$u_{2}^{1} [0.6, 21, 16]$
	$u_{1}^{2} [0.3, 31, 11]$		$u_{2}^{2} [0.1, 17, 21]$
	$u_{1}^{3} [0.3, 35, 8]$		$u_{2}^{3} [0.3, 15, 17]$
$u_{3}$	$u_{3}^{1} [0.7, 72, 33]$	$u_{4}$	$u_{4}^{1} [0.8, 48, 19]$
	$u_{3}^{2} [0.2, 67, 30]$		$u_{4}^{2} [0.1, 43, 23]$
	$u_{3}^{3} [0.1, 64, 35]$		$u_{4}^{3} [0.1, 52, 26]$

Table 3. TABLE III: Simulation Parameters

Parameter	Default Value	Range (type)
Number of data objects, $\| U \|$	$10000$	-
Number of instances, $n$	$5$	-
Dimension, $d$	$9$	$3, 5, 7, 9$
Space of an attribute	$[0, 2000]$	-
Number of monitor nodes, $m$	$10$	$4, 6, 8, 10$
Size of a local sliding window, $\| S W_{j} \|$	$960$	$240, 480, 720, 960$
Size of the global sliding window, $\| S W_{H} \|$	$9600$	$m \times \| S W_{j} \|$
Degree of R-tree, $R_{d e g r e e}$	$6$	-
Threshold, $δ$	$30$	$10, 20, 30, 40, 50$
Margin of Uncertainty, $M$	$160$	$80, 160, 240, 320$
Distribution	Uniform	-
$k$	$100$	$50, 100, 150, 200$

Equations

$$Pr[u_{i}\prec u_{j}]=\sum_{a=1}^{n}\Big(Pr(u_{i}^{a})\times\sum_{\forall u_{j}^{b}\in u_{j},\,u_{i}^{a}\prec u_{j}^{b}}Pr(u_{j}^{b})\Big).$$

$$dom(u_{i}^{a})=\sum_{u_{j}\in U,\,i\neq j}\left\{Pr(u_{i}^{a})\times Pr(u_{j}^{b})\mid u_{i}^{a}\prec u_{j}^{b}\right\}.$$

$$dom(u_{i})=\sum_{a=1}^{n}dom(u_{i}^{a}).$$

$$dom(u_{1})=\cdots$$

$$dom(u_{2})=\cdots$$

$$r\text{-}dom(u_{i}^{a})=\sum_{u_{j}^{b}\prec u_{i}^{a},\,u_{j}\in SW-\{u_{i}\}}Pr(u_{i}^{a})\times Pr(u_{j}^{b}),$$

$$r\text{-}dom(u_{i})=\sum_{a=1}^{n}r\text{-}dom(u_{i}^{a}).$$

$$mct(u)=\min\left(exp_{min},\left\lfloor\frac{dom_{k}-dom(u)}{mn}\right\rfloor+t_{cur}\right),$$

$$T_{average}(N_{j},t=0)=\cdots+\cdots$$

$$T_{construction}(R_{j})=|SW_{j}|\log_{R_{degree}}|SW_{j}|.$$

$$T_{dom(u)}(SW_{j})=\cdots\left[(1-v_{u^{\min}}+\theta_{L})^{d}-(1-v_{u^{\min}}-\theta_{L})^{d}\right],$$

$$T_{dom}(N_{j},t=0)=|SW_{j}|\times T_{dom(u)}(SW_{j}).$$

$$T_{r\text{-}dom(u)}(SW_{j})=\cdots\left[(1-v_{u^{\max}}-\theta_{L})^{d}-(1-v_{u^{\max}}-2\theta_{L})^{d}\right],$$

$$T_{r\text{-}dom}(N_{j},t=0)=|SW_{j}|\times T_{r\text{-}dom(u)}(SW_{j}).$$

$$T_{update}(R_{j})=\cdots$$

$$T_{extract}(TKS_{j})=2\times|SW_{j}|+|SW_{j}|\log_{2}|SW_{j}|.$$

$$T_{average}(N_{j},t=0)=\cdots+\cdots$$

$$\Delta t=\cdots$$

$$T_{average}(N_{j},t>0)=\cdots+\cdots$$

$$T_{average}(N_{j})=\cdots+\cdots$$

$$PKsky(u)=\cdots$$

$$SP_{average}(CS)=\cdots$$

Full text

Probabilistic Top-k Dominating Query Monitoring over Multiple Uncertain IoT Data Streams in Edge Computing Environments

Chuan-Chi Lai, Tien-Chun Wang, Chuan-Ming Liu, and Li-Chun Wang

C.-C. Lai and L.-C. Wang are with the Department of Electrical and Computer Engineering, National Chiao Tung University, 300 Hsinchu, Taiwan. T.-C. Wang is with the Department of Mobile Device Development, Compal Electronics, Inc., Taiwan. C.-M. Liu is with the Department of Computer Science and Information Engineering, National Taipei University of Technology, 10618 Taipei, Taiwan.

Abstract

Extracting valuable features and information from Big Data has become an important research issue in Data Science. In most Internet of Things (IoT) applications, the collected data are uncertain and imprecise because of sensor device variations or transmission errors. In addition, the sensed data may change over time. We refer to a dataset that simultaneously has the velocity, veracity, and volume properties as an uncertain data stream. This paper exploits the parallelism of edge computing environments to facilitate top-k dominating query processing over multiple uncertain IoT data streams. The challenges of this problem include how to update the result quickly while handling uncertainty, how to reduce the computation cost, and how to provide highly accurate results. Building on existing work for certain data, we provide an effective probabilistic top-k dominating query process on uncertain data streams that can be parallelized easily. After discussing the properties of the proposed approach, we validate our methods through complexity analysis and extensive simulated experiments. In comparison with existing works, the experimental results indicate that our method reduces computation time by almost 60%, reduces the communication cost between servers by nearly 20%, and provides highly accurate results in most scenarios.

Index Terms:

Big Data, Internet of Things, Uncertain Data, Multiple Data Streams, Top-k Dominating.

I Introduction

Big data analysis has been widely applied in many fields in recent years. The well-known characteristics of big data are the following Vs: Volume, Velocity, Variety, Veracity, Variability, and Value. Many modern applications and services need to deal with big data from multiple sources, which can be viewed as a computing model over multiple uncertain data streams. For example, applications such as Massive Internet of Things (Massive IoT) [1], Smart City [2], and Location-Based Service (LBS) [3] can be regarded as implementations of a distributed/parallel sensing data processing model with multiple uncertain input data streams. These applications exhibit at least three characteristics of big data: volume, velocity, and veracity. The volume of information grows continuously, so an efficient parallel or distributed computing approach is required. In massive IoT environments, real-time monitoring is a typical application for detecting events that need to be avoided or alleviated. In this case, the users/operators care only about the latest results for most queries, and thus the information is time-limited (the velocity feature). Because the data retrieval process is unreliable, many data are inaccurate or uncertain, and probabilities are used to represent the distribution of the different possible situations. Hence, a massive IoT application has to process uncertain data effectively to provide near real-time results with high precision (the veracity feature).

Although big data can be handled by the Cloud Computing model, the response time (or latency) still cannot meet the requirements of some near real-time IoT monitoring applications. Edge Computing [4, 5] has thus become a promising architecture for improving the response time of IoT applications in recent years. Most researchers focus on developing new techniques for edge computing from the system design, communications, networking, and resource management [6, 7, 8] aspects. However, developing effective techniques to process IoT data efficiently from the data science/engineering aspect is also important and helpful to IoT applications. Many researchers have therefore proposed algorithms for different types of queries (demands) to find insightful knowledge in big data and make precise decisions. Skyline [9, 10, 11] and Top- $k$ [12, 13, 14] queries are common research topics. However, skyline and top- $k$ queries lead to some discrepancies in the search results, and neither query alone can satisfy the demands of some modern applications. Therefore, a new query on certain data, Top- $k$ Dominating [15, 16, 17, 18], which combines the features of the two, has emerged.

In general, an uncertain data object is modeled with multiple probabilities, which represent the probabilities of the object’s occurrences or errors in applications such as IoT data analysis. Such a data model makes query processing much more complicated. Some works [19, 20, 21] have discussed Probabilistic Top- $k$ Dominating (PTD) query processing on uncertain data streams. Traditionally, to handle probabilistic top- $k$ dominating queries, the system computes the dominant scores between different data objects and finds the $k$ objects with the highest dominant scores. Such a straightforward process needs $O((n|U|)^{2})$ computation time, where $n$ is the number of instances in an object and $U$ is the input data set. As the amount of data increases dramatically, the system needs solutions that effectively reduce the computational complexity. In traditional centralized systems, the R-tree [22] is one of the most popular indexing structures for improving query processing performance. Owing to the spatial characteristics of the R-tree, the system can obtain a large performance improvement on the following operations: object search, value comparison, and pruning. However, centralized data structures and algorithms can no longer handle the big data brought by the IoT era. Therefore, it is reasonable to improve efficiency using modern parallel and distributed computation. In this paper, we propose a Probabilistic Top- $k$ Dominating query process over Multiple Uncertain data Streams (PTDMUS) algorithm to improve the efficiency of searching for the $k$ data objects that have the highest dominant scores in distributed real-time IoT monitoring applications. The contributions of this work are listed as follows.

• We provide a parallel processing model that uses R-trees, the $k$ -skyband [13], and a threshold to effectively preclude irrelevant objects in advance, thereby significantly reducing the computational overhead. This idea can also be used to solve other similar types of queries.

• We propose an estimation theorem for distributed computing environments that effectively predicts the time at which a data object has a chance to enter the final result, and consequently decreases the frequency of dominance checks on the edge computing nodes.

• We present a theoretical analysis of the time complexity, space complexity, and transmission cost of PTDMUS in the average and worst cases.

• The simulation results indicate that PTDMUS reduces computation time by nearly 60% and transmission cost by about 20% compared with the conventional method, while keeping the precision and recall of the final query result near 100% in most scenarios.

The rest of the paper is organized as follows. Related research is reviewed in Section II. Section III presents the definitions, notations, and problem statement of this work. Section IV discusses the proposed solutions in detail with algorithms and running examples. Theoretical analysis and discussion are given in Section V. Simulation results are presented in Section VI. Finally, we give concluding remarks in Section VII.

II Related Work

Many researchers have discussed range, skyline, and top- $k$ queries over uncertain data in distributed computing environments. However, these query types cannot satisfy the demands of some modern applications. Hence, we focus on a more complex query, the top- $k$ dominating query, in this work. In the remainder of this section, we introduce the related works on top- $k$ dominating query processing from the data science aspect. The comparison of existing works is summarized in Table I, and each work is described below.

Miao et al. [15] proposed a Bitmap Indexing Guided (BIG) algorithm for improving the performance of top- $k$ dominating query processing on large incomplete datasets. Han et al. [16] provided a table-scan-based method with presorted results for improving the efficiency of top- $k$ dominating query computation on massive data in the batch computing model. Amagata et al. [17] mapped multiple input datasets into a data space and then proposed a method that generates virtual points for effectively precluding unnecessary data objects in the data space. Ezatpoor et al. [18] applied the BIG algorithm [15] to the MapReduce framework, providing a parallel computing model that enhances the performance of top- $k$ dominating query processing on large incomplete datasets. However, only [17] and [18] proposed algorithms for distributed computing environments. Furthermore, the above approaches for certain data do not support continuous query processing in real-time IoT monitoring applications.

For uncertain data, only a few studies [21, 20, 19] have explored top- $k$ dominating query processing so far. Zhang et al. [19] proposed a threshold-based algorithm that prunes irrelevant objects and thus improves the computational performance of the top- $k$ dominating query. Zhan et al. [20] developed new pruning techniques that use spatial indexing and statistical information while considering the maximum/upper and minimum/lower bounds of probabilistic dominance, which reduces computational and I/O costs. Li et al. [21] proposed a method that postpones unnecessary calculations when the query results do not change dramatically over a certain period of time, thereby reducing the computational cost. However, these works did not consider how to process continuous queries over uncertain data in parallel for real-time IoT monitoring applications in edge computing environments.

In summary, to the best of our knowledge, no existing work simultaneously considers the following characteristics: uncertain data, continuous probabilistic top- $k$ dominating queries, distributed computing, and the real-time requirement of IoT monitoring. This shows that probabilistic top- $k$ dominating query processing over uncertain data for edge-enabled real-time IoT monitoring applications remains a major challenge.

III Preliminaries

In this section, we introduce the fundamental assumptions, the system model, and the problem statement.

III-A Fundamental Assumptions

Three kinds of uncertain data models have been proposed and discussed in [23]: fuzzy, evidence-oriented, and probabilistic models. In this work, we adopt the probabilistic model in the discrete case, and an uncertain data object is defined in Definition 1.

Definition 1 (Uncertain Data Objects).

Given a $d$ -dimensional uncertain data set $U$ , each uncertain data object $u\in U$ with $n$ instances is a probability distribution over the $d$ -dimensional space. Each instance $u^{a}$ of $u$ has $d$ attributes, $u^{a}[1],u^{a}[2],\cdots,u^{a}[d]$ , where $a=1,\dots,n$ , and is associated with a probability $Pr(u^{a})$ , where $Pr(u)=\sum_{a=1}^{n}Pr(u^{a})=1$ .

A simple example of a two-dimensional uncertain data set is presented in Table II, in which each uncertain data object has three possible instances. For example, object $u_{1}$ has three instances $u_{1}^{1}$ , $u_{1}^{2}$ , and $u_{1}^{3}$ with probabilities $0.4$ , $0.3$ , and $0.3$ , respectively. This means that $u_{1}$ may occur in three possible cases with different corresponding probabilities, and the total probability of all cases is 1. Note that we use the terms attribute and dimension interchangeably.

If we map the instances of the uncertain data objects onto a $d$ -dimensional space, each uncertain data object $u$ can be represented by a minimum bounding rectangle, MBR( $u$ ), which is the minimum rectangle containing all the instances of $u$ in the space. Let $u^{\max}$ and $u^{\min}$ respectively denote the maximum and minimum corners of $u$ , where $u^{\max}[\alpha]=\max_{1\leq a\leq n}{u^{a}[\alpha]}$ and $u^{\min}[\alpha]=\min_{1\leq a\leq n}{u^{a}[\alpha]}$ for $\alpha=1,\ldots,d$ . Then, MBR( $u$ ) can be represented by $[u^{\min},u^{\max}]$ , where $u^{\min}=(u^{\min}[1],u^{\min}[2],\dots,u^{\min}[d])$ and $u^{\max}=(u^{\max}[1],u^{\max}[2],\dots,u^{\max}[d])$ . Note that $u^{a}[0]$ is the probability value $Pr(u^{a})$ of instance $u^{a}$ . In the example of Table II, $u_{1}^{\max}[1]=35$ , $u_{1}^{\min}[1]=28$ , $u_{1}^{\max}[2]=11$ , and $u_{1}^{\min}[2]=7$ , so MBR( $u_{1}$ ) is $[u_{1}^{\min},u_{1}^{\max}]=[(28,7),(35,11)]$ . Fig. 1 shows the MBR of each data object, indexed by an R-tree, for the example in Table II. The four uncertain data objects $u_{1}$ , $u_{2}$ , $u_{3}$ , and $u_{4}$ , each with three instances, are shown on a 2D plane with their associated MBRs. Note that we use the bulk loading algorithm [24] to construct the R-trees [22] in our work, since it uses space efficiently and avoids overlap between MBRs, thus improving the query time.
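As a minimal sketch of the MBR construction just described, the following Python snippet computes $u^{\min}$ and $u^{\max}$ for object $u_{1}$ of Table II. The tuple layout (probability at index 0, attributes at indices 1 to $d$) follows the $u^{a}[0]$ convention in the text; the function name `mbr` is an illustrative choice, not part of the paper.

```python
from typing import List, Tuple

# An instance is (probability, attr_1, ..., attr_d): index 0 holds Pr(u^a),
# indices 1..d hold the attribute values.
Instance = Tuple[float, ...]

def mbr(instances: List[Instance]) -> Tuple[Tuple[float, ...], Tuple[float, ...]]:
    """Return (u_min, u_max), the two corners of the minimum bounding rectangle."""
    attrs = [inst[1:] for inst in instances]          # drop the probability
    u_min = tuple(min(col) for col in zip(*attrs))    # per-dimension minimum
    u_max = tuple(max(col) for col in zip(*attrs))    # per-dimension maximum
    return u_min, u_max

# Object u_1 from Table II.
u1 = [(0.4, 28, 7), (0.3, 31, 11), (0.3, 35, 8)]
print(mbr(u1))  # ((28, 7), (35, 11))
```

The printed corners match the value $[(28,7),(35,11)]$ given in the text.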

We consider multiple uncertain data streams and denote an uncertain data stream as US, where the uncertain data objects are generated over time and become invalid after a period of time. Data streams play an important role in the era of big data with the advance of IoT technology and have attracted much attention for years. Most related research uses the sliding window model and focuses on the recent data in the stream. Our work follows this trend.

In a data stream, each data object has a timestamp that denotes the time at which it entered the system. This scenario is usually modeled with a sliding window, which is defined below.

Definition 2 (Sliding Window).

Suppose the sliding window, SW, is of size $|SW|$ . Then newly generated data become invalid after $|SW|$ time instances. We use $SW[t-|SW|+1,t]$ to denote the set of uncertain data objects in the current sliding window at time $t$ . The sliding window follows the first-in-first-out rule for keeping objects.
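The first-in-first-out behaviour of Definition 2 can be sketched with a double-ended queue. This is an illustrative sketch only: the class name `SlidingWindow` and the assumption that one object arrives per time instance are ours, not the paper's.

```python
from collections import deque

class SlidingWindow:
    """Keeps the objects whose timestamps fall in [t - size + 1, t]."""

    def __init__(self, size):
        self.size = size
        self.items = deque()  # (timestamp, object) pairs, oldest first

    def insert(self, t, obj):
        """Add obj arriving at time t and return the objects that expired."""
        self.items.append((t, obj))
        expired = []
        # Evict from the front while the oldest timestamp is outside the window.
        while self.items and self.items[0][0] < t - self.size + 1:
            expired.append(self.items.popleft()[1])
        return expired

    def contents(self):
        return [obj for _, obj in self.items]

sw = SlidingWindow(size=3)
for t, name in enumerate(["u1", "u2", "u3", "u4"], start=1):
    evicted = sw.insert(t, name)
print(sw.contents())  # ['u2', 'u3', 'u4']
```

With a window of size 3, the arrival of the fourth object at time 4 pushes out the object that arrived at time 1.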

In this paper, we use $u_{1},u_{2},\dots,u_{|SW|}$ to denote the uncertain data objects in SW (i.e., SW $=\{u_{1},u_{2},\dots,u_{|SW|}\}$ ), ordered by the arrival time of each data object. To search for the top- $k$ dominating objects, the system uses the dominant scores obtained from the skyline query. In our work, we assume that a smaller value of a dimensional attribute is better. The dominance relations between instances are defined as follows.

Definition 3 (Instance-level Dominance).

Given two uncertain data instances of two different uncertain data objects $u_{i}^{a}$ and $u_{j}^{b}$ where $a,b\in[1,n]$ and $i\neq j$ , if the condition $(\forall\alpha\in[1,d],u_{i}^{a}[\alpha]\leq u_{j}^{{b}}[\alpha])\wedge(\exists\beta\in[1,d],u_{i}^{a}[\beta]<u_{j}^{b}[\beta])$ holds, we say $u_{i}^{a}$ dominates $u_{j}^{b}$ and it is denoted as $u_{i}^{a}\prec u_{j}^{b}$ .

In short, none of $u_{j}^{b}$ ’s attributes is better (that is, strictly smaller) than the corresponding attribute of $u_{i}^{a}$ , and at least one attribute of $u_{i}^{a}$ is strictly smaller. Since a data object may have multiple instances, we classify object-level dominance into three cases, defined as follows.

Definition 4 (Object-level Dominance).

Suppose there are two uncertain data objects $u_{i}$ and $u_{j}$ , each with $n$ instances. If $u_{i}$ is considered as the dominator, the relation between $u_{i}$ and $u_{j}$ falls into one of the following cases:

1. Complete Dominance: all the instances of $u_{i}$ dominate all the instances of $u_{j}$ , denoted as $u_{i}\prec u_{j}$ .

2. Partial Dominance: some instances of $u_{i}$ dominate some instances of $u_{j}$ , denoted as $u_{i}\precsim u_{j}$ .

3. Missing Dominance: no instance of $u_{i}$ dominates any instance of $u_{j}$ , denoted as $u_{i}\nprec u_{j}$ .

In summary, the probability of $u_{i}$ dominating $u_{j}$ can be generally expressed as

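$$Pr[u_{i}\prec u_{j}]=\sum_{a=1}^{n}\Big(Pr(u_{i}^{a})\times\sum_{\forall u_{j}^{b}\in u_{j},\,u_{i}^{a}\prec u_{j}^{b}}Pr(u_{j}^{b})\Big).$$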
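The instance-level test of Definition 3, the three object-level cases of Definition 4, and the dominance probability $Pr[u_{i}\prec u_{j}]$ can be sketched in a few lines of Python on the data of Table II. The function names are illustrative, and "partial" is taken here to mean "at least one but not all instance pairs dominate", which is our reading of Definition 4.

```python
# Table II: each instance is (probability, attr_1, attr_2).
U = {
    "u1": [(0.4, 28, 7), (0.3, 31, 11), (0.3, 35, 8)],
    "u2": [(0.6, 21, 16), (0.1, 17, 21), (0.3, 15, 17)],
    "u3": [(0.7, 72, 33), (0.2, 67, 30), (0.1, 64, 35)],
    "u4": [(0.8, 48, 19), (0.1, 43, 23), (0.1, 52, 26)],
}

def dominates(p, q):
    """Instance-level dominance (Definition 3); smaller attribute values are better."""
    a, b = p[1:], q[1:]
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def dominance_probability(ui, uj):
    """Pr[u_i < u_j]: weight each u_i^a by the probability mass of u_j it dominates."""
    return sum(p[0] * sum(q[0] for q in uj if dominates(p, q)) for p in ui)

def classify(ui, uj):
    """Object-level dominance case of Definition 4 with u_i as the dominator."""
    pairs = [dominates(p, q) for p in ui for q in uj]
    if all(pairs):
        return "complete"
    if any(pairs):
        return "partial"
    return "missing"

print(classify(U["u1"], U["u3"]))                            # complete
print(classify(U["u2"], U["u4"]))                            # partial
print(round(dominance_probability(U["u2"], U["u4"]), 2))     # 0.92
```

Instance $u_{2}^{2}=(17,21)$ fails to dominate $u_{4}^{1}=(48,19)$ because $21>19$, which is why $u_{2}$ only partially dominates $u_{4}$.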

Based on the above definitions, the score of the dominance relation between objects is defined as follows.

Definition 5 (Dominant Score of an Object).

Given an uncertain data object $u_{i}$ with $n$ instances, the expected dominant score of an instance $u_{i}^{a}$ can be derived by

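$$dom(u_{i}^{a})=\sum_{u_{j}\in U,\,i\neq j}\left\{Pr(u_{i}^{a})\times Pr(u_{j}^{b})\mid u_{i}^{a}\prec u_{j}^{b}\right\}.$$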

Then, the dominant score of the uncertain object $u_{i}$ is defined as

$$dom(u_{i})=\sum_{a=1}^{n}dom(u_{i}^{a}).$$

Consider the example in Table II and Fig. 1, where each of the instances $u_{1}^{1}$ , $u_{1}^{2}$ , and $u_{1}^{3}$ dominates all of the following instances: $u_{3}^{1}$ , $u_{3}^{2}$ , $u_{3}^{3}$ , $u_{4}^{1}$ , $u_{4}^{2}$ , and $u_{4}^{3}$ by Definition 3. In other words, by Definition 4, object $u_{1}$ completely dominates objects $u_{3}$ and $u_{4}$ , denoted as $u_{1}\prec u_{3}$ and $u_{1}\prec u_{4}$ . The derivation of $dom(u_{1})$ can be presented as

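$$dom(u_{1})=\sum_{a=1}^{3}dom(u_{1}^{a})=(0.4+0.3+0.3)\times(0.7+0.2+0.1)+(0.4+0.3+0.3)\times(0.8+0.1+0.1)=1+1=2.$$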

Object $u_{2}$ completely dominates object $u_{3}$ and partially dominates object $u_{4}$ , so $dom(u_{2})$ is calculated as

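$$dom(u_{2})=\sum_{a=1}^{3}dom(u_{2}^{a})=(0.6+0.1+0.3)\times(0.7+0.2+0.1)+\big[0.6\times(0.8+0.1+0.1)+0.1\times(0.1+0.1)+0.3\times(0.8+0.1+0.1)\big]=1+0.92=1.92.$$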

Consequently, we can obtain the dominant scores of all the uncertain objects in the same way.
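To make "in the same way" concrete, the following sketch evaluates Definition 5 by brute force for all four objects of Table II and ranks them. It is the straightforward quadratic procedure, not the PTDMUS algorithm; the helper names are illustrative.

```python
# Table II: each instance is (probability, attr_1, attr_2).
U = {
    "u1": [(0.4, 28, 7), (0.3, 31, 11), (0.3, 35, 8)],
    "u2": [(0.6, 21, 16), (0.1, 17, 21), (0.3, 15, 17)],
    "u3": [(0.7, 72, 33), (0.2, 67, 30), (0.1, 64, 35)],
    "u4": [(0.8, 48, 19), (0.1, 43, 23), (0.1, 52, 26)],
}

def dominates(p, q):
    """Instance-level dominance; index 0 is the probability and is ignored."""
    a, b = p[1:], q[1:]
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def dom(name):
    """Dominant score of an object: sum of Pr(u_i^a) * Pr(u_j^b) over dominated pairs."""
    score = 0.0
    for other, instances in U.items():
        if other == name:
            continue  # an object is never compared with itself (i != j)
        for p in U[name]:
            for q in instances:
                if dominates(p, q):
                    score += p[0] * q[0]
    return score

scores = {name: dom(name) for name in U}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print(top2)  # ['u1', 'u2']
```

Up to floating-point rounding, the scores are $dom(u_{1})=2$, $dom(u_{2})=1.92$, $dom(u_{3})=0$, and $dom(u_{4})=1$, so the top-2 dominating objects in this toy data set are $u_{1}$ and $u_{2}$.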

III-B System Architecture

In this work, we construct an edge computing system, as shown in Fig. 2, that supports parallel and distributed computing for monitoring the top- $k$ dominating query over multiple uncertain IoT data streams, which improves the efficiency of computation. The system consists of a coordinator node (cloud service) $N_{H}$ and $m$ monitor nodes (edge computing nodes) $N_{1},N_{2},\dots,N_{m}$ . Each monitor node $N_{j}$ , where $1\leq j\leq m$ , can communicate directly with the coordinator node $N_{H}$ . For $N_{H}$ , all the information reported by each $N_{j}$ is regarded as an uncertain data stream $US_{j}$ . Each $N_{j}$ needs to continuously compute the local result of the query and upload it to $N_{H}$ as the candidate result. $N_{H}$ needs to record all the unexpired candidates received from each $N_{j}$ .

III-C Problem Statement

Our objective is a time-efficient approach for determining the uncertain objects with the top- $k$ dominant scores among all the uncertain objects in the considered system model. A global sliding window $SW_{H}$ and $m$ uncertain data streams $US_{1},US_{2},\ldots,US_{m}$ are given. Each $US_{j}$ corresponds to the monitor node $N_{j}$ , where $1\leq j\leq m$ . Each $N_{j}$ has its local window $SW_{j}$ , and $|SW_{H}|=m\times|SW_{j}|$ . Each $N_{j}$ examines the objects in $SW_{j}$ , saves the possible objects in a local candidate set, and then reports the local candidate set to the coordinator node $N_{H}$ . $N_{H}$ uses the received local candidate sets to calculate the global candidate set and then broadcasts it to each $N_{j}$ . Each $N_{j}$ uses the received global candidate set to derive the dominant scores of the objects that dominate others and then returns the scores to $N_{H}$ . After that, $N_{H}$ integrates the received score information of each object and finds the $k$ data objects that have the highest scores. The final result set of Probabilistic Top- $k$ Dominating objects is denoted as $PTD$ . Note that the above process is repeated until there is no more input data.
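The final integration step at $N_{H}$ can be sketched as follows. The sketch assumes that each monitor node returns, for every global candidate, the part of its dominant score contributed by the objects in that node's local window; because Definition 5 sums over the other objects, these parts can simply be added. The function name and the numbers in `reports` are hypothetical.

```python
from collections import defaultdict

def merge_and_rank(partial_scores, k):
    """Sum per-node partial dominant scores and return the k best object ids."""
    total = defaultdict(float)
    for node_scores in partial_scores:          # one dict per monitor node N_j
        for obj_id, score in node_scores.items():
            total[obj_id] += score
    ranked = sorted(total, key=lambda o: total[o], reverse=True)
    return ranked[:k]

reports = [
    {"u1": 1.0, "u2": 1.0, "u4": 1.0},   # hypothetical partial scores from N_1
    {"u1": 1.0, "u2": 0.92},             # hypothetical partial scores from N_2
]
print(merge_and_rank(reports, k=2))  # ['u1', 'u2']
```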

According to the above assumptions, there are three important issues to be solved:

1. How can unnecessary dominant score computations be avoided to save computation time?

2. How can the number of local candidate objects be minimized to reduce the transmission cost?

3. How can the frequency of dominant score derivations be reduced as time evolves?

IV Probabilistic Top-k Dominating Query Process over Multiple Uncertain Data Streams (PTDMUS)

In this section, we present the proposed approach, Probabilistic Top-k Dominating Query Process over Multiple Uncertain Data Streams (PTDMUS). PTDMUS provides three mechanisms to solve the above three issues, and we introduce each of them in turn.

IV-A The Computation with R-trees

In the first part, we apply R-trees to the considered system to speed up the derivation of dominant scores. Note that we use [25] to generate bulk-loaded R-trees and minimize the overlap of elements in each level, thereby optimizing the search time. By combining the MBR representation with Definition 4, we can define the dominance relation between different MBRs as in Definition 6.

Definition 6 (Dominance between different MBRs).

Given two different minimum bounding rectangles $MBR(u_{1})$ and $MBR(u_{2})$ of objects $u_{1}$ and $u_{2}$ , respectively, if we consider $MBR(u_{1})$ as the dominator, the relation between $MBR(u_{1})$ and $MBR(u_{2})$ falls into one of the following cases:

1. Complete Dominance: $u_{1}^{\max}$ of $MBR(u_{1})$ is smaller than $u_{2}^{\min}$ of $MBR(u_{2})$ , denoted as $MBR(u_{1})\prec MBR(u_{2})$ .

2. Partial Dominance: $u_{1}^{\min}$ of $MBR(u_{1})$ is smaller than $u_{2}^{\max}$ of $MBR(u_{2})$ , denoted as $MBR(u_{1})\precsim MBR(u_{2})$ .

3. Missing Dominance: $u_{1}^{\min}$ of $MBR(u_{1})$ is larger than $u_{2}^{\max}$ of $MBR(u_{2})$ , denoted as $MBR(u_{1})\nprec MBR(u_{2})$ .
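The three cases of Definition 6 can be checked with two corner comparisons. In the sketch below, "smaller than" between corners is interpreted as the dominance test of Definition 3 applied to the corner points, the cases are tested in the order complete, partial, missing, and anything that is neither complete nor partial is treated as missing. These interpretations are assumptions of this sketch.

```python
def dominates(a, b):
    """Point a dominates point b: no coordinate larger, at least one strictly smaller."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def mbr_relation(mbr1, mbr2):
    """Classify MBR(u1) against MBR(u2), with MBR(u1) as the dominator."""
    (min1, max1), (min2, max2) = mbr1, mbr2
    if dominates(max1, min2):      # even the worst corner of u1 beats the best of u2
        return "complete"
    if dominates(min1, max2):      # the best corner of u1 beats the worst of u2
        return "partial"
    return "missing"

# MBRs (min corner, max corner) of the objects in Table II.
mbr_u1 = ((28, 7), (35, 11))
mbr_u2 = ((15, 16), (21, 21))
mbr_u3 = ((64, 30), (72, 35))
mbr_u4 = ((43, 19), (52, 26))

print(mbr_relation(mbr_u1, mbr_u3))  # complete
print(mbr_relation(mbr_u2, mbr_u4))  # partial
print(mbr_relation(mbr_u1, mbr_u2))  # missing
```

Only the partial case requires looking at individual instances, which is what makes the MBR test useful for pruning.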

Using the cases in Definition 6, the system can preclude irrelevant objects effectively and thus reduce the computation overhead for dominant scores. For example, to derive the dominant score of $u_{i}$ , the system uses MBR( $u_{i}$ ) and the R-tree as inputs. The system puts all the children of the root in a Target Set ( $TS$ ) and then examines the relation between MBR( $u_{i}$ ) and each element $e_{k}$ in $TS$ :

1. Complete Dominance: if $e_{k}$ is an MBR node, add the number of objects in $e_{k}$ to $dom(u_{i})$ ; otherwise, $e_{k}$ is an object and $dom(u_{i})=dom(u_{i})+1$ .

2. Partial Dominance: if $e_{k}$ is an MBR node, put all the children of $e_{k}$ in the Next Target Set ( $NTS$ ); otherwise, $e_{k}$ is an object and the dominant score of $u_{i}$ with respect to $e_{k}$ is added directly to $dom(u_{i})$ .

3. Missing Dominance: do nothing.

After examining all the elements in $TS$ , if $NTS$ is not empty, the system clears $TS$ and moves all the elements of $NTS$ into $TS$ . The system repeats the above operations until $TS$ is empty, at which point it has obtained the dominant score of $u_{i}$ with respect to all the objects in the R-tree. This computation with the R-tree is executed on the monitor nodes in PTDMUS.
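The level-by-level traversal with $TS$ and $NTS$ can be sketched as below. This is an illustration under several assumptions: the tree is hand-built with two leaf nodes rather than bulk-loaded, the class names `Obj` and `Node` are hypothetical, the query object skips itself, and a completely dominated object contributes 1 because its instance probabilities sum to 1.

```python
from dataclasses import dataclass
from typing import List, Tuple

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

@dataclass
class Obj:
    name: str
    instances: List[Tuple[float, ...]]    # (probability, attr_1, ..., attr_d)
    def corners(self):
        cols = list(zip(*(inst[1:] for inst in self.instances)))
        return tuple(min(c) for c in cols), tuple(max(c) for c in cols)

@dataclass
class Node:
    children: list                        # Node or Obj elements
    def corners(self):
        mins, maxs = zip(*(c.corners() for c in self.children))
        return (tuple(min(c) for c in zip(*mins)),
                tuple(max(c) for c in zip(*maxs)))
    def count(self):
        return sum(c.count() if isinstance(c, Node) else 1 for c in self.children)

def relation(q, e):
    """Definition 6 with the query object q as the dominator."""
    (q_min, q_max), (e_min, e_max) = q.corners(), e.corners()
    if dominates(q_max, e_min):
        return "complete"
    if dominates(q_min, e_max):
        return "partial"
    return "missing"

def exact_score(q, o):
    """Instance-level dominant score of q with respect to a single object o."""
    return sum(p[0] * r[0] for p in q.instances for r in o.instances
               if dominates(p[1:], r[1:]))

def dom_score(q, root):
    score, ts = 0.0, list(root.children)  # TS starts with the children of the root
    while ts:
        nts = []                          # NTS collects elements for the next round
        for e in ts:
            if e is q:
                continue                  # never compare the object with itself
            rel = relation(q, e)
            if rel == "complete":
                score += e.count() if isinstance(e, Node) else 1
            elif rel == "partial":
                if isinstance(e, Node):
                    nts.extend(e.children)
                else:
                    score += exact_score(q, e)
        ts = nts
    return score

u1 = Obj("u1", [(0.4, 28, 7), (0.3, 31, 11), (0.3, 35, 8)])
u2 = Obj("u2", [(0.6, 21, 16), (0.1, 17, 21), (0.3, 15, 17)])
u3 = Obj("u3", [(0.7, 72, 33), (0.2, 67, 30), (0.1, 64, 35)])
u4 = Obj("u4", [(0.8, 48, 19), (0.1, 43, 23), (0.1, 52, 26)])
root = Node([Node([u1, u2]), Node([u3, u4])])

print(dom_score(u1, root))            # 2.0
print(round(dom_score(u2, root), 2))  # 1.92
```

For $u_{1}$, the node holding $u_{3}$ and $u_{4}$ is completely dominated, so both objects are counted without touching any instance; only the node holding $u_{1}$ and $u_{2}$ has to be opened.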

IV-B Threshold-based Probabilistic k-skyband

To reduce the number of candidate objects, [13] proposed a $k$ -skyband approach for the top- $k$ query on certain data. In our work, we follow this idea and define a probabilistic $k$ -skyband to minimize the size of the candidate set. First, we define the dominated score of an uncertain object as follows.

Definition 7.

Given an uncertain data object $u_{i}$ in the sliding window $SW$ , the score of an instance $u_{i}^{a}$ being dominated (or called dominated score) is defined as

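$$r\text{-}dom(u_{i}^{a})=\sum_{u_{j}^{b}\prec u_{i}^{a},\,u_{j}\in SW-\{u_{i}\}}Pr(u_{i}^{a})\times Pr(u_{j}^{b}),$$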

where $u_{j}^{b}$ is an instance of $u_{j}$ , and the dominated score of $u_{i}$ is

$$r\text{-}dom(u_{i})=\sum_{a=1}^{n}r\text{-}dom(u_{i}^{a}).$$
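As a worked illustration (computed here from Table II, not taken from the paper), consider the dominated score of $u_{4}$ when the window contains the four objects of Table II. For each instance of $u_{4}$, its probability is multiplied by the total probability of the instances of other objects that dominate it. All instances of $u_{1}$ dominate every instance of $u_{4}$; $u_{2}^{2}=(17,21)$ does not dominate $u_{4}^{1}=(48,19)$; and no instance of $u_{3}$ dominates any instance of $u_{4}$.

```latex
\begin{aligned}
r\text{-}dom(u_{4}^{1}) &= 0.8\times\big[(0.4+0.3+0.3)+(0.6+0.3)\big]=1.52,\\
r\text{-}dom(u_{4}^{2}) &= 0.1\times\big[(0.4+0.3+0.3)+(0.6+0.1+0.3)\big]=0.20,\\
r\text{-}dom(u_{4}^{3}) &= 0.1\times\big[(0.4+0.3+0.3)+(0.6+0.1+0.3)\big]=0.20,\\
r\text{-}dom(u_{4}) &= 1.52+0.20+0.20=1.92.
\end{aligned}
```

Intuitively, $u_{4}$ is dominated by slightly fewer than two objects on average.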

After obtaining the dominated score of each object, we can define the probabilistic $k$ -skyband ( $KS$ ) as follows.

Definition 8 (Probabilistic k-skyband).

For a given integer $k$ , the probabilistic $k$ -skyband ( $KS$ ) is the set of uncertain data objects $KS=\{u\in U|dom(u)\geq 1\wedge r\text{-}dom(u)<k\}$ .

Note that, for certain data, the top- $k$ dominating result is always a part of the $k$ -skyband [26]. For uncertain data, a similar property holds, as shown in Theorem 1.

Theorem 1.

Given the set $PTD$ of probabilistic top- $k$ dominating objects and an object $u\in PTD$ , if $dom(u)\geq 1$ , then $u\in KS$ .

Proof.

If $u\notin KS$ , then $r$ - $dom(u)\geq k$ according to Definition 8. In this case, at least $k$ other objects dominate $u$ on average. It is hence impossible for $u$ to be one of the top- $k$ dominating objects, so $u\notin PTD$ . This contradicts the given condition, which completes the proof. ∎

To use the property in Theorem 1 for processing probabilistic top- $k$ dominating queries in parallel, we derive the Threshold-based Probabilistic $k$ -skyband by introducing a threshold $\delta\leq k$ .

Definition 9 (Threshold-based Probabilistic k-skyband).

For a given integer $k$ and a threshold value $\delta\leq k$ , the threshold-based probabilistic $k$ -skyband (TKS) is a subset of $KS$ defined as $TKS=\{u\in U|dom(u)\geq 1\wedge r\text{-}dom(u)<\delta\leq k\}$ .
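The filters of Definitions 8 and 9 can be sketched by brute force on Table II. Exact rational arithmetic (`fractions.Fraction`) is used so that the boundary test $dom(u)\geq 1$ is not affected by floating-point rounding; the function names and the choices $k=2$ and $\delta=1$ are illustrative assumptions.

```python
from fractions import Fraction as F

# Table II with exact probabilities: (probability, attr_1, attr_2).
U = {
    "u1": [(F("0.4"), 28, 7), (F("0.3"), 31, 11), (F("0.3"), 35, 8)],
    "u2": [(F("0.6"), 21, 16), (F("0.1"), 17, 21), (F("0.3"), 15, 17)],
    "u3": [(F("0.7"), 72, 33), (F("0.2"), 67, 30), (F("0.1"), 64, 35)],
    "u4": [(F("0.8"), 48, 19), (F("0.1"), 43, 23), (F("0.1"), 52, 26)],
}

def dominates(p, q):
    a, b = p[1:], q[1:]
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pair_prob(ui, uj):
    """Sum of Pr(u_i^a) * Pr(u_j^b) over the instance pairs where u_i^a dominates u_j^b."""
    return sum(p[0] * q[0] for p in ui for q in uj if dominates(p, q))

def dom(name):
    """Dominant score: how much probability mass of other objects `name` dominates."""
    return sum(pair_prob(U[name], U[o]) for o in U if o != name)

def r_dom(name):
    """Dominated score: how much `name` is dominated by the other objects."""
    return sum(pair_prob(U[o], U[name]) for o in U if o != name)

def skyband(k, delta=None):
    """KS when delta is None, otherwise TKS with threshold delta <= k."""
    bound = k if delta is None else min(delta, k)
    return sorted(n for n in U if dom(n) >= 1 and r_dom(n) < bound)

print(skyband(k=2))            # ['u1', 'u2', 'u4']
print(skyband(k=2, delta=1))   # ['u1', 'u2']
```

With $k=2$, object $u_{4}$ stays in $KS$ because $r\text{-}dom(u_{4})=1.92<2$, but the tighter threshold $\delta=1$ removes it, which shrinks the candidate set that a monitor node would have to report.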

This method addresses the second issue in the problem statement by reducing the size of the candidate set in each monitor node. As a result, the coordinator node also has to process fewer received candidate objects that may be top- $k$ dominating objects. In conventional methods [26, 27], the $k$ -skyband is computed on both the monitor and coordinator nodes. However, in most modern big data applications this is inefficient, since the number of candidates is usually still too large relative to the top- $k$ dominating objects that users care about. The coordinator node still spends too much computation on the $k$ -skyband calculation, which makes the response time intolerable to users. Hence, in the proposed PTDMUS approach, the coordinator node uses a new mechanism, Minimum Checking Time (MCT), to derive the final result efficiently instead of computing the global threshold-based probabilistic $k$ -skyband as the candidate set, $CS$ . Note that $CS=\bigcup_{j=1}^{m}TKS_{j}$ , where $TKS_{j}$ is the local threshold-based probabilistic $k$ -skyband from monitor node $N_{j}$ . This mechanism helps the coordinator node process only the objects that are really relevant to the query, decreases the frequency of score derivation on irrelevant objects, and significantly reduces the computational cost while improving the response time.

IV-C Minimum Checking Time

We observe that, in most scenarios, each monitor node uploads a candidate set that is very similar to the one it uploaded previously. In other words, the local candidates received from each monitor node rarely change dramatically over time. Therefore, we can record the statuses of candidate objects so that each monitor node uploads only the objects that need to be updated. This alleviates the transmission cost mentioned in the second issue.
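The idea of uploading only what changed can be illustrated with a small sketch. It assumes that an upload is represented as a mapping from object id to score, which is a simplification introduced here; the function name `candidate_delta` is hypothetical, and the exact content of the update message in PTDMUS is defined later in the paper, not by this code.

```python
def candidate_delta(previous, current):
    """Compare two uploads given as {object_id: score} dictionaries.

    Returns (changed, removed): entries that are new or whose score differs,
    and the ids that disappeared from the candidate set.
    """
    changed = {oid: s for oid, s in current.items() if previous.get(oid) != s}
    removed = sorted(oid for oid in previous if oid not in current)
    return changed, removed


prev_upload = {"a": 5, "b": 3, "c": 2}
cur_upload = {"a": 5, "b": 4, "d": 1}
changed, removed = candidate_delta(prev_upload, cur_upload)
print(changed, removed)  # {'b': 4, 'd': 1} ['c']
```

When consecutive candidate sets are similar, as observed above, the delta is much smaller than the full set.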

In this paper, we use a checking-time table ($CT$) to record the statuses of received data objects on the coordinator node $N_{H}$. The status of each received data object $u$ is stored in one entry of $CT$ until the lifetime of $u$ expires. Note that the lifetime of a data object is equal to the length of $N_{H}$’s sliding window $SW_{H}$. The coordinator node thus only needs to update the status of data objects in $CT$ when necessary and then calculate the final result. In general, most of the objects will not be in the final result. Hence, we can predict the minimum checking time for the coordinator node. With $CT$ and the minimum checking time derivation, the server performs the computation only when the result may change. This idea comes from the conventional work [21] on centralized database systems. We thereby propose a new theorem, adapted to distributed environments, for dynamically determining the minimum checking time at which to update the result set $PTD$.

Theorem 2 (Minimum Checking Time).

Suppose that the notations are defined as above. The minimum checking time for the coordinator node represents the lower bound of the expected time at which the result set of probabilistic top-$k$ dominating objects, $PTD$, will change, and it can be derived by

$$mct(u)=\min\left(exp_{min},\ \left\lfloor\dfrac{dom_{k}-dom(u)}{mn}\right\rfloor+t_{cur}\right),$$

where $exp_{min}$ is the nearest (smallest) expiration time of an object in $PTD$, $dom_{k}$ is the $k$-th highest dominant score of the objects in $PTD$, and $t_{cur}$ is the current time.

Proof.

When $u\in CS$ and $u\notin PTD$, $u$ has a chance to enter $PTD$ if one of the following two cases is satisfied:

1. some objects in $PTD$ have expired;

2. $dom(u)\geq dom_{k}$.

Our objective is to obtain the minimum time at which $u$ can enter $PTD$. For Case 1, the process searches for the minimum expiration time of all objects in $PTD$, denoted by $exp_{min}$. For Case 2, let $u_{k}$ be the object with the $k$-th highest dominant score in $PTD$ and let $u_{old}$ be an object that is going to be removed from $PTD$. Removing $u_{old}$ from $PTD$ reduces the gap, $dom_{k}-dom(u)$, between $u$ and $u_{k}$, so the removal of some objects such as $u_{old}$ can result in $dom(u)\geq dom_{k}$. In general, there may be many old objects like $u_{old}$. The system needs to remove at least $\lfloor\dfrac{dom_{k}-dom(u)}{n}\rfloor$ old objects, leaving the minimum number of necessary objects in $PTD$, before $dom(u)\geq dom_{k}$ holds. Since each run of the computation can remove $m$ old objects like $u_{old}$, this quantity is divided by $m$ to obtain the minimum time period $\lfloor\dfrac{dom_{k}-dom(u)}{mn}\rfloor$. Adding this period to the current time gives the predicted time, $\lfloor\dfrac{dom_{k}-dom(u)}{mn}\rfloor+t_{cur}$, at which $dom(u)\geq dom_{k}$ may hold. Finally, the minimum checking time with respect to an object $u$, denoted by $mct(u)$, is $\min(exp_{min},\lfloor\dfrac{dom_{k}-dom(u)}{mn}\rfloor+t_{cur})$. ∎

While computing $mct(u)$ of each object $u$, if the obtained time is equal to $t_{cur}$, then $u$ had a chance to enter the result in this run, but its dominant score is not large enough to make $u\in PTD$. In that case, if $u\in SW_{H}$ during the next run of computation, the system records $u$ in the checking-time table $CT$ and sets $mct(u)=t_{cur}+1$ for the next run. In summary, with the proposed theorem, the checking-time table $CT$ acts as a priority cache table and helps $N_{H}$ effectively reduce the frequency of computation, which addresses the third issue.
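The expression derived in the proof of Theorem 2 can be evaluated directly. The sketch below is illustrative only: the function names `mct` and `due_objects` and the dictionary used as a checking-time table are assumptions made here, and the check that $u$ is still inside $SW_{H}$ in the next run is omitted for brevity.

```python
import math


def mct(dom_u, dom_k, exp_min, t_cur, m, n):
    """Minimum checking time of a candidate u that is not currently in PTD.

    dom_u   : dominant score of u
    dom_k   : k-th highest dominant score in PTD
    exp_min : smallest expiration time among the objects in PTD
    t_cur   : current time slot
    m, n    : the quantities m and n used in the proof of Theorem 2
    """
    predicted = math.floor((dom_k - dom_u) / (m * n)) + t_cur
    t = min(exp_min, predicted)
    # If the bound does not lie in the future, re-check u in the very next run.
    return t_cur + 1 if t <= t_cur else t


def due_objects(checking_table, t_cur):
    """Ids in a {object_id: mct} table whose checking time has been reached."""
    return sorted(oid for oid, t in checking_table.items() if t <= t_cur)


print(mct(dom_u=10, dom_k=130, exp_min=20, t_cur=5, m=4, n=5))   # 11
print(mct(dom_u=10, dom_k=130, exp_min=8, t_cur=5, m=4, n=5))    # 8
print(mct(dom_u=125, dom_k=130, exp_min=20, t_cur=5, m=4, n=5))  # 6
```

In the first call the score gap determines the bound, in the second an expiring result object does, and in the third the gap is so small that the object is simply scheduled for the next run.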

IV-D The Process of PTDMUS

In fact, the overall process of PTDMUS has been briefly described in Section III-C. In this subsection, we show the whole process using the pseudo-code in Algorithm 1. The system executes Algorithm 1 repeatedly until no more input data arrive. At Line 1 of Algorithm 1, each monitor node $N_{j}$ inserts the received data objects in $US_{j}$ into the local sliding window $SW_{j}$ at time $t$ and removes the oldest data objects in $SW_{j}$ due to the size limitation of a local sliding window. At Line 1, each $N_{j}$ pre-processes all the local objects in $SW_{j}$ to construct the local R-tree $R_{j}$ and to obtain the dominance relations between MBRs using Definition 6. Each $N_{j}$ then computes the local candidate (local $k$-skyband) set $CS_{j}^{t}$ with Definition 5, Definition 7, and Definition 9. Note that $CS_{j}^{t}=TKS_{j}^{t}$, where $TKS_{j}^{t}$ is the local result of threshold-based top-$k$ dominating objects from the monitor node $N_{j}$. If $t=0$ holds at Line 1, the whole process is in the initial phase and each $N_{j}$ at Line 1 uploads the whole candidate set $CS_{j}^{t}$ to the coordinator node $N_{H}$; otherwise, each $N_{j}$ only needs to upload the necessary update information to $N_{H}$ at Line 1.

After $N_{H}$ receives the local candidate set from each $N_{j}$ in $SW_{H}$, $N_{H}$ derives the global candidate set $CS^{t}$ in the same way (using Definition 7 and Definition 8) at Line 1. The coordinator node $N_{H}$ then broadcasts the global candidate set to every $N_{j}$ at Line 1 and asks each $N_{j}$ to help with the local computation. Each $N_{j}$ derives the dominant and dominated scores, $dom(u)$ and $r$-$dom(u)$, of all the objects in $CS_{j}^{t}$ and uploads the updated $CS_{j}^{t}$ to $N_{H}$ using Definition 9 at Lines 1 and 1. From Lines 1 to 1, $N_{H}$ uses the received dominated scores to update $CS^{t}$, finds the final global result $PTD^{t}$ for time $t$, and then updates the minimum checking times of the objects that may be the answer at time $t+exp_{min}$. $N_{H}$ broadcasts the checking-time information in $CT^{t}$ to every $N_{j}$ at Line 1 using Theorem 2. With the checking-time table $CT^{t}$, each $N_{j}$ can determine the appropriate time for the next round of update/derivation and thus effectively reduce the frequency of computation, which avoids a lot of unnecessary computation. Finally, the system returns $PTD^{t}$ as the final result to the user.

V Complexity Analysis

After introducing the PTDMUS process, we analyze its time complexity, space complexity, and transmission cost in both the average case and the worst case.

V-A Time Complexity

In the first run (time slot $t=0$) of the PTDMUS process described in the previous section, each monitor node $N_{j}$ spends time constructing a local R-tree, $R_{j}$, over all the data objects in $SW_{j}$ in the initial step, deriving $dom(u)$ and $r$-$dom(u)$ of each $u$ in $R_{j}$ in the second step, and extracting the threshold-based top-$k$ dominating objects into $TKS_{j}$ in the last step. Hence, the time complexity of PTDMUS on a monitor node $N_{j}$ can be expressed as

[TABLE]

In PTDMUS, the time complexity is related to maintaining and searching the R-trees. According to [28, 29], the time for constructing a $d$-dimensional R-tree is $O(\dfrac{|U|}{B}\log_{R_{degree}/B}\dfrac{|U|}{B})$, where $B$ is the block (or page) size of data on the disk (or in memory) and $R_{degree}$ is the fanout (degree) of the R-tree. In this work, we handle the uncertain data objects in an object-oriented model ($B=1$), so the time for constructing the local R-tree will be

$$O\left(|SW_{j}|\log_{R_{degree}}|SW_{j}|\right).$$

In the considered environment, we assume that the data points are uniformly and independently distributed in the domain space $[0,2000]^{d}$. To simplify the analysis, we normalize the space into $[0,1]^{d}$. According to [30], the height $h_{j}$ of $R_{j}$ and the number of nodes $N_{L}$ at level $L$ (let the leaf level be $0$) will be approximately $h_{j}=1+\lceil\log_{R_{degree}}(|SW_{j}|/R_{degree})\rceil$ and $N_{L}=|SW_{j}|/(R_{degree})^{L+1}$, respectively. Besides, the extent $\theta_{L}$ (i.e., the length of any 1D projection) of a node at the $L$-th level can be estimated by $\theta_{L}=(1/N_{L})^{1/d}$, and some nodes at the $L$-th level may be partially dominated by $u$. Fig. 3(a) shows the gray region $I_{2}$, which corresponds to the maximal region covering nodes (at level $L$) that are partially dominated by $u^{\min}$. Then, the average number of required node accesses in the R-tree for computing the dominant score $dom(u)$ of object $u$ will be [31]

[TABLE]

where $v_{u^{\min}}$ is the value of $u^{\min}$ after the 1D projection and $n$ is the number of instances in an object. Hence, the time complexity of dominance update on the monitor node can be expressed as

[TABLE]

To obtain the local threshold-based probabilistic $k$-skyband, the monitor node $N_{j}$ also needs to traverse $R_{j}$ to derive the $r$-$dom(u)$ of each object $u$ in $SW_{j}$. Fig. 3(b) shows the gray region $I_{2}^{\prime}$, which corresponds to the maximal region covering nodes (at level $L$) that partially dominate $u^{\max}$. The average number of required node accesses in the R-tree for computing the $r$-$dom(u)$ of object $u$ will be

[TABLE]

where $v_{u^{\max}}$ is the value of $u^{\max}$ after the 1D projection. Hence, the time complexity of $k$ -skyband update on a monitor node can be derived by

[TABLE]

That is, with (4) and (5), the time complexity of the second step on $N_{j}$ will be

[TABLE]

where $\forall u\in SW_{j}$. In the last step, $N_{j}$ copies all the objects in $SW_{j}$ to a temporary candidate list $CS_{j}^{\prime}$, sorts the objects $u\in CS_{j}^{\prime}$ in decreasing order of $dom(u)$ using merge sort, and then uses the threshold $\delta$ to extract the local probabilistic $k$-skyband. Therefore, the time complexity of $\text{T}_{\text{extract}}(TKS_{j})$ can be denoted as

[TABLE]

In summary, with (3) to (V-A), we can express (V-A) as

[TABLE]

where $\forall u\in SW_{j}$ .
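The R-tree estimates used in the analysis above, namely the height $h_{j}$, the node count $N_{L}$, and the node extent $\theta_{L}$, are easy to evaluate numerically. The following sketch takes the leaf level as $L=0$ and assumes $|SW_{j}|\geq R_{degree}$; the function name `rtree_estimates` and its parameter names are hypothetical. An integer loop replaces the floating-point logarithm so that exact powers of the fanout are not rounded the wrong way.

```python
def rtree_estimates(window_size, fanout, d):
    """Estimated height and per-level (level, node count, extent) of an R-tree.

    window_size : number of objects |SW_j| indexed by the tree
    fanout      : R-tree degree R_degree
    d           : data dimensionality (space normalised to [0, 1]^d)
    """
    if fanout < 2 or window_size < fanout or d < 1:
        raise ValueError("expects fanout >= 2, window_size >= fanout, d >= 1")
    # Smallest e >= 0 with fanout**(e + 1) >= window_size,
    # i.e. e = ceil(log_fanout(window_size / fanout)).
    exponent = 0
    while fanout ** (exponent + 1) < window_size:
        exponent += 1
    height = 1 + exponent
    levels = []
    for level in range(height):
        nodes = window_size / fanout ** (level + 1)  # N_L
        extent = (1.0 / nodes) ** (1.0 / d)          # theta_L
        levels.append((level, nodes, extent))
    return height, levels


height, levels = rtree_estimates(window_size=1000, fanout=10, d=3)
print(height)  # 3
for level, nodes, extent in levels:
    print(level, nodes, round(extent, 4))
```

Because $\theta_{L}=(1/N_{L})^{1/d}$, the node extent approaches 1 as $d$ grows, so each node covers more of the normalised space in every dimension.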

Note that PTDMUS needs to monitor the result of the top- $k$ dominating query continuously in a monitoring time period $\varDelta t$ and $\varDelta t$ is set by

[TABLE]

After the first run (time slot), the coordinator node $N_{H}$ broadcasts the global candidate set $CS^{t}$ at time $t$, together with the minimum checking time $exp_{min}$, to each monitor node $N_{j}$. Each $N_{j}$ can use the received information to reduce the computational overhead of the next local computation when $t>0$. In the following time slots, $N_{j}$ uses the candidate set of the previous run, $CS^{t-exp_{min}}$, to construct the global R-tree, $R_{H}^{t}$, for dominance checks instead of using $R_{j}$. In practice, we use two temporary lists, $DO_{j}^{t}$ and $NO_{j}^{t}$, to help update the candidate set during the time period $(t-exp_{min},t]$. $DO_{j}^{t}$ records the objects that are going to be deleted, where $DO_{j}^{t}=\{SW_{j}^{t-exp_{min}}[0],SW_{j}^{t-exp_{min}}[1],\dots,SW_{j}^{t-exp_{min}}[exp_{min}-1]\}$. $NO_{j}^{t}$ stores the new input objects that are going to be added, where $NO_{j}^{t}=\{SW_{j}^{t}[|SW_{j}|-1],SW_{j}^{t}[|SW_{j}|-2],\dots,SW_{j}^{t}[|SW_{j}|-exp_{min}]\}$. Thus, the exact data set $UO_{j}^{t}$ that needs to be processed at time $t$ becomes $CS^{t-exp_{min}}\cup NO_{j}^{t}-DO_{j}^{t}$, and $N_{j}$ uses $UO_{j}^{t}$ to construct the new local $R_{j}^{t}$ for computing $TKS_{j}^{t}$. Substituting $UO_{j}^{t}$ for $SW_{j}$ in (3) to (V-A), the time complexity of a derivation run on $N_{j}$ at time $t$ can be obtained by

[TABLE]

where $\forall u\in UO_{j}^{t}$ . In summary, the average complexity during the time $\varDelta t$ will be

[TABLE]

In fact, the derived costs $\text{T}_{\text{average}}(N_{j},t=0)$ of the PTDMUS and PTDSky methods are similar since both use monitor nodes to derive the local $k$-skybands. From (V-A), we can see that the computation time is significantly influenced by the computation cost of each run when $t>0$. PTDMUS uses the minimum checking time $exp_{min}$ to reduce the frequency $f$ of derivations (or dominance checks), where $f=1+\lfloor\varDelta t/\overline{exp_{min}}\rfloor$. If $\overline{exp_{min}}=1$, each monitor node in PTDMUS and PTDSky will have similar computation time $\text{T}_{\text{average}}(N_{j})$. The worst case only occurs when $\overline{exp_{min}}=1$ and $N_{j}$ always receives the global candidate set $CS^{t}=SW_{H}$. In such a scenario, the set ${UO_{j}^{t}}^{\prime}$ that needs to be processed at each time slot $t$ will be ${UO_{j}^{t}}^{\prime}=SW_{H}^{t-exp_{min}}\cup NO_{j}^{t}-DO_{j}^{t}$, and $|{UO_{j}^{t}}^{\prime}|$ will become very large. To obtain the upper bound of the time complexity, $\text{T}_{\text{worst}}(N_{j})$, on a monitor node $N_{j}$, we can use $\overline{exp_{min}}=1$ and substitute ${UO_{j}^{t}}^{\prime}$ for $UO_{j}^{t}$ in (V-A) and (V-A).
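The construction of $DO_{j}^{t}$, $NO_{j}^{t}$, and $UO_{j}^{t}$ described in this subsection maps onto list slicing and set operations. The sketch below assumes that a sliding window is a list ordered from oldest (index 0) to newest, that objects are identified by hashable ids, and that the expression $CS^{t-exp_{min}}\cup NO_{j}^{t}-DO_{j}^{t}$ is read as $(CS^{t-exp_{min}}\cup NO_{j}^{t})-DO_{j}^{t}$. The function name `objects_to_process` is hypothetical.

```python
def objects_to_process(prev_window, cur_window, prev_candidates, exp_min):
    """Return (DO, NO, UO) for one monitor node.

    prev_window     : window contents at time t - exp_min, oldest first
    cur_window      : window contents at time t, oldest first
    prev_candidates : global candidate set received at time t - exp_min
    exp_min         : number of time slots skipped since the last derivation
    """
    if not 1 <= exp_min <= len(cur_window):
        raise ValueError("exp_min must be between 1 and the window size")
    do = set(prev_window[:exp_min])   # oldest exp_min objects, about to expire
    no = set(cur_window[-exp_min:])   # newest exp_min objects, just arrived
    uo = (set(prev_candidates) | no) - do
    return do, no, uo


do, no, uo = objects_to_process(
    prev_window=["o1", "o2", "o3", "o4", "o5"],
    cur_window=["o3", "o4", "o5", "o6", "o7"],
    prev_candidates={"o2", "o4", "o9"},  # "o9" came from another monitor node
    exp_min=2,
)
print(sorted(uo))  # ['o4', 'o6', 'o7', 'o9']
```

Only the four objects in `uo` are indexed in the next local R-tree, instead of the full window.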

After analyzing the time complexity on a monitor node, time complexity on the coordinator node, $\text{T}_{\text{average}}(N_{H})$ , also needs to be discussed. However, $\text{T}_{\text{average}}(N_{H})$ depends on the size of global candidate set $CS$ , so we will discuss $\text{T}_{\text{average}}(N_{H})$ after analyzing the space complexity on $N_{H}$ in the next subsection.

V-B Space Complexity

In the considered parallel computing model, the size of global candidate set $|CS|$ in the coordinator node $N_{H}$ depends on the size of received local threshold-based probabilistic $k$ -skyband $|TKS_{j}|$ from each monitor node $N_{j}$ and the number of monitor nodes $m$ . Suppose that $\text{PKsky}(u)$ is an indicator function defined as

[TABLE]

In most application scenarios of big data, the size of local result, $|TKS_{j}|$ , is usually larger than $k$ . Thus, the average size of global candidate set $\overline{|CS|}=\text{SP}_{\text{average}}(CS)$ will be

[TABLE]

where $\overline{|TKS|}$ is the average size of the received local threshold-based probabilistic $k$ -skybands from the monitor nodes. Note that both $|CS|$ and $|TKS_{j}|$ are usually much larger than $k$ in most big data applications. In general, $|TKS_{j}|$ is much smaller than $|SW_{j}|$ due to the dominance and object pruning by the threshold.

Considering the worst case, the space complexity of the candidate set $CS$ in the coordinator node $N_{H}$ can be denoted as

[TABLE]

where $\overline{|SW|}$ is the average size of the sliding windows in monitor nodes. The worst case only happens when all the uncertain data objects are anti-correlated in all dimensions. That is, the condition $\forall u\in U,\ dom(u)=0\wedge r\text{-}dom(u)=0$ holds, which causes all the data objects in the monitor nodes to be uploaded to the coordinator node. Thus, the worst-case space complexity in the coordinator node $N_{H}$ is $O(|SW_{H}|)$. However, such a case is almost impossible in big data environments.

After discussing the average and the worst space complexities on the coordinator node $N_{H}$ respectively, we can start discussing the time complexity of the computation on $N_{H}$ . In PTDMUS, instead of computing the global $k$ -skyband, $N_{H}$ just uses merge-sort to sort the received data objects from the monitor nodes by $dom(u)$ in a decreasing order, derives the expected checking time of $u$ , and finds the minimum checking time $exp_{min}$ at each run (time $t$ ), where $\forall u\in CS^{t}$ . Hence, the average time complexity for one run on $N_{H}$ , $\text{T}_{\text{average}}(N_{H},t\geq 0)$ , can be formulated as

[TABLE]

Due to the usage of the minimum checking time, the expected average time complexity can be derived by

[TABLE]

where $\overline{|CS|}=\text{SP}_{\text{average}}(CS)$ in (V-B) and $f=1+\lfloor\varDelta t/\overline{exp_{min}}\rfloor$ . Additionally, the worst case occurs when $\overline{exp_{min}}=1$ (or $f=\varDelta t$ ) and (12) holds. Then the worst time complexity can be obtained by

[TABLE]
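The coordinator step analyzed in this subsection, sorting the received candidates by $dom(u)$ in decreasing order and reading off the result, can be sketched as follows. The sketch assumes that each monitor node reports a mapping from object id to its final dominant score and that ids are unique across nodes; the function name `coordinator_top_k` is hypothetical. Python's built-in `sorted` (Timsort, a merge-sort variant) stands in for the merge sort named in the text.

```python
def coordinator_top_k(local_results, k):
    """Merge per-node {object_id: dom_score} maps and return the k best entries.

    Ties are broken by object id so that the output is deterministic.
    """
    merged = {}
    for result in local_results:
        merged.update(result)
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


received = [{"a": 7.0, "b": 2.0}, {"c": 5.0}, {"d": 5.0, "e": 1.0}]
top3 = coordinator_top_k(received, k=3)
print(top3)         # [('a', 7.0), ('c', 5.0), ('d', 5.0)]
print(top3[-1][1])  # 5.0
```

The last score of the returned list plays the role of $dom_{k}$ when the minimum checking times are derived.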

V-C Transmission Cost

In general, the transmission cost depends on the sizes of local probabilistic $k$ -skybands and the global candidate set. According to the process of PTDMUS in Algorithm 1, the average transmission cost of a monitor node can be expressed as

[TABLE]

where $TKS^{t=0}$, $CS^{t=0}$, and $CT^{t=0}$ are respectively the local threshold-based $k$-skyband, the candidate set, and the checking-time table at the initial step (the first time slot), and $Info_{\text{update}}$ is the minimum set of candidate objects that need to be updated at the $h\times\overline{exp_{min}}$ time slot. Note that $Info_{\text{update}}$ is expressed as

[TABLE]

In general, $|Info_{\text{update}}|$ is much smaller than $|TKS|$ and $|CS|$ . With (15), PTDMUS only needs to upload the update information $f=1+\lfloor\varDelta t/\overline{exp_{min}}\rfloor$ times during the monitor time $\varDelta t$ . By contrast, PTDSky needs to upload information at every time slot of $\varDelta t$ . In summary, Equations (13) to (V-C) are used to measure the average transmission cost of PTDMUS in the simulations.
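The number of upload rounds $f=1+\lfloor\varDelta t/\overline{exp_{min}}\rfloor$ quoted above can be evaluated with a one-line helper. The function name `upload_rounds` and the sample values are illustrative assumptions, not settings taken from the simulations.

```python
import math


def upload_rounds(delta_t, mean_exp_min):
    """f = 1 + floor(delta_t / mean_exp_min): the initial upload plus later updates."""
    if delta_t < 0 or mean_exp_min < 1:
        raise ValueError("expects delta_t >= 0 and mean_exp_min >= 1")
    return 1 + math.floor(delta_t / mean_exp_min)


print(upload_rounds(delta_t=40, mean_exp_min=4))  # 11
print(upload_rounds(delta_t=40, mean_exp_min=1))  # 41
```

The larger the average minimum checking time, the fewer rounds are needed over the same monitoring period.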

The worst case of update cost only occurs when each input data object in consecutive time slots always becomes the top-$1$ dominating object. In this case, $Info_{\text{update}}$ becomes $TKS^{t}$ and the monitor node always needs to upload $TKS^{t}$ at every time slot. In addition, the worst transmission cost of the information exchange between $N_{H}$ and $N_{j}$ occurs when $|CS^{t}|=|SW_{H}|=|CT^{t}|$. Hence, the worst transmission cost (or network load) of a monitor node will be

[TABLE]

VI Simulation Results

The simulations, including all compared approaches, are implemented in Java with Spark using the Eclipse IDE, and the developed program is platform-independent. The simulation program is executed on a Windows 10 server with an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz - 3.80GHz and 8GB $\times$ 2 memory. In this simulation, we use synthetic data and the number of uncertain data objects is 10,000. We compare three different approaches:

• PTDMUS runs with R-trees and the threshold-based probabilistic $k$-skyband in the monitor nodes, and with R-trees and the minimum checking time in the coordinator node;

• PTDSky runs with R-trees and the threshold-based probabilistic $k$-skyband in both monitor and coordinator nodes [13];

• PTDBF only runs with R-trees in a centralized way without any parallelism.

Since PTDBF runs on a centralized server with global information that includes all input data streams, PTDBF always obtains the correct result of a top-$k$ dominating query. Hence, PTDBF is treated as the baseline method in the simulation. The performance of the compared approaches is measured in terms of computation time, transmission cost, precision, and recall, while considering the effects of the threshold $\delta$, the data dimensionality, the number of monitor nodes, the size of the sliding window, the value of $k$, and the margin of uncertainty. In the previous section, both computation time and transmission cost were analyzed in detail for the average case and the worst case. The correctness and reliability of the proposed method are also important, so we validate the above methods in the simulation in terms of precision and recall. Suppose that $PTD^{t}_{\text{baseline}}$ is the result set of top-$k$ dominating objects obtained from PTDBF at time $t$ and $PTD^{t}_{\text{compared}}$ is the one obtained from PTDMUS or PTDSky at time $t$. The precision and recall can then be obtained by

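$$\text{Precision}=\dfrac{|PTD^{t}_{\text{baseline}}\cap PTD^{t}_{\text{compared}}|}{|PTD^{t}_{\text{compared}}|},\qquad\text{Recall}=\dfrac{|PTD^{t}_{\text{baseline}}\cap PTD^{t}_{\text{compared}}|}{|PTD^{t}_{\text{baseline}}|}.$$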
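As an illustration of how these two metrics can be computed for one time slot, the sketch below assumes the standard set-based definitions of precision and recall and represents each result set as a set of object ids. The function name `precision_recall` is hypothetical.

```python
def precision_recall(baseline, compared):
    """Set-based precision and recall of `compared` against `baseline`."""
    baseline, compared = set(baseline), set(compared)
    hits = len(baseline & compared)
    precision = hits / len(compared) if compared else 0.0
    recall = hits / len(baseline) if baseline else 0.0
    return precision, recall


ptd_baseline = {"a", "b", "c", "d"}  # e.g. the result reported by PTDBF
ptd_compared = {"a", "b", "e"}       # e.g. the result reported by PTDMUS or PTDSky
precision, recall = precision_recall(ptd_baseline, ptd_compared)
print(round(precision, 3), recall)   # 0.667 0.5
```

Averaging these values over all $\varDelta t$ time slots gives one precision and one recall figure per scenario.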

We perform the simulations in 20 different scenarios. Each scenario is executed for $\varDelta t$ runs (time slots) to obtain the average results, where $\varDelta t$ is set by (8). The detailed parameter settings are presented in Table III.

VI-A Threshold

As shown in Fig. 4(a), the computation time of PTDSky grows linearly as the given threshold $\delta$ increases. This is because PTDSky needs to process more candidate objects for the threshold-based $k$-skyband when the data dimensionality is high ($d=9$), and the looser the threshold $\delta$ becomes, the longer PTDSky takes. PTDBF simply handles all the uncertain data objects and sorts the result by dominant score directly, so it has the worst computation time, which is independent of the threshold. PTDMUS has the best computation time since the minimum checking time lets it avoid unnecessary computation on irrelevant objects. PTDMUS runs almost 10 times faster than PTDSky when $\delta=50$.

According to Fig. 4(b), PTDMUS saves almost 30% of the transmission cost compared with PTDSky when $\delta=50$. In general, both PTDSky and PTDMUS incur higher transmission cost as $\delta$ increases, since the local and global candidate sets become larger. However, with a table recording the minimum checking times of possible candidates, the monitor and coordinator nodes in PTDMUS do not need to exchange much candidate-set information if the continuous query result does not change much. As a result, PTDMUS outperforms PTDSky significantly. In addition, when $\delta$ increases, the global candidate set on the coordinator node becomes larger. PTDMUS can then record more information (minimum checking times) about candidate objects, thereby avoiding the unnecessary transmission of irrelevant objects. In the rest of the simulation, we choose $\delta=30$ as the default threshold. Note that PTDBF runs in a centralized way, so it has no transmission cost.

Since we use the threshold-based $k$-skyband to prune irrelevant objects in the proposed approach, we now measure its influence on the accuracy of the query result. Fig. 4(c) and Fig. 4(d) show that PTDMUS loses less than 0.001% in precision and recall, respectively. Such a tiny gap can be regarded as a tolerable error. In other words, with the minimum checking time, PTDMUS reduces the transmission cost significantly while maintaining good precision and recall.

VI-B Data Dimensionality

Fig. 5(a) shows that PTDBF has poor computation time, especially when the dimensionality $d$ is small. In general, each data object dominates a large number of objects when $d$ is small. In other words, each data object has a high probability of being dominated by other objects when $d$ is small. Hence, PTDBF needs more computation for dominance checks when $d$ is small. In addition, from the implementation perspective, PTDBF needs many more branch operations (conditions) for the dominance checks between each pair of objects, so its computation time is the worst. In comparison with PTDBF, both $k$-skyband based methods, PTDSky and PTDMUS, need less computation time since the $k$-skyband can exploit the characteristics of R-trees and MBRs to preclude irrelevant objects effectively. When the dimensionality becomes large ($d=9$), PTDSky and PTDBF have similar computation time. On the other hand, PTDMUS has the best computation time and outperforms PTDSky and PTDBF by more than 85% when $d=9$.

As shown in Fig. 5(b), PTDSky and PTDMUS have very similar average transmission cost when $d\leq 7$. In this simulation, the size of the given uncertain data set $|U|$ is 10,000. We can observe that PTDSky and PTDMUS need to transmit more than 9,000 candidate objects when $d\leq 7$. Such a phenomenon indicates that the probability of an object dominating another object decreases significantly, so the number of candidate objects becomes large and close to $|U|$. In the case of $d>7$, the coordinator node in PTDMUS can record the minimum checking times of more than 9,000 objects, and the minimum checking time table helps the coordinator node avoid transmitting the information of irrelevant objects to monitor nodes at some time slots (runs). Hence, the average transmission cost of PTDMUS improves by nearly 20% in such a scenario ($d=9$).

Even though PTDMUS uses a predictive mechanism to reduce the frequency of updating candidate objects, Fig. 5(c) and Fig. 5(d) show that it achieves almost the same precision and recall as PTDSky. In comparison with PTDSky, PTDMUS loses only 0.001% in both precision and recall for $d=7$. In most applications, such a tiny loss can be regarded as a tolerable error.

VI-C Number of Monitor Nodes

The considered environment is implemented in a parallel model. We now discuss the performance of each method in scenarios with various numbers of monitor nodes. Note that there are no results for PTDBF in this part since PTDBF is a centralized method. Fig. 6(a) shows that both PTDMUS and PTDSky need less computation time as the number of monitor nodes increases. With the minimum checking time mechanism, PTDMUS precludes irrelevant objects more effectively than PTDSky does. Thus, PTDMUS outperforms PTDSky in computation time by almost 60%. Fig. 6(b) indicates that the average total transmission costs between monitor and coordinator nodes in PTDMUS and PTDSky increase as the number of monitor nodes grows. In this simulation, we fix the size of the sliding window in each monitor node, so the transmission cost is related to $m\times|SW_{j}|$. With the minimum checking time table, PTDMUS saves more than 2,000 units of transmission cost (objects) across the various numbers of monitor nodes.

According to the simulation results in Fig. 6(c) and Fig. 6(d), PTDMUS is only $0.01\%$ worse than PTDSky in both precision and recall across scenarios with different numbers of monitor nodes. Again, such a tiny gap can be regarded as a tolerable error for most applications. In summary, in comparison with PTDSky, PTDMUS reduces the average computation time and the transmission cost significantly while maintaining nearly identical precision and recall.

VI-D Size of Sliding Window

In this part, we discuss the effect of different sliding window sizes $|SW_{j}|$ on the monitor nodes. Fig. 7(a) shows that all the methods have better computation time when $|SW_{j}|$ is relatively small ($|SW_{j}|=240$) or large ($|SW_{j}|=960$). For each monitor node, the sliding window can be regarded as a buffer that holds the data objects to be processed. In general, the computation cost increases as the sliding window grows. Nevertheless, Fig. 7(a) shows that the computation time of all the methods is significantly reduced when $|SW_{j}|=960$. The reason is that the coordinator node records a large candidate set whose size is bounded above by $m\times|SW_{j}|$. In the case of $|SW_{j}|=960$, the coordinator node records at most 9,600 data objects, which approaches the size of the given data set, $|U|=10,000$. According to (8), all the methods only need to execute $\varDelta t=40$ runs (slots), and each monitor node deals with only one new input object at every time slot. In such a scenario, the probability that a new input object becomes a candidate object is very low. Thus, the computation cost of updating the global candidate set on the coordinator node also decreases significantly. In the case of $|SW_{j}|=240$, all the compared methods have a low computation cost because the small $|SW_{j}|$ makes each run (slot) very fast. In summary, PTDMUS has the best computation cost in all the considered cases with different sliding window sizes. As shown in Fig. 7(b), PTDMUS needs lower transmission cost than PTDSky in all the scenarios with different sliding window sizes. PTDSky comes close to PTDMUS in transmission cost only when $|SW_{j}|=480$, but PTDMUS is still better.

Fig. 7(c) and Fig. 7(d) show that both PTDMUS and PTDSky achieve 99.998% precision and recall when $|SW_{j}|$ is 720 or 960. If $|SW_{j}|=480$, the gap between PTDMUS and PTDSky is smaller than 0.01% in terms of precision and recall. If $|SW_{j}|=240$, PTDMUS loses about 0.126% precision and 9.8% recall in comparison with PTDSky. However, the precision provided by PTDMUS (99.87%) is still acceptable for most applications except financial and emergency services.

VI-E Value of k

Since we consider the top-$k$ dominating query, the value of $k$ may affect the performance. Fig. 8(a) shows that the computation time of all the methods is independent of the value of $k$. In the case of a high-dimensional data set ($d=9$), PTDSky is slightly worse than PTDBF since the coordinator node wastes too much computation time on computing the global threshold-based probabilistic $k$-skyband over too many irrelevant objects. A similar result was presented in Fig. 5(a). Conversely, PTDMUS improves the computation time by more than 80% compared with PTDSky and PTDBF. From Fig. 8(b), we can observe that the transmission cost of both PTDMUS and PTDSky is also independent of the value of $k$. The reason is that both PTDMUS and PTDSky use a threshold $\delta$ to preclude the irrelevant objects, where $\delta$ is much smaller than $k$. In addition, PTDMUS saves almost 20% of the transmission cost owing to the minimum checking time.

Fig. 8(c) and Fig. 8(d) show that PTDMUS has the same trend in precision and recall. The precision and recall of PTDMUS slightly increase as the value of $k$ increases. When $k=200$ , PTDMUS can achieve $99.9985\%$ precision and recall. Although PTDSky also has the same precision and recall, PTDSky performs better than PTDMUS in precision and recall as the value of $k$ becomes smaller. PTDSky achieves 100% precision and recall when $k\leq 50$ . If $k>150$ , PTDMUS will achieve better precision and recall than PTDSky does.

VI-F Margin of Uncertainty

Finally, we discuss the effect of an object’s margin of uncertainty $M$, which is also called the object size. In general, the MBR of an uncertain data object becomes larger as $M$ increases, and thus partial dominance occurs more often. As shown in Fig. 9(a), PTDMUS has the best computation time, with a 60% improvement over PTDSky and PTDBF. PTDSky and PTDBF have similar computation time for $M\leq 240$. When $M>240$, PTDBF becomes the worst due to the frequent occurrence of partial dominance. Fig. 9(b) shows that the size of the candidate set is independent of the margin of uncertainty $M$. Thus, the transmission costs of PTDMUS and PTDSky are not affected by the margin of uncertainty.

According to the results in Fig. 9(c) and Fig. 9(d), the precision and recall of the query result provided by PTDMUS increase linearly as the margin of uncertainty $M$ increases. On the other hand, the precision and recall of PTDSky’s query result increase more significantly as $M$ becomes larger. PTDSky provides the query result with higher precision and recall only if $160\leq M<320$. For $M\leq 80$, PTDMUS has better precision and recall than PTDSky by more than 0.0025%.

VII Conclusion

In this paper, we have presented a new approach for Probabilistic Top-$k$ Dominating query over Multiple Uncertain data Streams (PTDMUS) to improve the computation efficiency of the probabilistic top-$k$ dominating query for Edge-IoT applications. With parallelism, the monitor nodes use the value of $k$ and the threshold-based probabilistic $k$-skyband to preclude most of the irrelevant objects in advance, thereby significantly reducing the transmission cost. The coordinator node caches the temporary result and uses the proposed minimum checking time to reduce the frequency of computing the dominant score of each object in the cache table. This effectively reduces the computation time and incrementally updates the result of the probabilistic top-$k$ dominating query with a lower update frequency. The simulation results show that PTDMUS improves the computation performance effectively while keeping good precision and recall.

In the future, we plan to apply PTDMUS to mobile edge computing frameworks to support multi-criteria decisions on the dynamic placement of drone base stations, thus providing reliable communication services for specific purposes and scenarios.

Acknowledgment

This research is partially supported by the Ministry of Science and Technology under Grants MOST 107-2221-E-027-099-MY2 and MOST 108-2634-F-009-006- through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan.
