Probabilistic Top-k Dominating Query Monitoring over Multiple Uncertain IoT Data Streams in Edge Computing Environments
Chuan-Chi Lai, Tien-Chun Wang, Chuan-Ming Liu, Li-Chun Wang

TL;DR
This paper presents a parallel probabilistic top-k dominating query method for uncertain IoT data streams in edge computing, significantly improving processing speed, reducing communication costs, and maintaining high accuracy.
Contribution
It introduces a novel parallel approach for top-k dominating queries over uncertain IoT streams, optimizing for speed, cost, and accuracy in edge environments.
Findings
Improves computation time by nearly 60%
Reduces communication cost by about 20%
Maintains high accuracy in most scenarios
| Object | Instance | Object | Instance |
|---|---|---|---|

| Parameter | Default Value | Range (type) |
|---|---|---|
| Number of data objects, | - | |
| Number of instances, | - | |
| Dimension, | ||
| Space of an attribute | - | |
| Number of monitor nodes, | ||
| Size of a local sliding window, | ||
| Size of the global sliding window, | ||
| Degree of R-tree, | - | |
| Threshold, | ||
| Margin of Uncertainty, | ||
| Distribution | Uniform | - |
C.-C. Lai and L.-C. Wang are with the Department of Electrical and Computer Engineering, National Chiao Tung University, 300 Hsinchu, Taiwan. T.-C. Wang is with the Department of Mobile Device Development, Compal Electronics, Inc., Taiwan. C.-M. Liu is with the Department of Computer Science and Information Engineering, National Taipei University of Technology, 10618 Taipei, Taiwan.
Abstract
Extracting valuable features and information from Big Data has become one of the important research issues in Data Science. In most Internet of Things (IoT) applications, the collected data are uncertain and imprecise due to sensor device variations or transmission errors. In addition, the sensing data may change as time evolves. We refer to a dataset that simultaneously has the velocity, veracity, and volume properties as an uncertain data stream. This paper exploits the parallelism in edge computing environments to facilitate the top-k dominating query process over multiple uncertain IoT data streams. The challenges of this problem include how to quickly update the result while handling uncertainty, how to reduce the computation cost, and how to provide highly accurate results. Building on existing work for certain data, we provide an effective probabilistic top-k dominating query process on uncertain data streams that can be parallelized easily. After discussing the properties of the proposed approach, we validate our methods through complexity analysis and extensive simulated experiments. In comparison with existing works, the experimental results indicate that our method can reduce computation time by almost 60%, reduce communication cost between servers by nearly 20%, and provide highly accurate results in most scenarios.
Index Terms:
Big Data, Internet of Things, Uncertain Data, Multiple Data Streams, Top-k Dominating.
I Introduction
Big data analysis has been widely applied in many fields in recent years. The well-known characteristics of big data are the following Vs: Volume, Velocity, Variety, Veracity, Variability, and Value. Many modern applications and services need to deal with big data from multiple sources, which can be viewed as a computing model over multiple uncertain data streams. For example, applications such as Massive Internet of Things (Massive IoT) [1], Smart City [2], and Location-Based Service (LBS) [3] can be regarded as implementations of a distributed/parallel sensing data processing model with multiple input uncertain data streams. The aforementioned applications exhibit at least three characteristics of big data: volume, velocity, and veracity. The volume of information is growing all the time, so an efficient parallel or distributed computing approach is required. In massive IoT environments, real-time monitoring is a typical application for detecting events that need to be avoided or alleviated. In this case, the users/operators are concerned only with the latest results for most queries, and thus the information is time-limited (the velocity feature). Because the data retrieval process is unreliable, many data are inaccurate or uncertain. In such a case, probabilities are used to represent the distribution of the different situations. Hence, a massive IoT application has to process the uncertain data effectively to provide near real-time results with high precision (the veracity feature).
Although big data can be handled by the Cloud Computing model, the response time (or latency) still cannot meet the requirements of some near real-time IoT monitoring applications. Edge Computing [4, 5] has thus become a promising architecture for improving the response time of IoT applications in recent years. Most researchers focus on developing new techniques for edge computing from the system design, communications, networking, and resource management [6, 7, 8] aspects. However, developing new techniques that process the IoT data efficiently from the data science/engineering aspect is also very important and helpful to IoT applications. Many researchers have thus proposed algorithms for different types of queries (demands) to find insightful knowledge in big data and make precise decisions. Skyline [9, 10, 11] and Top-k [12, 13, 14] queries are common research topics. However, the skyline and top-k queries lead to some discrepancies in the search results, and these two queries cannot satisfy the demand of some modern applications. Therefore, a new query for certain data, Top-k Dominating [15, 16, 17, 18], which combines the above two search features, has come into being.
In general, an uncertain data object is modeled with multiple probabilities, which represent the probabilities of the object's occurrences or errors in applications such as IoT data analysis. Such a data model makes the query process much more complicated. Some works [19, 20, 21] have discussed Probabilistic Top-k Dominating (PTD) query processing on uncertain data streams. Traditionally, to handle probabilistic top-k dominating queries, the system computes the dominant scores between different data objects and finds the k objects having the highest dominant scores. The computation time of such a straightforward process grows with both the number of instances in an object and the size of the input data set. As the amount of data increases dramatically, the system needs solutions that effectively reduce the computational complexity. In traditional centralized systems, the R-tree [22] is one of the most popular indexing structures for improving the performance of query processing. Owing to the spatial characteristics of the R-tree, the system can obtain a great performance improvement on the following operations: object search, value comparison, and pruning. However, centralized data structures and algorithms can no longer handle the big data brought about by the IoT era. Therefore, it is reasonable to improve the efficiency of computation using modern parallel and distributed computing. In this paper, we propose a Probabilistic Top-k Dominating query process over Multiple Uncertain data Streams (PTDMUS) algorithm to improve the efficiency of searching for the data objects that have the highest dominant scores in distributed real-time IoT monitoring applications. The contributions of this work are listed as follows.
- We provide a parallel processing model that uses R-trees, the k-skyband [13], and a threshold to preclude irrelevant objects in advance, thereby significantly reducing the computational overhead. The same idea can be applied to other similar types of queries.
- We propose an estimation theorem for distributed computing environments that predicts the time at which a data object has a chance to become part of the final result, and consequently decreases the frequency of dominance checks on the edge computing nodes.
- We present a theoretical analysis of PTDMUS in terms of time complexity, space complexity, and transmission cost in the average and worst cases.
- The simulation results indicate that PTDMUS outperforms the conventional method, saving about 60% of the computation time and about 20% of the transmission cost while keeping the precision and recall of the final query result near 100% in most scenarios.
The rest of the paper is organized as follows. Related work is reviewed in Section II. Section III presents the definitions, notations, and problem statement of this work. Section IV discusses the proposed solutions in detail, with algorithms and running examples. The theoretical analysis and discussion are given in Section V. Simulation results are presented in Section VI. Finally, we give concluding remarks in Section VII.
II Related Work
Many researchers have discussed range, skyline, and top-k queries over uncertain data in distributed computing environments. Nowadays, these query types cannot satisfy the demand of some modern applications. Hence, we focus on a more complex query, the top-k dominating query, in this work. In the remainder of this section, we introduce the related works on top-k dominating query processing from the data science aspect. The comparisons of conventional works are summarized in Table I, and each work is described below.
Miao et al. [15] proposed a Bitmap Indexing Guided (BIG) algorithm for improving the performance of processing top-k dominating queries on large incomplete datasets. Han et al. [16] provided a table-scan-based method with presorted results for improving the efficiency of top-k dominating query computations on massive data in the batch computing model. Amagata et al. [17] mapped multiple input datasets into a data space and then proposed a method that generates virtual points for effectively precluding unnecessary data objects in the data space. Ezatpoor et al. [18] applied the BIG algorithm [15] to the MapReduce framework to provide a parallel computing model that enhances the performance of processing top-k dominating queries on large incomplete datasets. However, only [17] and [18] proposed algorithms for distributed computing environments. Furthermore, the above approaches for certain data did not support continuous query processing in real-time IoT monitoring applications.
For uncertain data, only a few studies [21, 20, 19] have explored top-k dominating query processing so far. Zhang et al. [19] proposed a threshold-based algorithm to prune the irrelevant objects and thus improved the computational performance of the top-k dominating query. Zhan et al. [20] developed new pruning techniques that use spatial indexing and statistical information while considering the maximum/upper and minimum/lower bounds of probabilistic dominance, which reduced computational and I/O costs. Li et al. [21] proposed a method that postpones unnecessary calculations when the query results do not change dramatically within a certain period of time, thereby reducing the computational cost. However, these works did not consider how to process continuous queries over uncertain data in parallel for real-time IoT monitoring applications based on edge computing environments.
In summary, to the best of our knowledge, none of the existing works simultaneously considers the following characteristics: uncertain data, continuous probabilistic top-k dominating queries, distributed computing, and the real-time requirement for IoT monitoring. This shows that probabilistic top-k dominating query processing over uncertain data for edge-enabled real-time IoT monitoring applications remains a big challenge.
III Preliminaries
In this section, we introduce the fundamental assumptions, the system model, and the problem statement.
III-A Fundamental Assumptions
Three kinds of uncertain data models have been proposed and discussed in [23]: fuzzy, evidence-oriented, and probabilistic models. In this work, we adopt the last model in its discrete case, and an uncertain data object is defined in Definition 1.
**Definition 1 (Uncertain Data Objects).**
Given a -dimensional uncertain data set , each uncertain data object with instances is a probability distribution over the -dimensional space. Each instance of has attributes, , where , and is associated with a probability , where .
A simple example of a two-dimensional uncertain data set is presented in Table II, in which each uncertain data object has three possible instances. For example, object has three instances , , and with probabilities , , and , respectively. It means that may occur in three possible cases with different corresponding probabilities and the total probability of all cases will be 1. Note that we will use attribute or dimension interchangeably.
If we map the instances of the uncertain data objects onto a -dimensional space, each uncertain data object can be represented by a minimum bounding rectangle, MBR(), which is the minimum rectangle containing all the instances of in the space. Let and respectively denote the maximum and minimum corners of where and , where . Then, MBR() can be represented by where and . Note that is the probability value of instance . According to the example in Table II, and , so MBR() is . Fig. 1 shows the MBRs of each data objects indexed by an R-tree for the example in Table II. The four uncertain data objects , , , and , on a 2D plane with the associated MBRs and each object has three instances respectively. Note that we use the bulk loading algorithm [24] to construct the R-trees [22] in our work since it can utilize the space and avoid the overlapping issue between MBRs, thus improving the query time.
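As a minimal sketch of how an MBR can be obtained from the instances of one object, the snippet below takes the per-dimension minimum and maximum over all instances. The function name `mbr` and the sample coordinates are hypothetical and are not taken from Table II.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def mbr(instances: Sequence[Point]) -> Tuple[Point, Point]:
    """Return (min_corner, max_corner) of the rectangle enclosing all instances."""
    if not instances:
        raise ValueError("an uncertain object needs at least one instance")
    dims = range(len(instances[0]))
    min_corner = tuple(min(p[i] for p in instances) for i in dims)
    max_corner = tuple(max(p[i] for p in instances) for i in dims)
    return min_corner, max_corner


# Hypothetical two-dimensional object with three instances.
example_instances = [(2.0, 5.0), (3.0, 4.0), (1.0, 6.0)]
example_min, example_max = mbr(example_instances)
```

The probabilities of the instances play no role here; only the attribute values determine the rectangle.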
We consider multiple uncertain data streams and denote an Uncertain data Stream as US, in which the uncertain data objects are generated over time and become invalid after a period of time. Data streams play an important role in the era of big data with the advance of IoT technology and have attracted much attention for years. Most related studies use the sliding window model and focus on the recent data in the stream. Our work follows this trend.
In a data stream, each data object has a time stamp that denotes the time at which it entered the system. This scenario is usually modeled with a sliding window, which is defined below.
**Definition 2 (Sliding Window).**
Suppose the sliding window, SW, is of size . Then the data newly generated will be invalid after time instances. We use to denote the set of the uncertain data objects in the current sliding window at time . The considered sliding window follows the first-in-first-out rule for keeping the objects.
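The first-in-first-out behaviour of Definition 2 can be sketched with a double-ended queue. This is only an illustration under the assumption that time advances in integer steps and that an object inserted at time `t` expires once `size` time instances have passed; the class name `SlidingWindow` and its methods are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class SlidingWindow:
    """Time-based FIFO window: an object stays valid for `size` time instances."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._items: Deque[Tuple[int, Any]] = deque()  # (arrival_time, object)

    def insert(self, now: int, obj: Any) -> List[Any]:
        """Add obj at time `now` and return the objects that have just expired."""
        self._items.append((now, obj))
        expired = []
        # The oldest objects sit at the left end of the queue.
        while self._items and self._items[0][0] <= now - self.size:
            expired.append(self._items.popleft()[1])
        return expired

    def objects(self) -> List[Any]:
        """Objects currently valid, ordered by arrival time."""
        return [obj for _, obj in self._items]
```

Returning the expired objects from `insert` lets a caller update any index built over the window without scanning it again.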
In this paper, we use to denote the uncertain data objects in SW (i.e., SW= , according to the arrival time of each data object. To search the top- dominating objects, the system will use the dominant scores obtained from the skyline query. In our work, we assume that the value of a dimensional attribute is the smaller, the better. The dominant relations between instances can be defined as follows.
**Definition 3 (Instance-level Dominance).**
Given two uncertain data instances of two different uncertain data objects and where and , if the condition holds, we say dominates and it is denoted as .
In short, none of ’s attributes is better (smaller and except for equal) than ’s the corresponding dimensional attribute. Since a data object may has multiple instances, we can classify the object-level dominance into three cases. The relevant definitions are presented in the following.
**Definition 4 (Object-level Dominance).**
Suppose there are two uncertain data objects and and each object has instances. If is considered as a dominator, the relation between and can be classified by using following cases:
1. Complete Dominance: all the instances of dominate all the instances of , denoted as .
2. Partial Dominance: some instances of dominate some instances of , denoted as .
3. Missing Dominance: no instances of dominate any instance of , denoted as .
In summary, the probability of dominating can be generally expressed as
[TABLE]
Based on the above definitions, the score of the dominant relation between two objects is derived as follows.
**Definition 5 (Dominant Score of an Object).**
Given an uncertain data object with instances, the expected dominant score of an instance can be derived by
[TABLE]
Then, the dominant score of the uncertain object is defined as
[TABLE]
Consider the example in Table II and Fig. 1, where instances , , and dominate the following instances: , , , , , and by Definition 3. In other words, by Definition 4, object completely dominates objects and , denoted as and . The derivation of can be presented as
[TABLE]
For object , it completely dominates object and partially dominates object , so the calculation of will be
[TABLE]
Consequently, we can obtain the dominant scores of all the uncertain objects in the same way.
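The computations in Definitions 3 to 5 can be sketched as follows. Because the displayed equations are not reproduced here, the sketch rests on two labeled assumptions: an instance dominates another when it is no larger in every dimension and strictly smaller in at least one, and the probability that one object dominates another is the sum of the products of the probabilities of all dominating instance pairs. The names `dominates`, `dominance_probability`, and `dominant_score`, as well as the sample objects, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Instance = Tuple[Point, float]      # (attribute vector, probability)
UncertainObject = List[Instance]


def dominates(u: Point, v: Point) -> bool:
    """Assumed instance-level dominance; smaller attribute values are better."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def dominance_probability(U: UncertainObject, V: UncertainObject) -> float:
    """Assumed probability that object U dominates object V."""
    return sum(pu * pv for u, pu in U for v, pv in V if dominates(u, v))


def dominant_score(U: UncertainObject, others: Sequence[UncertainObject]) -> float:
    """Expected number of objects in `others` that U dominates."""
    return sum(dominance_probability(U, V) for V in others)


# Hypothetical objects with two equally likely instances each.
A = [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)]
B = [((3.0, 3.0), 0.5), ((4.0, 4.0), 0.5)]   # completely dominated by A
C = [((1.5, 1.5), 0.5), ((5.0, 5.0), 0.5)]   # partially dominated by A
score_A = dominant_score(A, [B, C])          # 1.0 from B plus 0.75 from C
```

A completely dominated object contributes exactly 1 to the score, a partially dominated object contributes a value between 0 and 1, and an object in the missing-dominance case contributes 0.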
III-B System Architecture
In this work, we construct an edge computing system, as shown in Fig. 2, that supports parallel and distributed computing for monitoring the top-k query over multiple uncertain IoT data streams, which improves the efficiency of computation. The system consists of a coordinator node (a cloud service) and multiple monitor nodes (edge computing nodes). Each monitor node can communicate directly with the coordinator node. From the coordinator node's point of view, all the information reported by a monitor node is treated as one uncertain data stream. Each monitor node continuously computes the local result of the query and uploads it to the coordinator node as the candidate result. The coordinator node records all the unexpired candidates received from each monitor node.
III-C Problem Statement
Our objective is a time-efficient approach for determining the uncertain objects with the top-k dominant scores among all the uncertain objects in the considered system model. A global sliding window and multiple uncertain data streams are given, and each stream corresponds to one monitor node. Each monitor node maintains its own local sliding window, examines the objects in it, saves the possible objects in a local candidate set, and then reports the local candidate set to the coordinator node. The coordinator node uses the received local candidate sets to calculate the global candidate set and then broadcasts it to each monitor node. Each monitor node uses the received global candidate set to derive the dominant scores of the objects that dominate others and then returns the scores to the coordinator node. After that, the coordinator node integrates the received score information of each object and finds the k data objects that have the highest scores. These Probabilistic Top-k Dominating objects form the final result set. Note that the above process is repeated until there is no more input data.
According to the above assumptions, there are three important issues to be solved:
1. How to avoid the unnecessary computations for the dominant scores in order to save the computation time?
2. How to minimize the number of local candidate objects for improving the transmission cost?
3. How to reduce the frequency of dominant score derivations as time evolves?
IV Probabilistic Top-k Dominating Query Process over Multiple Uncertain Data Streams (PTDMUS)
In this section, we present the proposed approach, Probabilistic Top-k Dominating Query Process over Multiple Uncertain Data Streams (PTDMUS). PTDMUS provides three mechanisms to solve the above three issues and we respectively introduce each of them in detail.
IV-A The Computation with R-trees
In the first part, we apply R-trees to the considered system for improving the computational speed of the dominant score derivation. Note that we use [25] to generate bulk loading R-trees and minimize the overlaps of elements in each level, thereby optimizing the searching time. By combining the characteristics of an MBR in Definition 4, we can define the dominant relation between different MBRs as Definition 6.
**Definition 6 (Dominance between different MBRs).**
Given two different minimum bounded rectangles , of object and respectively, if we consider as a dominator, we can classify the relation between and as following cases:
1. Complete Dominance: of is smaller than of of , denoted as .
2. Partial Dominance: of is smaller than of of , denoted as .
3. Missing Dominance: of is larger than of of , denoted as .
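One way to read the three cases of Definition 6 is as a comparison of MBR corners. The sketch below is an interpretation, not the authors' exact formulation: it assumes that complete dominance holds when the maximum corner of the dominator's MBR dominates the minimum corner of the other MBR, that dominance is impossible when the dominator's minimum corner does not dominate the other MBR's maximum corner, and that everything in between is treated as partial and needs an instance-level check. All names are hypothetical.

```python
from enum import Enum
from typing import Tuple

Point = Tuple[float, ...]
MBR = Tuple[Point, Point]  # (min_corner, max_corner)


class Relation(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    MISSING = "missing"


def dominates(u: Point, v: Point) -> bool:
    """Assumed point dominance; smaller attribute values are better."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def classify(dominator: MBR, other: MBR) -> Relation:
    """Classify how the dominator's MBR relates to another MBR."""
    d_min, d_max = dominator
    o_min, o_max = other
    if dominates(d_max, o_min):
        # Even the worst corner of the dominator beats the best corner of the other.
        return Relation.COMPLETE
    if dominates(d_min, o_max):
        # Some instance pairs may be in a dominance relation; refine later.
        return Relation.PARTIAL
    # No instance of the dominator can dominate any instance of the other.
    return Relation.MISSING
```

Under these assumptions the `COMPLETE` and `MISSING` answers are exact, while `PARTIAL` is conservative: it only says that an instance-level computation is still required.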
Using the cases in Definition 6, the system can preclude irrelevant objects effectively and thus reduce the computation overhead of deriving dominant scores. For example, to derive the dominant score of an object, the system takes the MBR of that object and the R-tree as inputs. The system puts all the children of the root into a Target Set and then examines the relation between the object's MBR and each element in the Target Set:
1. Complete Dominance: if is an MBR node, add the number of objects in to ; otherwise, is an object and .
2. Partial Dominance: if is an MBR node, put all the children of in the Next Target Set (); otherwise, is an object and directly is added to the dominant score of with respect to .
3. Missing Dominance: do nothing.
After examining all the elements in , if is not empty, the system will clear and insert all the elements of to . The system will do the above operations repeatedly until is empty and can obtain the dominant scores of with respect to all the objects in the R-tree. The above computation process with R-tree will be executed on monitor nodes in PTDMUS.
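The level-by-level traversal with a Target Set and a Next Target Set can be sketched on a hand-built tree. This is an illustration only: the tree is a plain nested structure rather than a bulk-loaded R-tree, the MBR tests follow an assumed corner-based reading of Definition 6, a completely dominated object is assumed to contribute 1 to the score, and partially dominated objects are scored with an assumed pairwise probability sum. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, ...]
Instance = Tuple[Point, float]  # (attribute vector, probability)


def dominates(u: Point, v: Point) -> bool:
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


@dataclass
class Obj:
    instances: List[Instance]

    def mbr(self) -> Tuple[Point, Point]:
        pts = [p for p, _ in self.instances]
        dims = range(len(pts[0]))
        return (tuple(min(p[i] for p in pts) for i in dims),
                tuple(max(p[i] for p in pts) for i in dims))

    def count(self) -> int:
        return 1


@dataclass
class Node:
    children: List[Union["Node", Obj]]

    def mbr(self) -> Tuple[Point, Point]:
        boxes = [c.mbr() for c in self.children]
        dims = range(len(boxes[0][0]))
        return (tuple(min(b[0][i] for b in boxes) for i in dims),
                tuple(max(b[1][i] for b in boxes) for i in dims))

    def count(self) -> int:
        return sum(c.count() for c in self.children)


def dominance_probability(U: Obj, V: Obj) -> float:
    return sum(pu * pv for u, pu in U.instances
               for v, pv in V.instances if dominates(u, v))


def dominant_score(query: Obj, root: Node) -> float:
    """Traverse the tree level by level, pruning with MBR comparisons."""
    q_min, q_max = query.mbr()
    score = 0.0
    target = list(root.children)          # Target Set
    while target:
        next_target = []                  # Next Target Set
        for e in target:
            if e is query:
                continue                  # an object is not compared with itself
            e_min, e_max = e.mbr()
            if dominates(q_max, e_min):   # complete dominance
                score += e.count()
            elif dominates(q_min, e_max): # partial dominance
                if isinstance(e, Node):
                    next_target.extend(e.children)
                else:
                    score += dominance_probability(query, e)
            # missing dominance: do nothing
        target = next_target
    return score
```

Whole subtrees are either counted at once (complete dominance) or skipped (missing dominance), so instance-level work is only spent on partially dominated objects.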
IV-B Threshold-based Probabilistic k-skyband
To reduce the number of candidate objects, [13] proposed a k-skyband approach for the top-k query on certain data. In our work, we follow this idea and define a probabilistic k-skyband to minimize the size of the candidate set. First, we define the dominated score of an uncertain object as follows.
**Definition 7.**
Given an uncertain data object in the sliding window , the score of an instance being dominated (or called dominated score) is defined as
[TABLE]
where is an instance of , and the dominated score of is
[TABLE]
After obtaining the dominated score of each object, we can define the probabilistic k-skyband () as follows.
**Definition 8 (Probabilistic k-skyband).**
For a given integer k, the probabilistic k-skyband () is a set of uncertain data objects and
Note that, for certain data, the top-k dominating result is always a part of the k-skyband [26]. For uncertain data, a similar property holds, as shown in Theorem 1.
**Theorem 1.**
Given a set of probabilistic top-k dominating objects , , if , then .
Proof.
If , then - according to Definition 8. In this case, at least k other objects dominate on average. It is hence impossible for to be one of the top-k dominating objects and . This contradicts the given condition, which completes the proof. ∎
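The filtering idea behind Definition 8 and Theorem 1 can be sketched as follows. Since the displayed formulas are not reproduced here, the sketch assumes that the dominated score of an object is the sum, over all other objects, of the probability that the other object dominates it, and that the probabilistic k-skyband keeps the objects whose dominated score is below k (as the proof of Theorem 1 suggests). Function names and sample objects are hypothetical, and the threshold of Definition 9 is not modeled.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Instance = Tuple[Point, float]      # (attribute vector, probability)
UncertainObject = List[Instance]


def dominates(u: Point, v: Point) -> bool:
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def dominance_probability(U: UncertainObject, V: UncertainObject) -> float:
    return sum(pu * pv for u, pu in U for v, pv in V if dominates(u, v))


def dominated_score(name: str, window: Dict[str, UncertainObject]) -> float:
    """Expected number of other objects in the window that dominate `name`."""
    target = window[name]
    return sum(dominance_probability(other, target)
               for other_name, other in window.items() if other_name != name)


def probabilistic_k_skyband(window: Dict[str, UncertainObject], k: int) -> List[str]:
    """Names of the objects that are dominated by fewer than k objects on average."""
    return sorted(name for name in window if dominated_score(name, window) < k)


# Hypothetical window with four objects.
window = {
    "A": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],
    "B": [((3.0, 3.0), 0.5), ((4.0, 4.0), 0.5)],
    "C": [((1.5, 1.5), 0.5), ((5.0, 5.0), 0.5)],
    "D": [((0.5, 0.5), 0.5), ((0.6, 0.6), 0.5)],
}
candidates = probabilistic_k_skyband(window, k=2)
```

In this hypothetical window, objects B and C are each dominated by more than two objects on average, so for k = 2 they are pruned before any dominant score is computed for them.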
In order to use the property in Theorem 1 for processing probabilistic top-k dominating queries in parallel, we derive the Threshold-based Probabilistic k-skyband by giving a threshold .
**Definition 9 (Threshold-based Probabilistic k-skyband).**
For a given integer k and a threshold value , the threshold-based probabilistic k-skyband (TKS) is a subset of and
This method addresses the second issue mentioned in the problem statement and reduces the size of the candidate set in each monitor node. In this way, the coordinator node also has fewer received candidate objects to process as possible top-k dominating objects. In conventional methods [26, 27], the k-skyband is computed on both the monitor and coordinator nodes. However, in most modern big data applications this is not efficient, since the volume of candidates is usually still too large relative to the top-k dominating objects that users care about. The coordinator node still spends too much computation on the k-skyband calculation, which makes the response time intolerable to users. Hence, in the proposed PTDMUS approach, the coordinator node uses a new mechanism, Minimum Checking Time (MCT), to help derive the final result efficiently instead of computing the global threshold-based probabilistic k-skyband as the candidate set, . Note that and is the local result of the threshold-based probabilistic k-skyband from the monitor node . Such a mechanism helps the coordinator node process only the objects that are really relevant to the query, decreases the frequency of score derivation on irrelevant objects, and significantly reduces the computational cost as well as the response time.
IV-C Minimum Checking Time
We observe that, in most scenarios, the candidate set uploaded by each monitor node is very similar to the one it uploaded previously. In other words, the local candidates received from each monitor node do not often change dramatically as time moves. Therefore, we can record the statuses of candidate objects so that each monitor node uploads only the objects that need to be updated. This alleviates the transmission cost mentioned in the second issue.
In this paper, we use a checking-time table to record the statuses of the data objects received by the coordinator node. The status of each received data object is stored in one entry of the table until the lifetime of the object is over. Note that the lifetime of a data object is equal to the length of the sliding window. The coordinator node thus only needs to update the status of the data objects in the table when necessary, and then calculates the final result. In general, most of the objects will not be in the final result. Hence, we can use a predictive way to determine the minimum checking time for the coordinator node. With the checking-time table and the minimum checking time derivation, the coordinator node needs to perform the computation only when the result may change. This idea comes from the conventional work [21] on centralized database systems. We thereby propose a new, distributed version of the theorem for dynamically determining the minimum checking time at which to update the result set in distributed environments.
**Theorem 2 (Minimum Checking Time).**
With the notations defined as above, the minimum checking time for the coordinator node represents the lower bound of the expected time at which the result set of probabilistic top-k dominating objects will change, and it can be derived by
[TABLE]
where is the nearest (smallest) expiry time of an object in the set of , is the k-th highest dominant score of the objects in , and is the current time.
Proof.
When and , has a chance to be in if one of the following two cases is satisfied:
1. some objects in are expired;
2. .
Our objective is to obtain the minimum time that can be in . For Case 1, the process will search the minimum expired time of all objects in and it is depicted as . For Case 2, if object is the object with -th highest dominant score in and object is going to be removed from . Removing from will reduce the difference gap, , between and . It means that some objects in result in . In general, there could be many old objects like . The system needs to remove at least old objects to remain the minimum number of necessary objects in , and then holds. Since each run of the computation can remove old objects like , the minimum time period needs to be divided by and it will be . After that, add the obtained minimum time period to the current time and thus get the predicted time, , that . Finally, the minimum checking time is updated by with respect to an object , which is denoted as . ∎
While computing of each object , if the obtained time is equal to , it means that has a chance to become the result in this run but the dominant score of is not large enough to make . Then, if during the next run of computation, the system will record each in the checking-time table and set for the next run of computation. In summary, with the proposed theorem, the checking-time table acts as a priority cache table and helps effectively reduce the frequency of computation, which addresses the third issue.
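The role of the checking-time table as a priority cache can be sketched with a binary heap keyed by the next checking time. This sketch does not implement the formula of Theorem 2, which is not reproduced here; it assumes the checking time and expiry time of each object are supplied by the caller. The class name `CheckingTimeTable` and its methods are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class CheckingTimeTable:
    """Keeps, per object, the next time its score must be re-examined."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, str]] = []          # (check_time, object id)
        self._entries: Dict[str, Tuple[int, int]] = {}  # id -> (check_time, expire_time)

    def schedule(self, obj_id: str, check_time: int, expire_time: int) -> None:
        """Insert or reschedule an object; older heap entries become stale."""
        self._entries[obj_id] = (check_time, expire_time)
        heapq.heappush(self._heap, (check_time, obj_id))

    def due(self, now: int) -> List[str]:
        """Return the ids that must be re-evaluated at time `now`."""
        result = []
        while self._heap and self._heap[0][0] <= now:
            check_time, obj_id = heapq.heappop(self._heap)
            entry = self._entries.get(obj_id)
            if entry is None or entry[0] != check_time:
                continue                   # stale entry left by a reschedule
            if entry[1] <= now:
                del self._entries[obj_id]  # lifetime is over, drop the object
                continue
            result.append(obj_id)
        return result
```

Objects whose checking time lies in the future cost nothing at the current time step, which is the source of the reduced computation frequency; after an object is returned by `due`, the caller is expected to compute a new checking time and call `schedule` again.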
IV-D The Process of PTDMUS
The overall process of PTDMUS has been briefly described in Section III-C. In this subsection, we show the whole process using the pseudo-code in Algorithm 1. The system executes Algorithm 1 repeatedly until no more input data arrive. First, each monitor node inserts the newly received data objects into its local sliding window at the current time and removes the oldest data objects, because the size of a local sliding window is limited. Next, each monitor node pre-processes all the objects in its local sliding window to construct the local R-tree and to obtain the dominant relations between MBRs using Definition 6. Each monitor node then computes the local candidate set (the local k-skyband) with Definition 5, Definition 7, and Definition 9; this set is the local result of threshold-based top-k dominating objects from that monitor node. If the whole process is in its initial phase, each monitor node uploads the whole candidate set to the coordinator node; otherwise, each monitor node only needs to upload the necessary update information.
After the coordinator node receives the local candidate set from each monitor node, it derives the global candidate set in the same way (using Definition 7 and Definition 8). The coordinator node then broadcasts the global candidate set to every monitor node and asks each of them to help with the local computation. Each monitor node derives the dominant and dominated scores of all the objects in the global candidate set and, using Definition 9, uploads the updated result to the coordinator node. The coordinator node uses the received score information to update its records, finds the final global result for the current time, and then updates the minimum checking times of the objects that may become part of the answer later. Using Theorem 2, the coordinator node broadcasts the checking-time information in the checking-time table to every monitor node. With the checking-time table, each monitor node can determine the appropriate time for the next round of update/derivation, which reduces the frequency of computation and avoids a lot of unnecessary work. Finally, the system returns the final result set to the user.
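One round of the monitor/coordinator exchange described above can be sketched in a single process. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: monitors are plain dictionaries of uniquely named objects, local candidates are the objects whose local dominated score is below k (the threshold of Definition 9, the R-tree, and the checking-time table are omitted), and the global dominant score of a candidate is taken to be the sum of its dominant scores over all local windows. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Instance = Tuple[Point, float]
UncertainObject = List[Instance]
Window = Dict[str, UncertainObject]


def dominates(u: Point, v: Point) -> bool:
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def dominance_probability(U: UncertainObject, V: UncertainObject) -> float:
    return sum(pu * pv for u, pu in U for v, pv in V if dominates(u, v))


def local_candidates(window: Window, k: int) -> Window:
    """Monitor side: keep objects dominated by fewer than k local objects on average."""
    keep = {}
    for name, obj in window.items():
        dominated = sum(dominance_probability(other, obj)
                        for other_name, other in window.items() if other_name != name)
        if dominated < k:
            keep[name] = obj
    return keep


def local_scores(window: Window, candidates: Window) -> Dict[str, float]:
    """Monitor side: dominant score of every global candidate w.r.t. this window."""
    return {name: sum(dominance_probability(obj, other)
                      for other_name, other in window.items() if other_name != name)
            for name, obj in candidates.items()}


def run_round(windows: List[Window], k: int) -> List[str]:
    """Coordinator side: merge candidates, gather scores, return the top-k names."""
    global_candidates: Window = {}
    for window in windows:                      # monitors upload local candidates
        global_candidates.update(local_candidates(window, k))
    totals = {name: 0.0 for name in global_candidates}
    for window in windows:                      # broadcast, then collect scores
        for name, score in local_scores(window, global_candidates).items():
            totals[name] += score
    return sorted(totals, key=lambda n: (-totals[n], n))[:k]
```

Local pruning is safe in this sketch because an object that is already dominated by at least k objects on average within one window is dominated by at least as many in the union of all windows.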
V Complexity Analysis
After introducing the proposed process of PTDMUS, we analyze and discuss its time complexity, space complexity, and transmission cost in both the average case and the worst case, respectively.
V-A Time Complexity
In the first run (time slot ) of the PTDMUS process, mentioned in the previous section, each monitor node takes time on constructing a local R-tree, , with all the data objects in at the initial step, deriving and - of each in at the second step, and extracting the threshold-based top- dominating objects into at the last step. Hence, the time complexity of PTDMUS on a monitor node can be expressed as
[TABLE]
In PTDMUS, the time complexity is related to maintaining and searching the R-trees. According to [28] [29], the time for constructing a -dimensional R-tree is where is the block (or page) size of data on the disk (or memory), and is the fanout (degree) of the R-tree. In this work, we deal with the uncertain data objects in an object-oriented model (), so the time for local R-tree’s construction will be
[TABLE]
In the considered environment, we assume that the data points are uniformly and independently distributed in the domain space . To simplify the analysis, we normalize the space into . According to [30], ’s height and the number of nodes at level (let the leaf level be [math]) will be approximately and , respectively. Besides, the extent (i.e., length of any 1D projection) of a node at the -th level can be estimated by and some nodes in the -th level may be partially dominated by . Fig. 3(a) shows that the gray region corresponds to the maximal region, covering nodes (at level ) that are partially dominated by . Then, the average number of required node accesses in the R-tree for computing the dominant score of object will be [31]
[TABLE]
where is the value of after the 1D projection and is the number of instances in an object. Hence, the time complexity of dominance update on the monitor node can be expressed as
[TABLE]
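The R-tree quantities used in this analysis (height, nodes per level, node extent) can be evaluated numerically with the sketch below. The exact expressions of the paper are not reproduced here; the code assumes the estimates commonly used for uniformly distributed data in a unit space, with the leaf level numbered 0. The function names and the fanout value 50 are illustrative assumptions.

```python
import math


def rtree_height(n_objects, fanout):
    """Approximate height: 1 + ceil(log_f(N / f))."""
    return 1 + math.ceil(math.log(n_objects / fanout, fanout))


def nodes_at_level(n_objects, fanout, level):
    """Approximate node count at a level, with the leaf level numbered 0."""
    return n_objects / fanout ** (level + 1)


def node_extent(n_objects, fanout, level, dim):
    """Approximate side length of a node's MBR in the unit space [0, 1]^dim."""
    return (1.0 / nodes_at_level(n_objects, fanout, level)) ** (1.0 / dim)


# 10,000 objects as in the simulation; fanout 50 is a hypothetical choice.
height = rtree_height(10_000, 50)
leaf_nodes = nodes_at_level(10_000, 50, 0)
leaf_extent_2d = node_extent(10_000, 50, 0, 2)
```

Under these assumptions, node extents grow with the dimension for a fixed node count, which is consistent with the analysis above in that more nodes become partially dominated and must be accessed.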
To obtain the local threshold-based probabilistic -skyband, the monitor node also needs to traverse the to derive the - of each object in the . Fig. 3(b) shows that the gray region corresponds to the maximal region, covering nodes (at level ) that partially dominate . The average number of required node accesses in the R-tree for computing the - of object will be
[TABLE]
where is the value of after the 1D projection. Hence, the time complexity of -skyband update on a monitor node can be derived by
[TABLE]
That is, with (4) and (5), the time complexity of the second step on will be
[TABLE]
where . In the last step, will copy all the objects in to a temporary candidate list , sort the objects in decreasing order by using merge-sort where , and then use the threshold to extract the local probabilistic -skyband. Therefore, the time complexity of can be denoted as
[TABLE]
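The last step (copy, sort in decreasing order, then prune by the threshold) can be sketched as follows. The field names `score` and `prob` and the rule "keep an object when its probability is at least the threshold" are assumptions made for illustration. Python's built-in `sorted` uses a merge-sort-based algorithm with O(n log n) worst-case cost, matching the sorting cost assumed in the analysis.

```python
def extract_candidates(candidates, threshold):
    """Sort by score (descending) and keep objects whose probability meets the threshold."""
    # sorted() returns a new list, so the input list is left untouched.
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [c["id"] for c in ranked if c["prob"] >= threshold]


# Hypothetical candidate objects of one monitor node.
candidates = [
    {"id": "o1", "score": 4.2, "prob": 0.9},
    {"id": "o2", "score": 7.5, "prob": 0.4},
    {"id": "o3", "score": 6.1, "prob": 0.8},
]
result = extract_candidates(candidates, threshold=0.5)
```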
In summary, with (3) to (V-A), we can express (V-A) as
[TABLE]
where .
Note that PTDMUS needs to monitor the result of the top- dominating query continuously in a monitoring time period and is set by
[TABLE]
After the first run (time slot), the coordinator node will broadcast the global candidate set at time with the minimum checking time, , to each monitor node . Each can use the received information to reduce the computational overhead during the next computation of local result when . In the following time slots, uses the candidate set of previous run, , to construct the global R-tree, , for dominance checks instead of using . In practice, we use two temporary lists, and , to help the update of candidate set during the time period . is used to record the objects that are going to be deleted where . is used to store the new input objects that are going to be added where . Thus, the exact data set that needs to be processed at time becomes and uses to construct the new local for computing . Using to substitute with (3) to (V-A), the time complexity of a derivation run on at time can be obtained by
[TABLE]
where . In summary, the average complexity during the time will be
[TABLE]
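The incremental update with the two temporary lists can be illustrated with plain set operations. This is a sketch under the assumption that the working set is the previous candidate set minus the expired objects plus the new arrivals; the function and variable names are hypothetical.

```python
def next_working_set(previous_candidates, to_delete, to_add):
    """Return (previous_candidates - to_delete) | to_add without mutating the inputs."""
    return (set(previous_candidates) - set(to_delete)) | set(to_add)


previous = {"o1", "o2", "o3"}
working = next_working_set(previous, to_delete={"o1"}, to_add={"o4"})
```

Because only the working set is indexed and checked, the cost of a run depends on its size rather than on the full sliding window.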
In fact, the derived costs in PTDMUS and PTDSky methods are similar since both of them use monitor nodes to derive the local -skybands. From (V-A), we can see that the computation time is significantly influenced by the computation cost of each run when . PTDMUS uses the minimum checking time to reduce the frequency of derivations (or dominance checks) where . If , each monitor node in PTDMUS and PTDSky will have similar computation time . The worst case only occurs when and always receives the global candidate set . In such a scenario, the set ’ that needs to be processed at each time slot will be ’ and ’ will become very large. To obtain the upper bound of the time complexity, , on a monitor , we can use and substitute ’ for in (V-A) and (V-A).
After analyzing the time complexity on a monitor node, time complexity on the coordinator node, , also needs to be discussed. However, depends on the size of global candidate set , so we will discuss after analyzing the space complexity on in the next subsection.
V-B Space Complexity
In the considered parallel computing model, the size of global candidate set in the coordinator node depends on the size of received local threshold-based probabilistic -skyband from each monitor node and the number of monitor nodes . Suppose that is an indicator function defined as
[TABLE]
In most application scenarios of big data, the size of local result, , is usually larger than . Thus, the average size of global candidate set will be
[TABLE]
where is the average size of the received local threshold-based probabilistic -skybands from the monitor nodes. Note that both and are usually much larger than in most big data applications. In general, is much smaller than due to the dominance and object pruning by the threshold.
Considering the worst case, the space complexity of candidate set in the monitor node can be denoted as
[TABLE]
where is the average size of the sliding windows in monitor nodes. The worst case only happens when all the uncertain data objects are anti-correlated in all dimensions. It means that the condition - holds and causes all the data objects in monitor nodes to be uploaded to the coordinator node. Thus, the space complexity of the worst case in the monitor node is . However, such a case is almost impossible to occur in big data environments.
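The effect of anti-correlated data on the candidate set size can be seen in a small example. The sketch below uses certain (single-instance) 2-D points and assumes smaller values are preferred, which is a simplification of the uncertain model in this paper. It computes a k-skyband as the set of points dominated by fewer than k other points.

```python
def dominates(a, b):
    """True if point a dominates point b (smaller is better, by assumption)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def k_skyband(points, k):
    """Points dominated by fewer than k other points."""
    return [
        p for p in points
        if sum(dominates(q, p) for q in points if q != p) < k
    ]


# Anti-correlated: improving one attribute always worsens the other.
anti_correlated = [(i, 9 - i) for i in range(10)]
# Correlated: one point is best in every attribute.
correlated = [(i, i) for i in range(10)]

worst_case = k_skyband(anti_correlated, k=1)
best_case = k_skyband(correlated, k=1)
```

With anti-correlated points no object dominates any other, so nothing can be pruned and the whole window would be uploaded. With correlated points a single object survives.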
After discussing the average and the worst space complexities on the coordinator node respectively, we can start discussing the time complexity of the computation on . In PTDMUS, instead of computing the global -skyband, just uses merge-sort to sort the received data objects from the monitor nodes by in a decreasing order, derives the expected checking time of , and finds the minimum checking time at each run (time ), where . Hence, the average time complexity for one run on , , can be formulated as
[TABLE]
Due to the usage of the minimum checking time, the expected average time complexity can be derived by
[TABLE]
where in (V-B) and . Additionally, the worst case occurs when (or ) and (12) holds. Then the worst time complexity can be obtained by
[TABLE]
V-C Transmission Cost
In general, the transmission cost depends on the sizes of local probabilistic -skybands and the global candidate set. According to the process of PTDMUS in Algorithm 1, the average transmission cost of a monitor node can be expressed as
[TABLE]
where , , and are respectively the local threshold-based -skyband, candidate set, and checking time table at the initial step (the first time slot), and is the minimum set of candidate objects needed to be updated at the time slot. Note that is expressed as
[TABLE]
In general, is much smaller than and . With (15), PTDMUS only needs to upload the update information times during the monitor time . By contrast, PTDSky needs to upload information at every time slot of . In summary, Equations (13) to (V-C) are used to measure the average transmission cost of PTDMUS in the simulations.
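The difference in upload frequency can be illustrated with a small simulation. The function `next_check` below is a hypothetical stand-in for the rule that yields the next checking time (Theorem 2 is not reproduced here), and the fixed gap of three slots is an arbitrary example value, not a result from the paper.

```python
def run_monitor(num_slots, next_check):
    """Return the time slots at which a monitor node derives and uploads an update.

    next_check(t) is a hypothetical rule giving the next slot to check after slot t.
    """
    uploads = []
    due = 1  # the first slot always uploads the initial candidate set
    for t in range(1, num_slots + 1):
        if t >= due:
            uploads.append(t)
            due = next_check(t)
    return uploads


# Uploading at every slot versus only at the announced checking times.
every_slot = run_monitor(10, lambda t: t + 1)
with_checking_time = run_monitor(10, lambda t: t + 3)
```

In this toy setting the checking-time variant uploads in 4 of 10 slots; the actual saving depends on how often the query result changes.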
The worst case of update cost only occurs when each input data object in the consecutive time slots always becomes the top- dominating object. In this case, will become and the monitor node always needs to upload at every time slot. In addition, the worst transmission cost on the information exchange between and occurs when . Hence, the worst transmission cost (or network load) of a monitor node will be
[TABLE]
VI Simulation Results
The simulations, including all compared approaches, are implemented in Java with Spark using the Eclipse IDE, and the developed program is platform-independent. The simulation program is executed on a Windows 10 server with an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz - 3.80GHz and 8GB 2 memory. In this simulation, we use synthetic data and the number of uncertain data objects is 10,000. We compare three different approaches:
- PTDMUS runs with R-trees and the threshold-based probabilistic -skyband in the monitor nodes, and with R-trees and the minimum checking time in the coordinator node;
- PTDSky executes with R-trees and threshold-based probabilistic -skyband in both monitor and coordinator nodes [13];
- PTDBF only runs with R-trees in a centralized way without any parallelism.
Since PTDBF is performed in a centralized server with the global information including all input data streams, PTDBF always has the correct result of a top- dominating query. Hence, PTDBF is treated as the baseline method in the simulation. The performance of the above compared approaches is measured in terms of the computation time, transmission cost, precision, and recall, while considering the effects of threshold , data dimensionality, the number of monitor nodes, the size of sliding window, the value of , and the margin of uncertainty. In the previous section, both computation time and transmission cost have been analyzed in detail for the average case and the worst case. The correctness/reliability of the proposed method is also important, so we validate the above methods in the simulation in terms of precision and recall. Suppose that is the result set of top- dominating objects obtained from PTDBF at time and is the one obtained from PTDMUS or PTDSky at time . The precision and recall can then be obtained by
[TABLE]
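The two accuracy metrics can be computed from the result sets as sketched below, using the standard set-based definitions: precision is the share of returned objects that also appear in the baseline result, and recall is the share of baseline objects that were returned. The function name and the sample identifiers are illustrative.

```python
def precision_recall(baseline, result):
    """Precision and recall of `result` against the baseline (PTDBF-style) result set."""
    baseline, result = set(baseline), set(result)
    hits = len(baseline & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(baseline) if baseline else 0.0
    return precision, recall


# Hypothetical result sets at one time slot.
precision, recall = precision_recall(
    baseline={"o1", "o2", "o3", "o4"},
    result={"o1", "o2", "o5"},
)
```

When both sets contain exactly k objects the two values coincide; they differ only when the evaluated method returns fewer or more objects than the baseline.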
We perform the simulations in 20 different scenarios and each scenario is executed runs (time slots) to get the average results and is set by (8). The detailed setting of parameters is presented in Table III.
VI-A Threshold
As shown in Fig. 4(a), the computation time of PTDSky grows linearly as the given threshold increases. This is because PTDSky needs to process more candidate objects for the threshold-based -skyband when the data dimension is high (), and when the threshold becomes loose, the computation time of PTDSky increases. PTDBF just needs to handle all the uncertain data objects and sorts the result by the dominant score of object directly, thus having the worst computation time, which is independent of the threshold. PTDMUS has the best performance on computation time since it can avoid unnecessary computation on irrelevant objects with the minimum checking time. PTDMUS can perform almost 10 times faster than PTDSky when .
According to Fig. 4(b), PTDMUS can save almost 30% transmission cost compared with PTDSky when . In general, both PTDSky and PTDMUS need higher transmission cost since the local and the global candidate sets become large as increases. However, with a table recording the minimum checking times of possible candidates, the monitor and coordinator nodes in PTDMUS do not need to exchange much candidate-set information if the continuous query result does not change a lot. As a result, PTDMUS can outperform PTDSky significantly. In addition, when increases, the global candidate set on the coordinator node becomes larger. Then PTDMUS can record more information (minimum checking times) of candidate objects, thereby avoiding the unnecessary transmission of irrelevant objects. In the rest of the simulation, we choose as the default threshold. Note that PTDBF performs in a centralized way, so it does not have transmission cost.
Since we use threshold-based -skyband to prune irrelevant objects in our proposed approach, we now measure its influence on the accuracy of the query result. Fig. 4(c) and Fig. 4(d) show that PTDMUS loses less than 0.001% in precision and recall, respectively. Such a tiny performance gap can be regarded as a tolerable error. In other words, with the minimum checking time, PTDMUS can reduce transmission cost significantly while keeping good precision and recall.
VI-B Data Dimensionality
Fig. 5(a) shows that PTDBF has poor performance on computation time, especially when the dimension is small. In general, for each data object, the number of its dominated objects is large when is small. In other words, each data object has a high probability of being dominated by the other objects when is small. Hence, PTDBF needs more computations on dominance checks when is small. In addition, from the implementation perspective, PTDBF needs many more branch operations (conditions) for the dominance checks between each pair of objects, so its computation time is the worst. In comparison with PTDBF, both -skyband based methods, PTDSky and PTDMUS, need less computation time since -skyband can utilize the characteristics of R-trees and MBRs for precluding irrelevant objects effectively. When the dimensionality becomes large (), PTDSky and PTDBF have similar performance in computation time. On the other hand, PTDMUS has the best computation time and outperforms PTDSky and PTDBF by more than 85% when .
As shown in Fig. 5(b), PTDSky and PTDMUS have very similar performance in average transmission cost when . In this simulation, the size of the given uncertain data set is 10,000. We can observe that PTDSky and PTDMUS need to transmit more than 9,000 candidate objects when . Such a phenomenon indicates that the score of an object dominating other objects decreases significantly, so the number of candidate objects becomes large and close to . In the case of , the coordinator node in PTDMUS can record the minimum checking time of more than 9,000 objects and the minimum checking time table can help the coordinator node avoid transmitting the information of irrelevant objects to monitor nodes at some time slots (runs). Hence, the average transmission cost of PTDMUS can be improved by nearly 20% in such a scenario ().
Even though it uses a predictive mechanism to reduce the frequency of updating candidate objects, Fig. 5(c) and Fig. 5(d) show that PTDMUS can achieve almost the same performance on precision and recall as PTDSky does. In comparison with PTDSky, PTDMUS only loses 0.001% on both precision and recall for . In most applications, such a tiny loss of performance can be regarded as a tolerable error.
VI-C Number of Monitor Nodes
The considered environment is implemented in a parallel model. We now discuss the performance of each method in different scenarios with various numbers of monitor nodes. Note that there are no results of PTDBF in this part since PTDBF is a centralized method. Fig. 6(a) shows that both PTDMUS and PTDSky need less computation time if the number of monitor nodes increases. With the minimum checking time mechanism, PTDMUS precludes irrelevant objects more effectively than PTDSky does. Thus, PTDMUS outperforms PTDSky in computation time by almost 60%. Fig. 6(b) indicates that the average total transmission costs between monitor and coordinator nodes in PTDMUS and PTDSky increase as the number of monitor nodes grows. In this simulation, we fix the size of sliding window in each monitor node, so the transmission cost is related to . With the minimum checking time table, PTDMUS can save more than 2,000 units of transmission cost (objects) under various numbers of monitor nodes.
According to the simulation results in Fig. 6(c) and Fig. 6(d), PTDMUS is only slightly worse than PTDSky in both precision and recall in different scenarios with different numbers of monitor nodes. Again, such a tiny performance gap can be regarded as a tolerable error for most applications. In summary, in comparison with PTDSky, PTDMUS reduces the average computation time and the transmission cost significantly while maintaining nearly identical precision and recall.
VI-D Size of Sliding Window
In this part, we discuss the effect of different sizes of sliding windows on monitor nodes. Fig. 7(a) shows that all the methods have better computation time performance when is relatively small () or large (). For each monitor node, the sliding window can be regarded as a buffer that holds the data objects that need to be processed. In general, the computation cost will increase as the size of sliding window grows. However, Fig. 7(a) shows that the computation time of all the methods is significantly reduced if . The reason is that the coordinator node records a large candidate set and the upper bound of its size is . In the case of , the coordinator node will record 9,600 data objects at most, which approaches the size of the given data set . According to (8), all the methods only need to execute runs (slots) and each monitor node deals with only one new input object at every time slot. In such a scenario, the likelihood that a new input object becomes a candidate object is very low. Thus, the computation cost of updating the global candidate on the coordinator node also significantly decreases. In the case of , the reason for all the compared methods to have a good computation cost is that the small makes the computation of each run (slot) very fast. In summary, PTDMUS has the best performance on computation cost in all the considered cases with different sizes of sliding windows. As shown in Fig. 7(b), PTDMUS needs lower transmission cost than PTDSky does in all the scenarios with different sizes of sliding windows. PTDSky has a similar performance to PTDMUS in transmission cost only when , but PTDMUS is still better.
Fig. 7(c) and Fig. 7(d) show that both PTDMUS and PTDSky achieve 99.998% precision and recall when is 720 or 960. If , the performance gap between PTDMUS and PTDSky is smaller than 0.01% in terms of precision and recall. If , PTDMUS loses about 0.126% precision and 9.8% recall in comparison with PTDSky. However, such a high precision (99.87%) provided by PTDMUS is still acceptable for most applications except for financial and emergency services.
VI-E Value of k
Since we consider the top- dominating query, various values of may affect the performance. Fig. 8(a) shows that the performance of all the methods in terms of computation time is independent of the value of . In the case of a high dimensional data set (), PTDSky is slightly worse than PTDBF since the coordinator node wastes too much computation time in computing the global probabilistic threshold-based -skyband with too many irrelevant objects. A similar result has been presented in Fig. 5(a). Conversely, PTDMUS improves computation time by more than 80% compared with PTDSky and PTDBF. From Fig. 8(b), we can observe that the transmission cost is also independent of the value of for both PTDMUS and PTDSky. The reason is that both PTDMUS and PTDSky use a threshold to preclude the irrelevant objects, where is much smaller than . In addition, PTDMUS can save almost 20% transmission cost due to the usage of the minimum checking time.
Fig. 8(c) and Fig. 8(d) show that PTDMUS has the same trend in precision and recall. The precision and recall of PTDMUS slightly increase as the value of increases. When , PTDMUS can achieve precision and recall. Although PTDSky also has the same precision and recall, PTDSky performs better than PTDMUS in precision and recall as the value of becomes smaller. PTDSky achieves 100% precision and recall when . If , PTDMUS will achieve better precision and recall than PTDSky does.
VI-F Margin of Uncertainty
Finally, we discuss the effect of an object’s margin of uncertainty , which is also called the object size. In general, the MBR of an uncertain data object becomes large as increases and thus the occurrence of partial dominance will increase. As shown in Fig. 9(a), PTDMUS has the best computation time performance, with a 60% improvement in comparison with PTDSky and PTDBF. PTDSky and PTDBF have similar computation time for . When , PTDBF becomes the worst one due to the frequent occurrence of partial dominance. Fig. 9(b) shows that the size of the candidate set is independent of the margin of uncertainty. Thus, the transmission costs of PTDMUS and PTDSky are not affected by the margin of uncertainty.
According to the results in Fig. 9(c) and Fig. 9(d), the precision and recall of the query result provided by PTDMUS linearly increase as the margin of uncertainty increases. On the other hand, the precision and recall of PTDSky’s query result increase more significantly as becomes larger. PTDSky can provide the query result with higher precision and recall only if . For , PTDMUS has better precision and recall than PTDSky by more than 0.0025%.
VII Conclusion
In this paper, we have presented a new approach for Probabilistic Top- Dominating query over Multiple Uncertain data Streams (PTDMUS) to improve the computation efficiency of probabilistic top- dominating query for Edge-IoT applications. With the parallelism, the monitor nodes use the value of and threshold-based probabilistic -skyband to preclude most of the irrelevant objects in advance, thereby significantly reducing transmission cost. The coordinator node caches the temporary result and uses the proposed approach, minimum checking time, to reduce the frequency of computing the dominant score of each object in the cache table. Such a way can effectively minimize the computation time and incrementally update the result of the probabilistic top- dominating query with less update frequency. The simulation results show that PTDMUS can improve the computation performance effectively, while keeping good precision and recall of result.
In the future, we are going to apply PTDMUS to mobile edge computing frameworks for making the multi-criteria decision on the dynamic placement of drone base stations, thus providing reliable communication services for specific purposes and scenarios.
Acknowledgment
This research is partially supported by Ministry of Science and Technology under the Grant MOST 107-2221-E-027-099-MY2 and MOST 108-2634-F-009-006- through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan.
References
- [1] G. Tanganelli, C. Vallati, and E. Mingozzi, “Edge-centric distributed discovery and access in the internet of things,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 425–438, Feb. 2018.
- [2] A. Zanella, N. Bui, A. Castellani, L. Vangelista, and M. Zorzi, “Internet of things for smart cities,” IEEE Internet of Things Journal, vol. 1, no. 1, pp. 22–32, Feb. 2014.
- [3] C. M. Huang, C. H. Shao, S. z. Xu, and H. Zhou, “The social internet of thing (s-iot)-based mobile group handoff architecture and schemes for proximity service,” IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 3, pp. 425–437, Jul. 2017.
- [4] H. El-Sayed, S. Sankar, M. Prasad, D. Puthal, A. Gupta, M. Mohanty, and C. Lin, “Edge of things: The big picture on the integration of edge, iot and the cloud in a distributed computing environment,” IEEE Access, vol. 6, pp. 1706–1717, 2018.
- [5] J. Pan and J. McElhannon, “Future edge cloud and edge computing for internet of things applications,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 439–449, Feb. 2018.
- [6] K. Gai, M. Qiu, H. Zhao, L. Tao, and Z. Zong, “Dynamic energy-aware cloudlet-based mobile cloud computing model for green computing,” Journal of Network and Computer Applications, vol. 59, pp. 46–54, Jan. 2016.
- [7] K. Gai, M. Qiu, and H. Zhao, “Energy-aware task assignment for mobile cyber-enabled applications in heterogeneous cloud computing,” Journal of Parallel and Distributed Computing, vol. 111, pp. 126–135, Jan. 2018.
- [8] K. Gai and M. Qiu, “Reinforcement learning-based content-centric services in mobile sensing,” IEEE Network, vol. 32, no. 4, pp. 34–39, Jul. 2018.
