Speeding-up the Verification Phase of Set Similarity Joins in the GPGPU paradigm
Christos Bellas, Anastasios Gounaris

TL;DR
This paper presents a GPU-accelerated approach to speed up the verification phase of set similarity joins, achieving up to 2.6 times faster performance by optimizing data handling and parallel processing.
Contribution
It introduces a novel GPU-based verification method for set similarity joins that fully overlaps with CPU tasks, significantly improving speed.
Findings
Achieved up to 2.6X speed-up in verification phase
Optimized data serialization and thread management for GPU
Demonstrated maximum potential of GPU acceleration in experiments
Abstract
We investigate the problem of exact set similarity joins using a co-process CPU-GPU scheme. The state-of-the-art CPU solutions split the wok in two main phases. First, filtering and index building takes place to reduce the candidate sets to be compared as much as possible; then the pairs are compared to verify whether they should become part of the result. We investigate in-depth solutions for transferring the second, so-called verification phase, to the GPU addressing several challenges regarding the data serialization and layout, the thread management and the techniques to compare sets of tokens. Using real datasets, we provide concrete experimental proofs that our solutions have reached their maximum potential, since they totally overlap verification with CPU tasks, and manage to yield significant speed-ups, up to 2.6X in our cases.
| Similarity function | Definition | Equivalent Overlap |
|---|---|---|
| Jaccard | ||
| Cosine | ||
| Dice | ||
| Overlap |
| , | Collections of sets to be joined |
|---|---|
| (resp.) | a token set from (resp. ) |
| Similarity threshold | |
| Set of candidate pairs | |
| Device output | |
| , | Token arrays |
| , , | Offset arrays |
| , ,, | Size of arrays in bytes |
| , , | |
| The host thread, | |
| Number of device threads | |
| Thread block size | |
| Host memory | |
| Device memory | |
| Device memory for candidate pairs |
| Dataset | Cardinality | Average set size | # diff tokens |
|---|---|---|---|
| AOL | 3 | ||
| BMS-POS | 6.5 | 1657 | |
| DBLP* | 88 | 7205 | |
| ENRON | 135 | ||
| KOSARAK | 8 | ||
| LIVEJOURNAL | 36.5 | ||
| ORKUT | 120 |
| join | index/filtering | verification | |||
|---|---|---|---|---|---|
| filtering | serialization | ||||
| 0.95 | 233 | 134 | 96 | 92 | 72.7GB |
| 0.9 | 2815 | 1892 | 921 | 698 | 0.56TB |
| 0.85 | 11367 | 8935 | 2430 | 2311 | 1.8TB |
| 0.95 | 0.90 | 0.85 | 0.80 | 0.75 | 0.7 | 0.65 | 0.6 | 0.55 | 0.5 | |
|---|---|---|---|---|---|---|---|---|---|---|
| ALL | 1 | 1 | 0 | 1 | 2 | 3 | 4 | 4 | 5 | 3 |
| PPJ | 2 | 2 | 2 | 4 | 4 | 3 | 2 | 2 | 1 | 3 |
| GRP | 4 | 4 | 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Management and Algorithms · Time Series Analysis and Forecasting · Anomaly Detection Techniques and Applications
Speeding-up the Verification Phase of Set Similarity Joins in the GPGPU paradigm
Christos Bellas, Anastasios Gounaris C. Bellas and A. Gounaris are with the Aristotle University of Thessaloniki, Greece.
Email: chribell, [email protected]
Abstract
We investigate the problem of exact set similarity joins using a co-process CPU-GPU scheme. The state-of-the-art CPU solutions split the wok in two main phases. First, filtering and index building takes place to reduce the candidate sets to be compared as much as possible; then the pairs are compared to verify whether they should become part of the result. We investigate in-depth solutions for transferring the second, so-called verification phase, to the GPU addressing several challenges regarding the data serialization and layout, the thread management and the techniques to compare sets of tokens. Using real datasets, we provide concrete experimental proofs that our solutions have reached their maximum potential, since they totally overlap verification with CPU tasks, and manage to yield significant speed-ups, up to 2.6X in our cases.
Index Terms:
Graphics processors, Parallelism and concurrency
1 Introduction
Given two collections of sets and a threshold, set similarity join is the operation of computing all pairs the overlap of which exceeds the given threshold. Similarity joins are used in a range of applications, such as plagiarism detection, web crawling, clustering and data mining and have been the subject of extensive research recently, e.g., [17, 3, 6, 11, 23].
In very large datasets, finding similar sets is not trivial. Due to the inherent quadratic complexity, a set similarity join between even medium sized datasets can take hours to complete on a single machine111E.g., a similarity join over the DBLP dataset using a Jaccard threshold of 0.85 takes 8.5 hours approx.. In addition, challenges like high dimensionality, sparsity, unknown data distribution and expensive evaluation arise. To tackle scalability challenges, two main and complementary approaches have been followed. Firstly, to devise sophisticated techniques, which safely prune pairs that cannot meet the threshold as early as possible, typically through simple computations related to the prefix and the suffix of the ordered sets, e.g. [17, 11]. Secondly, to benefit from massive parallelism offered by the MapReduce framework, e.g., [2, 24, 18, 23].
In this work, we explore a third direction, namely to couple techniques of the first approach with the massive parallelism offered by modern graphics cards. Modern GPUs offer a high-parallel environment at low cost. As a result, general-purpose computing on graphics processing units (GPGPU) has been introduced [13]. In general, GPGPU takes advantage of the different and complementary characteristics offered by CPUs and GPUs to improve performance. It has been employed in domains like deep learning, bioinformatics, numerical analytics and many others. However, implementing existing algorithms and techniques on the GPU requires in depth knowledge of the hardware and is often counter-intuitive. In addition, not all tasks are suitable for GPU-side processing. A traditional CPU surpasses at complex branching in application logic, while a GPU is superior at mass parallel execution of simple tasks and floating point operations [19].
To date, the problem of exact set similarity on GPUs has not been addressed apart from [22]; however, the proposal in [22] does not go into the implementation details, which is our focus. In general, in the literature, there exist several proposals for approximate set similarity or for similar problems, such as nearest neighbor search, e.g., [15, 8, 12] (the detailed discussion of related work is deferred to Section 6). Therefore, there is a gap in detailed investigation of exact similarity joins in GPUs. The main contribution of this work is to fill this gap and propose efficient solutions after thoroughly investigating several design alternatives.
Our work incorporates a co-process scheme between CPU and GPU in order to efficiently compute set similarity join. In our scheme, CPU remains responsible for index building and initial pruning of candidate pairs, whereas the GPU computes the overlap of all remaining pairs. This however leads to limited maximum speed-ups because of the Amdahl’s law. Moreover, the GPU part comes with several challenges regarding the data serialization and layout, the thread management and the techniques to compare sets of tokens. In this work, we address all challenges and show that we manage to achieve speed-ups up to 2.6X; moreover, we show that our scheme has reached its maximum potential in the sense that it annihilates the impact of GPU tasks on the running time. In summary, the technical contributions of our work are as follows:
- •
We provide a detailed implementation analysis and we propose alternatives that differ in the workload allocated to each GPU thread. We use the CUDA programming model, which is proprietary to NVIDIA [14] but widespread in practice. 222the code is publicly available from https://github.com/chribell/gpussjoin
- •
We conduct extensive performance analysis on seven real world datasets. We compare our findings to the state-of-art CPU implementations and point out a number of optimizations to further increase the performance.
- •
We provide evidence that in settings where the candidate pairs are tens of billions or more, our solutions reach their maximum potential, since they manage to fully hide the impact of GPU tasks on the running time due to overlapped execution with the CPU.
Paper outline. Next, we provide details on CUDA programming model and the main CPU-based approaches to set similarity joins. We present our solutions and the design alternatives involved in Section 3. Technical details are in Section 4. The experimental results are in Section 5 and the related work in Section 6. Finally, we conclude in Section 7.
2 Background
We provide a comprehensive overview of the CUDA programming model and explain its main concepts. We also introduce the filter-verification framework used by state-of-the-art main memory set similarity join algorithms in line with the comparison work conducted by Mann et al [17].
2.1 CUDA Overview
In CUDA terminology, CPU and the main memory are referred to as host, while the GPU and its own memory are referred to as device. In this work, we use the terms CPU and host (resp. GPU and device) interchangeably.
Architecture. CUDA-enabled GPUs have many cores called Streaming Processors (SPs), which are divided into groups called Streaming Multiprocessors (SMs). Each SM includes also other units such as ALUs, instruction units, memory caches for load/store operations, and follows the Single Instruction Multiple Data (SIMD) parallel processing paradigm.
Thread Organization. Threads are organized in logical blocks called thread blocks. A thread block is scheduled and executed in its entirety on a SM in groups of 32 threads called warps. Threads within a warp are called lanes and share the same instruction counter, thus they are executed simultaneously in a SIMD manner. There are two cases of thread divergence, which degrade performance, namely inter-warp, when concurrent warps run unevenly and intra-warp, when warp lanes take different execution paths. The latter is also simply referred to as warp divergence.
Memory hierarchy. There are several memory types on CUDA-enabled GPUs. They are divided into on-chip and off-chip ones. Off-chip memories include the global, constant, texture and local memory. The global memory is the largest (in the order of GBs) but slowest memory. Data transferred from the host to device resides in global memory and it is visible to all threads. The constant memory is read-only and much smaller (in the order of KBs). It is used for short access times on immutable data throughout the execution. The texture memory is essentially a read-only global memory and is preferred when 2-dimensional spatial locality occurs in memory access patterns. The local memory is part of the global memory and is used when the registers needed for a thread are fully occupied or cannot hold the required data. This is called register spilling. On-chip memories include the caches, the shared memory and the registers. For data reuse, there are caches per SM and the L2 cache is shared across all SMs. The shared memory is the second fastest memory type. Each SM has its own shared memory. Data stored in shared memory can be accessed by all threads within the same block, thus threads of the same block are allowed to inter-communicate via shared-memory. The registers are the fastest memory type and contain the instructions of a single thread and the local variables during the lifetime of that thread.
Kernel grid. Every function to be processed in parallel by the GPU is called a kernel. Each kernel is executed by multiple thread blocks, which form the kernel grid. The grid can be regarded as an array of blocks with up to two dimensions. Each block is, in turn, an array of threads with up to three dimensions. CUDA can schedule blocks to run concurrently on a SM depending on the shared memory and registers used per block. Increasing either of these factors can lead to limited concurrent block execution, which results in low occupancy. Occupancy is defined as the ratio of active warps on a SM to the maximum allowed active warps per SM. Maximizing occupancy is a good heuristic approach but it does not always guarantee performance gain. On the contrary, maintaining high warp execution efficiency, i.e. the average percentage of active threads in each executed warp, is a more robust approach for data-management tasks, as shown in executing generic theta-joins on GPUs [4].
2.2 Set Similarity joins
The state-of-the-art main memory set-similarity algorithms conform to a common filter-verification framework, as explained in [17], in which seven key representatives333AllPairs [3], PPJoin and PPJoin+ [29], MPJoin [21], MPJoin-PEL [16], AdaptJoin [25] and GroupJoin [6]. are compared using real world datasets. The common idea behind all these algorithms is (i) to avoid comparing all possible set pairs by applying filtering techniques on preprocessed data to prune as much candidate pairs as possible; and (ii) then to proceed to the actual verification of the remaining candidates. We summarize the key points of the work of [17] that are relevant to our research below.
2.2.1 Data layout.
Every dataset is a collection of multiple sets. Each set consists of elements called tokens. The data preprocessing phase involves a tokenization technique and deduplication of tokens. As a result of such a preprocessing phase, all the tokens of a set are unique. The input data tokens are represented by integers and are sorted by their frequency in increasing order, so that infrequent tokens appear first in a set. The sets of a collection are sorted first by their size and then lexicographically within each block of sets of equal size.
2.2.2 Set Similarity functions.
To measure the similarity between sets, Jaccard, Dice and Cosine normalized similarity functions are typically used. The given normalized threshold is translated to an equivalent overlap , which defines the minimum number of tokens that need to be shared between two sets to satisfy the threshold (see Table I). For example, if the Jaccard similarity threshold of two 10-token sets is set to 0.8, this is translated to an overlap threshold of 9 tokens that need to be shared.
2.2.3 Filters.
The most widely used filter, called prefix-filter, exploits the given threshold and similarity function by examining only two subsets, one from each set in the candidate pair, and discards the pair if there is no overlap between the subsets. For example, in Figure 1(a), there is no overlap between the respective set prefixes, thus, even if there is an overlap on the remaining tokens, any overlap threshold set to 4 or higher cannot be reached, and in such cases, the candidate pair can be safely pruned.
Another filter, noted as length filter, takes advantage of the normalized similarity functions dependency on set size. Hence, a candidate pair can be pruned if the set size inequality is not satisfied. In Figure 1(b), if , the shown candidate pair can be pruned despite the prefix equality because a 6-token set requires a set of size .
The last filter used in the examined algorithms is the positional filter. Given the first match position, it evaluates if a candidate pair can reach the similarity threshold. As an example, in Figure 1(c), if the threshold implies that at least 6 tokens should be shared, the pair is pruned since the remaining tokens from set are not enough to reach the similarity threshold.
2.2.4 Algorithm outline.
The set similarity join operation is achieved by executing an index nested loop join consisting of three steps. First, through an index lookup and a length filter application, a preliminary candidate set (pre-candidates) is generated. In the second step, pre-candidates are deduplicated and filtered. The pairs that pass all filters form the final candidates. These two steps compose the filtering phase. In the third and final step, also noted as verification in the literature, the similarity score for each of the remaining candidate pair is computed and if it exceeds the threshold, the pair is added to the output result.
A special case is where the set similarity join is a self-join using only a single collection of sets. In that case, a token set is first probed against the current index contents and then added to the index itself. This allows for incremental index building that is interleaved with verification. Also, the fact that a set that probes the index is always no shorter than the current indexed sets can be leveraged to speed-up verifications.
3 Our Approach
Let and be two collections of token sets. The set-similarity problem assumes that there exist a similarity function and a user-defined threshold . It is formally defined as follows.
Definition 1**.**
Problem Definition of set similarity joins: Find all pairs ( such that .
In a naive solution, the set of candidate pairs to be checked in the verification phase is all pairs . However, due to the filtering phase, for thresholds not close to 0, is typically a small subset of the cartesian product. The output of the verification is denoted as .
In our solution, we assume that the host (resp. device) is equipped with (resp. ) memory capacity. The host runs 3 threads (), while the device executes threads in blocks of size . The collections of token sets are transferred in the device main memory in a linearized form, denoted as and , and are accompanied by offset arrays and in order to distinguish the set boundaries. Table II summarizes the notation.
3.1 Main Rationale
As already explained, the algorithms solving the set similarity problem efficiently conform to the filter-verification framework. The filtering phase involves probing index structures such as a hashtable. Although there has been some work on implementing hashtables and inverted lists on GPUs [1], which can be employed in exact set similarity joins as in [22], we choose this phase to remain a CPU task. On the other hand, the verification phase is more suitable for parallelization, as it involves a merge-like loop, where the overlap of candidate pairs is computed. As soon as the overlap threshold is met or cannot be reached, the operation terminates. True positives must be verified and the necessary overlap is still computed, while the rejection of pairs in this stage leads to less token comparisons without sacrificing accuracy. Even though the average verification runtime is reported as constant for most datasets, employing the GPU for this part with a view to improving the overall performance is the main goal of this work.
In summary, in our approach, we allocate initial indexing and filtering to the CPU and the verification phase to the GPU. Next, we exploit the results of [17], according to which the three algorithms in the skyline, i.e., the most efficient ones are AllPairs, PPJoin and GroupJoin. We experiment with 5 real datasets on a NVIDIA Titan Xp GPU (full details are provided in Section 5). In Figure 2, we demonstrate the percentage of each phase on the total execution time. More specifically, due to the pipelined nature of processing, we present the upper bound of the verification fraction and the lower bound of the filtering one. The key observation of our tests is that indexing and filtering contributes to the total running time significantly. Therefore, due to the Amdahl’s law, employing the GPU for the verification in a ideal setting is expected to yield improvements of several times, but lower of an order of magnitude. Our techniques manage to achieve this (e.g., see Figure 11).
Based on the results both from [17] and our tests, in the following, we focus solely on AllPairs, PPJoin and GroupJoin. The key characteristics of the techniques are summarized below.
AllPairs (ALL). It is the first and most naive main memory algorithm to exploit the given threshold. During the inverted list lookup, it applies the prefix and length filters to prune candidate pairs [3].
PPJoin (PPJ). It extends ALL by applying the positional filter on pre-candidates [29]; therefore its verification phase is less loaded at the expense of a higher overhead during filtering.
GroupJoin (GRP). It is an extension to PPJ. Sets with identical prefix are grouped together. Each group is handled as a single set. Thus GRP has faster indexing, as it discards candidates pairs in batches. During the verification phase, the candidate pairs are expanded [6].
When looking for similar sets in a single collection, which is the most common case, instead of looking for similar pairs between two different collections, the whole join process can be done incrementally, i.e. for a probing set, first its candidates are verified and then the algorithm proceeds to the next set. Naively allocating and copying small chunks of data on the GPU through a different kernel invocation per probing set, would incur an enormous overhead penalty. A more efficient alternative is to copy a large chunk of data stored in linear memory space to the GPU, process it there and copy back the results. Adopting this approach also improves overall runtime as the CPU builds candidate pair collections in waves and feeds them in a non-blocking manner to the GPU, which conducts the verification. Thus, time overlapping between the CPU and GPU tasks can be achieved. More specifically, we propose a multithreaded framework regarding both the host and device tasks. We thoroughly analyze each side below.
Before delving into the details, we need to state our limitations. The most dominant constraining factor is the limited GPU memory. Due to this, the workload must be divided into chunks and the GPU should be invoked several times. The memory limitation mostly relates to the output size. As we have no prior knowledge about the output size, to ensure correctness the most straightforward solution is to allocate enough memory for the worst case, which is .
3.2 Host Tasks
The host side is responsible for the filtering phase and works as the coordinator. Specifically, the host runs three threads. The first thread, , conducts any filtering and builds chunks of candidates. When each chunk is built, the second thread, , noted as device handler, enacts the verification phase by copying the chunk to device memory and launching the kernel code. Meanwhile, continues to build the next chunk of candidates. As soon as the device output is copied back to host memory, the third thread post-process it to form the final pairs result. Note that may not be invoked if an aggregation is performed on top of the join, i.e., if only the count of pairs is needed instead of the actual pairs. In such a case, the device counts the number of pairs and returns the result to . A visual representation of the execution overlap is depicted in Figure 3, where each color corresponds to a different data chunk.
3.3 Device Tasks
The device side is responsible for the verification phase. It is invoked when the host prepares a chunk of candidate pairs. We present our data layout approach and discuss its impact on device memory.
There are two levels of concurrency in CUDA, grid and kernel. Grid level concurrency concerns mostly the overlap between computation and data transfers, while kernel level concurrency refers to how a single task is executed in parallel by many threads [7].
On the grid level, we further divide each input chunk of candidates into smaller chunks, each assigned to a different block. Thus we enhance the overlapping between device computation and host-to-device data transfer. On the kernel level, we summarize our approaches as high-level verification alternatives (described in detail below).
3.3.1 Data Layout
By default, we pass data to device as arrays stored in consecutive memory space. This is preferred because parallel execution benefits from coalesced global memory accesses. However, due to the nature of the problem, divergence in global memory access patterns is unavoidable, therefore the exploitation of on-chip memories is required to alleviate performance bottlenecks.
According to the linear memory layout, a collection is physically implemented as the composition of two arrays: tokens and offsets . The former holds every token of every set in the collection in a sequence, while the latter is used to delimit each set boundaries. Figure 4 depicts how a collection of sets is stored in the device memory. Collection is stored in similar fashion. We transfer any collection of sets once in the beginning of the process.
When the device is invoked to perform the verification phase, the host transfers an array of set IDs, noted as , alongside with an array of offsets () which indicates the candidate pairs to be evaluated. In addition, an array of equal length to , noted as , is allocated on the device and it is used to store the output result. Essentially, is an array of boolean flags where true indicates that the corresponding candidate pair similarity is equal or greater than the given threshold. Figure 5 shows the mapping of probing sets to their candidate pairs and the output result array (e.g. only (, ) pair is similar). Every even index position in corresponds to a probe set id and every odd one to a candidates offset. In the figure, since and , should be compared against only the first entry of . Since and , does not participate in any candidate pair. Finally, since and , should be verified against the next two sets in .
Memory restrictions. In our data layout, most memory space is required to store the array of candidate set IDs , which is of quadratic space complexity. Depending on the dataset and threshold value, the overall number of candidate pairs could reach billions. Considering that a set id is a 4-byte integer, space could be of several gigabytes. In addition, space is also required to store the output result (assuming that 0 and 1 take a byte rather than a bit). Due to the limited device memory, the host iteratively transfers as many candidates as the device memory can handle (so that both candidate set IDs and output fit into the global memory). This favors the overlap between both ends, as the host builds the next chunk of candidate pairs and the device conducts the verification in parallel.
3.3.2 Verification alternatives
We define three main kernel-level concurrency layers or dimensions of our problem as follows: grid layout (GL), which corresponds to thread execution; memory hierarchy (MH), which corresponds to efficient exploitation of the fastest on-chip memories; and last output writing (OW), which deals with result output. The latter is also distinguished into two cases depending on the type of querying being performed: output count (OC) for an aggregate query and output select (OS) for a full select of the similar pairs query.
These layers are tightly coupled and often intertwined, which means that certain options on a layer can rule out available options on the next ones. To investigate the impact of each available layer option, we present three alternative scenarios below, which differ in the workload assigned to a single thread, as shown in Figure 6.
Alternative A. In our first alternative, the workload we assign to each thread is a probing set and the evaluation of all its corresponding candidate pairs, i.e., a single thread becomes responsible for the verification of all candidate pairs involving a specific probing set.
GL: We launch a 1D grid of 1D blocks, with the overall number of threads executed across all blocks () being equal to the input set collection size (). Each thread is responsible for a probing set and conducts all the joins with the corresponding candidates .
MH: In this alternative, we do not use the shared memory. An option could be each thread to load the corresponding probing set to the shared memory, and then to access every candidate set from global memory. Thus, a thread does not access global memory for during the verification of each candidate pair. Since there is no fixed set size, blocks which handle sets of thousand tokens require larger amount of shared memory. For example, a thread block of 32 threads and average probing set size equal to 1000, would require KB of shared memory which exceeds the maximum of 48KB that modern GPUs can support. In such cases, an adaptive approach must be followed where the thread block size is limited to allow for proper execution. In summary, the option to load in shared memory implicitly defines the block size and gives rise to further challenges; thus is avoided in this work.
OC: As every thread verifies its own candidate pairs independently, it can also count the amount of pairs satisfying the threshold using a register. After finishing the verification, each counter can be stored in shared memory in order for a fast reduction on block level to be performed. The result of each block is then stored in global memory for a grid level reduction to output the global count. The amount of shared memory required depends on the block size, but it is small (e.g., for 32 threads per block we need 128 bytes).
OS: Having allocated the memory required for output array , a device thread updates specific cells of the array. Incorporating the shared memory in this output is not straightforward because each thread does not know beforehand the length of its output pairs. Hence, we have to allocate shared memory for the worst case scenario per thread. However this is practically impossible with the current GL. For example, if a block has 100000 candidate pairs to verify, we have to allocate 100KB of shared memory, which exceeds the maximum allowed space. As a result, the shared memory cannot be employed to speed-up the output generation.
Alternative B. In this technique, we allocate less work to each thread by shifting the workload of a single probing set from a single thread to a single thread block. By assigning the comparisons referring to a probing set to a thread block, threads evaluate only a portion of the candidate pairs in parallel. The main benefit of this alternative is that the workload of threads within a block is more evenly distributed.
GL: We launch a 1D grid of 1D blocks, with the number of blocks being equal to the input set collection size. Each thread block is responsible for a probing set and each thread is assigned with a portion of candidate pairs to verify.
MH: First, the block threads load the corresponding probing set to shared memory, then each thread verifies a portion of candidate pairs by accessing the corresponding candidate sets from global memory. Because we use shared memory for one probing set per block, unlike alternative A, the maximum supported probing set size also increases.
OC/OS: Same as Alternative A.
Alternative C. In our previous scenario, we try to improve performance on the warp level. We further extend the rationale of alternative B, and more specifically, each block is assigned with a probing set but with the difference that all the block threads cooperate to evaluate a candidate pair using the intersect path algorithm proposed in [10]. This further mitigates the problem of balancing, since the threads do not only become responsible for an equal number of candidates, but also perform a roughly equal number of operations.
GL: We launch a 1D grid of 1D blocks, with a number of thread blocks equal to the input set collection size. Each thread block is responsible for a probing set and multiple threads contribute to each candidate pair verification. A control thread outputs the result to global memory.
MH: Extending alternative B, due to thread cooperation, by default we also load candidate sets to shared memory. If there is not enough space to hold all candidate sets, we load data in chunks, perform the verification, and then proceed to the next chunk.
OC: Since all threads within a block cooperate to verify a candidate pair, we assign only a single thread to increment the block’s counter. Thus there is no need for a thread block counter reduction. However, grid level reduction is still required.
OS: The same applies for updating the output array . If a candidate pair meets the threshold, only a single thread updates the corresponding array cell.
4 Implementation Issues
4.1 Host Details
Our framework leverages the work of Mann [17]. The main difference is that we migrate the verification phase from the host to the device side via a multi-threaded implementation, which gives rise to the issues discussed in this section.
4.1.1 Candidate Serialization
The need to transfer candidate pairs to device memory highlights the necessity of efficient serialization methods. Our goal is for the device to avoid complex global memory accesses. Therefore, the host is responsible for storing the candidates of a probe set in successive memory addresses. We list our options for serializing candidates as follows:
Use a sequence container such as std::vector and push back every new candidate. The main drawback is the extra memory checks on insertion to determine if a reallocation is required. 2. 2.
Prior reserve memory space for std::vector to avoid memory checks. 3. 3.
Use primitive arrays and handle memory operations manually. 4. 4.
Use a map structure where a key is an integer, i.e. the probe set ID, and its value is a std::vector containing the corresponding candidates IDs.
As a complement to , a separate array to delimit candidate pairs is required. Moreover, the tokens that we insert to are pairs consisting of a probe set ID and its corresponding offset on . Omitting the probe set ID and inserting the candidate offset by itself, thus reducing the size, implies that the probe set ID should be capable to be extracted from the index of . For that to be possible, continuous probe sets IDs in ascending order must be processed, which might not always be the case.
The use of map is an intermediate stage to group together in memory probe set candidates. Before invoking the device, we must iterate the map and serialize every candidates list to construct the final , and also update per iteration.
As will be shown in our experiments, primitive arrays, i.e., the third option listed above, perform better than std::vector, i.e., the first two options, and are adequate for ALL and PPJoin that produce the full candidate set for a single probing set in a single phase. For GRP, which employs two phases during candidate generation, a map structure is necessary if the full verification phase is delegated to the GPU.
4.1.2 Thread & Memory Management
As stated in Section 3, our framework consists of three threads. The indexing/filtering thread reserves memory space beforehand and serializes candidates. When the device maximum memory for candidate pairs is filled, triggers , the device handler thread, and assigns a pointer to the current candidates to it. In parallel, allocates the required device memory space, copies the candidates array to device and launches the join kernel code. Meanwhile, continues to build the next chunk of candidates. As soon as the join kernel finishes, if an aggregation on top of the join is requested, launches immediately a separate kernel to perform a count reduction. In case of output the actual pairs, starts the , which post-process . When finishes, the memory reserved for the respective candidates is freed. The same steps are repeated until no candidate pairs are left to be verified.
4.1.3 GroupJoin Work Split
The three best-performing algorithms examined can be divided into two categories: those which generate every candidate pair in one phase, i.e. ALL and PPJ, and the one, GRP, which requires an extra phase (group expanding) to output all candidate pairs.
For each probe set in ALL and PPJ, candidates are guaranteed to be stored in successive memory addresses. Thus, we serialize candidates using primitive arrays. In contrast, GRP generates candidates that are intertwined due to the expanding phase. Therefore storing candidates in a map structure is required. However, in our testings serializing the map adds extra overhead, rendering this option unfeasible.
We choose to split the work for GRP, and for this technique, allocate part of the verification to the CPU as well. We assign the verification of every candidate pair generated in the first phase to the device. Hence, we serialize candidates from this phase in primitive arrays and transfer them to the device. Every candidate pair that emerges from the second phase, i.e. group expanding, is left to be verified in the host side by .
By splitting the work, we alleviate overall performance as shown in Section 5. In spite of this gain, transferring the whole GRP verification workload to device remains a challenge and would further improve the performance.
4.2 Device Details
4.2.1 Block size
We launch grids composed of 1D blocks. The block size must be a power of 2 for reduction on the shared memory to work properly. We prefer since a warp can be considered as the CPU thread equivalent. However in our evaluation with different block sizes, we show a correlation between block size and set size, and increasing the block size should be considered when the third verification alternative is the best performing one.
4.2.2 Merge and Intersect Path
Computing a list intersection can be derived from a list merge operation. An efficient parallel merging algorithm for GPUs is Merge Path[9]. Given two sorted lists and , Merge Path considers the order in which elements are merged, which is equivalent to the traversal of a grid, noted as Merge Matrix, of size . Beginning from the top left corner of the grid, the path can only move to the right if or downwards otherwise, until it eventually reaches the bottom right corner.
There are two partitioning stages, one on kernel grid level and the other on block level. On grid level, equidistant cross diagonals are placed on the Merge Matrix. Using binary search, the point of intersection for a cross diagonal and the path is found. As a result, each SM is assigned to merge non-overlapping portions of the input. On block level, threads cooperate in loading the required list portions on shared memory and then merge them in global memory.
By modifying Merge Path in [10], the authors propose a fast list intersection algorithm, called Intersect Path. They introduce a new diagonal path move, if . The same partitioning stages still hold. Each SM outputs a portion of the intersection to global memory.
Because in our approach thread blocks verify whole candidate pairs, we have to modify Intersect Path accordingly. Thus we perform both partitioning stages on block level.
4.2.3 Set Intersection Count
In order to verify a candidate pair, threads must calculate the intersection of two sets. Each thread in alternatives A and B, independently performs a merge-like loop and counts the number of intersects. On the other hand, in alternative C, threads collaborate to output the intersection. Since our problem can be reduced to finding the intersection count of two sets, we use a modified Intersect Path algorithm to divide the workload between block threads, as mentioned above.
Cross diagonals are equally placed apart at [10]. Each thread is assigned with a partition with starting point the intersection of the path and the corresponding diagonal. In Figure 7, we show an example of Intersect Path using threads (each color corresponds to a different thread). Starting from the top left corner, cross diagonals are placed apart 5-hops away, where hops are along the axes. Each thread calculates independently its partition intersection count and stores it in registers memory. When finished, intersection counts are copied in shared memory for a fast reduction to output the overall intersection.
4.2.4 Count Reduction
Our primary focus is to minimize the global memory transactions. Whenever possible, we use the registers to store counts per thread. Similarly, we store every thread count to shared memory and use warp shuffle functions for fast reduction. Hence, when performing an aggregation on top of the join, only a single write to global memory is required per block. To output the final count, we use thrust::reduce on the global memory-resident intermediate counts.
5 Evaluation
The goals of our experiments are threefold: (i) to show the extent to which the verification phase delegated to GPU is hidden (overlapped) by the index and filtering phases running on the CPU; (ii) to give concrete evidence about the speed-ups achieved in practice, and (iii) to provide explanations about the observed behavior.
5.1 Experiment setting
The experiments were conducted on a machine with an Intel i7 5820k clocked at 3.3GHz, 32 GB RAM at 2400MHz and an NVIDIA Titan XP. This GPU has 3840 CUDA cores, 12 GB of global memory and a 384-bit memory bus width.
The overall runtime, noted as join time, is the composition of candidate generation and serialization performed by host thread, and the verification conducted by the device. We refer to the former as index/filtering time and to the latter as verification time. We do not include any data preprocessing time spent for tokenization and de-duplication, which are perfomed exactly as in [17]. We conduct experiments for all datasets using ten thresholds in the range [0.5, 0.95]. We focus on self-joins using the Jaccard similarity and perform an aggregation on top of the set similarity join. The reported time for each experiment is an average over 3 independent runs (no significant deviation was observed). We measure the index/filtering and total join time with the std::chrono library. For the verification time we use the CUDA event API. The times for allocating device memory and transferring chunks of candidates to the device were negligible for all the experiments and furthermore were completely hidden due to the execution overlap.
We experiment with seven real world datasets that were also employed in [17]. Table III shows an overview of each dataset characteristics. Some datasets follow a Zipf-like distribution of set sizes, as shown in Figure 8, but in general, the distribution types differ. A summary of each dataset (adapted from [17]) is as follows:
AOL: query log data from the AOL search engine. Each set represents a query string and its tokens are search terms.
BMS-POL: purchase data from an e-shop. Each set represents a purchase and its tokens are product categories in that purchase.
DBLP: article data from DBLP bibliography. Each set represents a publication and its tokens are character -grams of the respective concatenated title and author strings.
ENRON: real e-mail data. Each set represents an e-mail and its tokens are words from either the subject or the body field.
KOSARAK: click-stream data from a Hungarian on-line news portal. Each set represents a user behavior and its tokens are links clicked by that user.
LIVEJOURNAL: social media data from LiveJournal. Each set represents a user and its tokens are interests of that user.
ORKUT: social media data form ORKUT network. Each set represents a user and its tokens are group memberships of that user.
5.2 Main Experiments
We compare our hybrid framework to the CPU standalone implementation of Mann[17]. In Figure 9, we show the speed-ups that can be achieved during the verification phase. These speed-ups can be up to more than 5X in our experiments. Nevertheless, in practice, it is more important to investigate the total response time, which inludes both the filtering/indexing and the verification phase.
In Figure 10, we present the best join times measured for both. Each time reported for the CPU is the overall best of the three algorithms execution and therefore the best we can achieve in our setup. Respectively for the GPU, each time reported is the best among unoptimized executions (, GB) of the three algorithms and the three alternatives. Thus, performance can be further improved, especially on datasets such as ENRON (Figure 10(d)) and ORKUT (Figure 10(g)) where the GPU performs similar or worse than the CPU.
As shown in Figure 10, for every dataset on large thresholds, i.e. the threshold range is in [0.7, 0.95], the GPU does not yield any performance speedup. Given the fact that in large thresholds the number of candidate pairs, hence the memory required to store them, is quite smaller than the device memory , the GPU remains idle during the candidate generation phase, only to be invoked once before the process finishes. On the other hand, billions of candidate pairs are generated when using smaller thresholds ([0.5, 0.65]). As a result, the GPU is invoked several times leading to an execution overlap and therefore to faster join times.
To further support this conclusion, we run our techniques over larger DBLP datasets, as shown in Figure 11 (where we present settings in which CPU takes up to an hour approximately). We can see that the speed-ups are much more evident, e.g., 2.6X in Figure 11(d), and are tangible even for larger thresholds, where the candidate pairs are tens of billions. In Figure 11(c), the candidates are 3.5B, 17B and 77B for thresholds 0.9, 0.8, and 0.7, respectively, and there are clear benefits for the last two settings.
Further, we drill-down on the GPU join time, as shown in Table IV, where it is shown that the GPU join time is solely attributed to the index/filtering time; the join time is roughly equal to the sum of filtering, index building and serialization. This means that our GPGPU scheme that assigns verification only to the GPU has reached its maximum potential.
Key result: Our GPGPU solution manages to hide the impact of verification phase on the running time of the similarity join, when (i) the candidates are in the order of tens of billions at least and (ii) the index/filtering and verification phases are intertwined.
5.3 Algorithm Performance
Table V shows which algorithm exhibited the best performance in Figure 10 on the GPU. We can observe, that as in the CPU comparison in [17], there is no algorithm that dominates the others. However, ALL favors low thresholds, PPJ mid-range and GRP the high ones, especially for the datasets with small average size.
Next, we discuss algorithm behavior issues in relation to the three datasets, where GPU does not show tangible benefits in Figure 10 even for relatively low thresholds, namely ENRON, KOSARAK and ORKUT.
5.3.1 Execution overlap and the role of
The most significant gain due to the device stems from the execution overlap between indexing and verification. ALL invokes the device earlier and more frequently because of its fast candidate generation on all datasets. PPJ and GRP have higher index times, generate less candidates and as a result invoke the device less frequently.
To exploit the execution overlap, and therefore keep the device busy most of the time, defining the appropriate is necessary. We examine the ENRON and ORKUT datasets. For these two datasets, all three algorithms have similar index times. The overall best algorithm for the standalone CPU execution is PPJ since it generates less candidates than ALL in the same indexing time frame and does not include any candidate expanding during the verification phase as GRP does. Figure 12(a) and Figure 12(b) show the join time for the ENRON and ORKUT dataset respectively, using PPJ as the index algorithm. By decreasing , and by fine-tuning as will be shown in the device performance section, we increase overlapped execution between CPU and GPU and manage to hide the verification time in the execution overlap.
5.3.2 GroupJoin Issues
The KOSARAK dataset is more efficiently processed by GRP regardless of the threshold. In Figure 13(a), we compare the join times of the CPU and two GPU executions for the KOSARAK dataset. In the first GPU execution, we use a map structure to delegate the whole verification phase to the device. In the second, we use raw arrays to delegate only the candidates generated in the first phase to the device, thus we split the verification workload between host and device. As it can be seen, the overhead imposed by using a map, which entails numerous memory checks, is not outweighed by faster verification. On the other hand, the drawback of assigning part of the verification to the CPU is that in datasets, such as KOSARAK, where the group expanding yields a larger number of candidates than the first phase, the host is assigned with significantly more verification workload than the device. Therefore, it is expected a GPGPU approach not to yield any benefits for such cases.
Another issue during the expanding phase is that the host iterates and skips candidate pairs that are to be verified on the device. This adds extra overhead. On every dataset, except AOL, BMS and KOSARAK, the group expanding phase generates less candidates than the first phase. Nonetheless, we cannot avoid the candidate skipping overhead. In Figure 13(b), we illustrate its impact on join time. For the DBLP dataset with , the group expanding does not generate any candidate, thus, if we remove the expanding phase, we achieve a performance gain.
5.4 Device Performance
There are two kernel parameters, which require fine tuning depending on the dataset characteristics: (i) the verification alternative to be followed and (ii) the thread block size.
5.4.1 Comparison of our proposed alternatives
The two main differences between the verification alternatives is how threads access global memory and how they calculate the intersection of two sets. Each thread in alternative A accesses its respective candidate pair sets and calculates each intersection. Because of that, intra-warp divergence is maximized since a thread has its own execution path. Alternative B alleviates the performance because the threads of a block collaborate to load the corresponding probing set to shared memory and then, less candidate pairs per thread are verified. However, since each thread independently loads a candidate set and calculates the intersection, intra-warp divergence is still present. In alternative C, each block’s threads collaborate first to load the probing and each candidate set to shared memory, and second, to perform the intersections. Therefore intra-warp divergence is low.
For datasets with small average set size () such as AOL, BMS, KOSARAK, the global memory access footprint is also small, which renders alternatives A and B competitive. As shown in Figure 14(a), both A and B have similar performance for large thresholds, but for small ones alternative B performs better since thousands more thread blocks are launched and thus device occupancy is increased. On the contrary, alternative C seems infeasible for small set sizes, since the overhead to store them in shared memory dominates the verification time. The advantage of alternative C becomes apparent in datasets characterized by larger average set size such as DBLP, ENRON, LIVEJOURNAL and ORKUT. Figure 14(b) shows the performance of the three alternatives for the LIVEJOURNAL dataset. As the number of candidate pairs increases for small thresholds, alternative C performs better since it achieves a higher warp execution efficiency (200% and 33% increase from A and B respectively). In addition, it has lower memory dependency as shown in Figure 14(c). By minimizing the global memory access footprint, alternative C becomes mainly execution dependent. Other dependencies such as instruction fetch, instruction issue and synchronization arise, but they are less expensive than constantly accessing the global memory in a non-optimal way.
Key result: It is more beneficial to employ alternative B for sets of small size, and alternative C otherwise.
5.4.2 Block size
We investigate how the thread block size affects the verification time. For datasets with small average set size, where alternative B is preferred, has the best performance. Increasing in such datasets leads to a higher proportion of inactive warps and hence increases the verification time as shown in Figure 15(a). As the number of average set size in candidate pairs increases, alternative C is preferred to minimize the global memory access bottleneck. Alongside, fine tuning of is also required to achieve best device performance. For example, as shown in Figure 15(b) for the ORKUT dataset, if we assign , each thread receives more workload than optimal. By increasing up to 128, the workload is more evenly distributed since more threads contribute to the join. If we further increase , this leads to a high number of inactive warps, and therefore to low warp execution efficiency.
Key result: Judiciously increasing when alternative C is employed leads to higher performance. When combined with lower , it manages to fully overlap verification and index/filtering phases.
6 Related Work
Although extensive research has been carried out on set similarity join for parallel paradigms, such as MapReduce[24, 18, 23], there are few studies investigating set similarity join on the GPGPU paradigm. An early proposal has appeared in [15], according to which Lieberman et al cast the similarity join operation as a GPU sort-and-search problem. First, they create a set of space-filling curves using bitonic sort on one of the input relations; then, they process each record from the other relation in parallel by executing searches in the space filling curves, using the Minkowski metric for similarity. In [8], the authors employ the parallel-friendly MinHash algorithm to estimate the Jaccard similarity of two sets. Their solution is space-efficient since they only store set signatures instead of whole sets to perform the similarity join. However, due to the MinHash nature (i.e. data partitioning in bins), fine-tuning is required to achieve balance between accuracy (to avoid false positives) and execution time. The main limitations of the above techniques is that they are approximate, whereas we propose solutions to the problem of exact set similarity joins, and we are not inherently limited to Jaccard similarity only.
The only known work to date on exact similaity joisn on GPUs has appeared in [22]. This work allocates both the indexing/filtering and the verification phase on GPUs and exhibits promising results. As an extension of our work, we also aim to investigate design alternatives so that both set similarity join phases are performed on GPUs.
Similarity joins are also discussed in [5], where two nested loop join (NLJ) algorithms are presented: a naive NLJ and a faster index-supported NLJ. The index is created on the CPU side during the preprocessing phase. Both algorithms use the Euclidean distance for similarity and thus they are not suitable for set similarity joins. Nevertheless, our solutions also perform sophisticated CPU-side indexing before the GPU-side processing.
Another problem, which is close to set similarity join and has been studied on the GPGPU paradigm, is similarity (nearest neighbor) search. Examples include [20, 30, 27, 28, 12] but none of them can be applied to our problem
7 Conclusion
This work describes the first thorough ivestigation to date regarding the design alternatives for the verification phase in exact set similarity joins using a GPU. We conform to the established filter-verification framework, and we transfer verification to the GPU. We provide solutions to the issues involved, such as data layout on the device memory, serialization, and thread workload. Using real datasets, we show that we manage to fully overlap the GPU tasks with the CPU ones, when the candidate pairs are at the order of tens of billions of sets. Despite the significant speed-ups reported, in several cases, exact set similarity remains a very expensive task due to the non-parallelized filter phase. The main direction for future work is to explore GPU-tailored solutions for filtering as well in line with works such as [22, 30], building upon also recent advances for CPUs, such as those in [26].
Acknowledgments
The authors gratefully acknowledge the support of NVIDIA Corporation through the donation of the GPU used.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] S. Ashkiani, M. Farach-Colton, and J. D. Owens. A dynamic hash table for the gpu. ar Xiv preprint ar Xiv:1710.11246 , 2017.
- 2[2] R. Baraglia, G. D. F. Morales, and C. Lucchese. Document similarity self-join with mapreduce. In ICDM , pages 731–736, 2010.
- 3[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007 , pages 131–140, 2007.
- 4[4] C. Bellas and A. Gounaris. GPU processing of theta-joins. Concurrency and Computation: Practice and Experience , 29(18), 2017.
- 5[5] C. Böhm, R. Noll, C. Plant, and A. Zherdin. Index-supported similarity join on graphics processors. In BTW , volume 144, pages 57–66, 2009.
- 6[6] P. Bouros, S. Ge, and N. Mamoulis. Spatio-textual similarity joins. PVLDB , 6(1):1–12, 2012.
- 7[7] J. Cheng, M. Grossman, and T. Mc Kercher. Professional Cuda C Programming . John Wiley & Sons, 2014.
- 8[8] M. S. Cruz, Y. Kozawa, T. Amagasa, and H. Kitagawa. Gpu acceleration of set similarity joins. In International Conference on Database and Expert Systems Applications , pages 384–398, 2015.
