G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression
Feng Zhang, Zaifeng Pan, Yanliang Zhou, Jidong Zhai, Xipeng Shen, Onur Mutlu, Xiaoyong Du

TL;DR
G-TADOC is a GPU framework that runs text analytics directly on compressed data, without decompressing it first. It addresses three obstacles to GPU parallelism: heavy data dependencies, synchronization among threads writing to shared results, and sequence maintenance.
Contribution
It introduces a GPU-based framework for text analytics directly on compressed data, built on fine-grained thread-level workload scheduling, thread-safe memory management, and sequence preservation strategies.
Findings
Achieves 31.1x average speedup over state-of-the-art TADOC.
First GPU framework for direct text analytics on compressed data.
Effectively handles dependencies, synchronization, and sequence maintenance.
Abstract
Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, in developing G-TADOC, thousands of GPU threads writing to the same result buffer leads to inconsistency while…
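To make "analytics without decompression" and the dependency problem concrete, here is a minimal sequential sketch in Python. It assumes the compressed text is a grammar in which each rule is a list of words and references to other rules. That representation, the rule names (`R0`, `R1`, `R2`), and both function names are illustrative assumptions, not taken from the abstract, and the sketch does not reproduce G-TADOC's GPU scheduling or memory management. It only shows why a rule cannot be processed until every rule that references it has been processed.

```python
from collections import Counter


def word_count_compressed(rules, root="R0"):
    """Count words in a grammar-compressed text without expanding it.

    `rules` maps a rule name to a list of symbols. A symbol that is a key
    of `rules` is a reference to that rule; any other symbol is a word.
    (Hypothetical format, chosen for illustration only.)
    """
    # Number of references to each rule that have not been accounted for yet.
    pending = {name: 0 for name in rules}
    for body in rules.values():
        for sym in body:
            if sym in rules:
                pending[sym] += 1

    # weight[r] = how many times rule r occurs in the fully expanded text.
    weight = {name: 0 for name in rules}
    weight[root] = 1

    counts = Counter()
    ready = [root]  # rules whose weight is final
    while ready:
        name = ready.pop()
        for sym in rules[name]:
            if sym in rules:
                # Dependency: a sub-rule's weight is final only after all
                # rules that reference it have been visited.
                weight[sym] += weight[name]
                pending[sym] -= 1
                if pending[sym] == 0:
                    ready.append(sym)
            else:
                counts[sym] += weight[name]
    return counts


def expand(rules, name="R0"):
    """Fully decompress a rule into a word list (used only for checking)."""
    words = []
    for sym in rules[name]:
        if sym in rules:
            words.extend(expand(rules, sym))
        else:
            words.append(sym)
    return words


# Hypothetical compressed input: R0 is the root rule.
example_rules = {
    "R0": ["R1", "R1", "R2", "end"],
    "R1": ["a", "R2"],
    "R2": ["b", "c"],
}

if __name__ == "__main__":
    print(word_count_compressed(example_rules))
```

Each rule body is scanned once, however often the rule recurs in the original text, which is where the savings over decompress-then-count come from. The `pending` counters are the dependencies: a parallel version would have to schedule work around them and guard the shared `counts` structure against concurrent writes, the two issues the abstract names first.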
