# ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data

**Authors:** Elias Stehle, Hans-Arno Jacobsen

arXiv: 1905.13415 · 2020-04-16

## TL;DR

This paper introduces ParPaRaw, a GPU-based massively parallel algorithm for parsing delimiter-separated raw data that avoids initial sequential passes, supports complex parsing rules, and achieves high throughput of up to 14.2 GB/s.

## Contribution

The paper presents a flexible, high-performance GPU parsing algorithm that needs no initial sequential pass over the input and supports expressive parsing rules, improving on state-of-the-art methods.

## Key findings

- Achieves parsing rates up to 14.2 GB/s on GPU
- Scales to thousands of cores (evaluated on a GPU with 3584 cores)
- Parses 4.8 GB in 0.44 seconds including data transfer
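As a quick check on the last figure, the implied end-to-end rate is simply volume divided by elapsed time. The variable names below are illustrative only.

```python
# Figures taken from the key findings above.
volume_gb = 4.8    # input size in GB
elapsed_s = 0.44   # end-to-end time in seconds, including data transfers

end_to_end_rate = volume_gb / elapsed_s
print(f"{end_to_end_rate:.1f} GB/s")  # 10.9 GB/s
```

This sits below the 14.2 GB/s peak parsing rate, which is consistent with the end-to-end figure also covering data transfers.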

## Abstract

Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, the compute-intense step often constitutes a major bottleneck in the data ingestion pipeline, since parsing of inputs that require more involved parsing rules is challenging to parallelise. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, that is, how a thread beginning somewhere in the middle of the input should interpret a certain symbol (e.g., whether to interpret a comma as a delimiter or as part of a larger string enclosed in double-quotes). Instead of tailoring the approach to a single format, we are able to perform a massively parallel FSM simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering the end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
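The abstract does not spell out how the FSM simulation resolves a thread's parsing context. One common way to do it, sketched here as an assumption rather than as the authors' implementation, is to have every chunk simulate the automaton from all possible start states, record the result as a state-transition vector, and then combine the vectors with a prefix scan. The two-state quote automaton, the function names, and the use of plain Python loops in place of GPU threads are all hypothetical choices made for illustration.

```python
from itertools import accumulate

# Hypothetical two-state automaton (an assumption, not the paper's DFA):
# a double quote toggles between "unquoted" and "quoted".
UNQUOTED, QUOTED = 0, 1
STATES = (UNQUOTED, QUOTED)

def step(state, ch):
    """Advance the automaton by one input symbol."""
    if ch == '"':
        return QUOTED if state == UNQUOTED else UNQUOTED
    return state

def chunk_vector(chunk):
    """Run the automaton over a chunk from every start state.
    vec[s] is the end state when the chunk is entered in state s."""
    vec = []
    for s in STATES:
        for ch in chunk:
            s = step(s, ch)
        vec.append(s)
    return tuple(vec)

def compose(first, second):
    """Transition vector of `first` followed by `second` (associative)."""
    return tuple(second[first[s]] for s in STATES)

def chunk_start_states(data, chunk_size, initial=UNQUOTED):
    """Resolve the true start state of every chunk without a symbol-level
    sequential pass: per-chunk vectors are independent, and the scan over
    them is associative, so both steps can be parallelised."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    vectors = [chunk_vector(c) for c in chunks]
    identity = tuple(STATES)
    # Exclusive prefix scan: prefix[i] covers chunks 0..i-1.
    prefixes = accumulate(vectors[:-1], compose, initial=identity)
    return chunks, [p[initial] for p in prefixes]

def field_delimiters(data, chunk_size):
    """Offsets of commas and newlines that act as real delimiters."""
    chunks, starts = chunk_start_states(data, chunk_size)
    found, offset = [], 0
    for chunk, state in zip(chunks, starts):
        for i, ch in enumerate(chunk):
            if ch in ",\n" and state == UNQUOTED:
                found.append(offset + i)
            state = step(state, ch)
        offset += len(chunk)
    return found

if __name__ == "__main__":
    sample = 'a,"b,c",d\n1,"x""y,z",2\n'
    print(field_delimiters(sample, 4))  # [1, 7, 9, 11, 20, 22]
```

Because `compose` is associative, the scan over transition vectors could run as a parallel prefix scan, and the result is the same for any chunk size. Swapping in a richer automaton only changes `STATES` and `step`, which is the kind of flexibility the abstract attributes to the FSM-based approach.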

## Figures

21 figures with captions in the complete paper: https://tomesphere.com/paper/1905.13415/full.md

---
Source: https://tomesphere.com/paper/1905.13415