Theseus: A Distributed and Scalable GPU-Accelerated Query Processing Platform Optimized for Efficient Data Movement

Felipe Aramburú; William Malpica; Kaouther Abrougui; Amin Aramoon; Romulo Auccapuclla; Claude Brisson; Matthijs Brobbel; Colby Farrell; Pradeep Garigipati; Joost Hoozemans; Supun Kamburugamuve; Akhil Nair; Alexander Ocsa; Johan Peltenburg; Rubén Quesada López; Deepak Sihag; Ahmet Uyar; Dhruv Vats; Michael Wendt; Jignesh M. Patel; Rodrigo Aramburú

arXiv:2508.05029·cs.DC·August 8, 2025

TL;DR

Theseus is a distributed, GPU-accelerated query engine that optimizes data movement and computation, significantly improving performance and cost-efficiency for large-scale analytical workloads on cloud infrastructure.

Contribution

The paper introduces a scalable, enterprise-ready query engine that tightly integrates hardware-aware data movement control mechanisms for efficient GPU-accelerated analytics.

Findings

01

Outperforms Databricks Photon by up to 4x at cost parity on TPC-H benchmarks.

02

Successfully processes 100 TB scale datasets with minimal hardware.

03

Achieves efficient data handling through specialized asynchronous control mechanisms.

Abstract

Online analytical processing of queries on datasets in the many-terabyte range is only possible with costly distributed computing systems. To decrease the cost and increase the throughput, systems can leverage accelerators such as GPUs, which are now ubiquitous in the compute infrastructure. This introduces many challenges, the majority of which are related to when, where, and how to best move data around the system. We present Theseus -- a production-ready enterprise-scale distributed accelerator-native query engine designed to balance data movement, memory utilization, and computation in an accelerator-based system context. Specialized asynchronous control mechanisms are tightly coupled to the hardware resources for the purpose of network communication, data pre-loading, data spilling across memories and storage, and GPU compute tasks. The memory subsystem contains a mechanism for…
