
TL;DR
This paper proposes a vision for a SmartNIC that offloads data decoding and pushed-down operators in cloud data lakes. Estimates with DuckDB suggest significantly higher query performance and the option of using smaller, less expensive CPUs.
Contribution
It presents a vision for a cloud data processing SmartNIC that sits on the network datapath of compute nodes and offloads decoding and pushed-down operators, hiding the cost of parsing raw files.
Findings
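Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files.
Estimates with DuckDB suggest that operating directly on pre-filtered data, as a SmartNIC would deliver it, can significantly increase query processing performance.
With that offloading, smaller, less expensive CPUs can still match the query throughput of traditional setups.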
Abstract
Data lakes spend a significant fraction of query execution time on scanning data from remote, disaggregated storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of parsing raw files. Our experimental estimations with DuckDB suggest that by operating directly on pre-filtered data, as delivered by a SmartNIC, we can significantly increase query processing performance and can still match query throughput of traditional setups with smaller, less expensive CPUs.
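To put the 46% figure in perspective, the sketch below applies a simple Amdahl-style bound. It is not taken from the paper. It assumes, as an idealization, that the SmartNIC hides decoding completely and adds no overhead of its own, and it ignores any further gain from pushed-down operators. The function name `decode_offload_speedup` is hypothetical.

```python
def decode_offload_speedup(decode_fraction: float) -> float:
    """Amdahl-style bound on speedup if a fraction of runtime is fully hidden.

    Assumption (not from the paper): the offloaded work costs the CPU
    nothing and the remaining work is unchanged.
    """
    if not 0.0 <= decode_fraction < 1.0:
        raise ValueError("decode_fraction must be in [0, 1)")
    return 1.0 / (1.0 - decode_fraction)


# 0.46 is the decoding share reported in the abstract (TPC-H on Parquet).
DECODE_FRACTION = 0.46

speedup = decode_offload_speedup(DECODE_FRACTION)
print(f"Idealized speedup from hiding decoding alone: {speedup:.2f}x")
```

Under these assumptions, hiding decoding alone would bound the gain at roughly 1.85x. Filtering rows and columns on the NIC before they reach the CPU is a separate effect that this calculation does not capture.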
