User-Defined Functions for HDF5
Lucas C. Villa Real, Maximilien de Bayser

TL;DR
This paper introduces an infrastructure for HDF5 that allows user-defined functions to be embedded and executed dynamically during dataset access, enabling flexible data processing and extension.
Contribution
It presents a novel architecture for integrating user-defined functions into HDF5 files, supporting on-demand data processing, security, and hardware acceleration.
Findings
Enables on-the-fly dataset population with UDFs
Supports hardware accelerators and computational storage
Demonstrates extended use cases for scientific data
Abstract
Scientific datasets are known for their challenging storage demands and the associated processing pipelines that transform their information. Some of those processing tasks include filtering, cleansing, aggregation, normalization, and data format translation -- all of which generate even more data. In this paper, we present an infrastructure for the HDF5 file format that enables dataset values to be populated on the fly: task-related scripts can be attached to HDF5 files and execute only when the dataset is read by an application. We provide details on the software architecture that supports user-defined functions (UDFs) and how it integrates with hardware accelerators and computational storage. Moreover, we describe the built-in security model that limits the system resources a UDF can access. Last, we present several use cases that show how UDFs can be used to extend scientific…
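The idea of a dataset whose values are produced only when it is read can be illustrated with a small, self-contained sketch. This is not the paper's implementation and does not use the HDF5 library: `ToyFile`, `attach_udf`, and `to_fahrenheit` are hypothetical names invented here, and an in-memory dictionary stands in for the file. It only shows the access pattern the abstract describes, where a function is attached under a dataset name and runs at read time.

```python
from typing import Callable, Dict, List


class ToyFile:
    """Hypothetical in-memory stand-in for a file holding datasets.

    Regular datasets store values. UDF-backed datasets store only a
    function, which is executed each time the dataset is read.
    """

    def __init__(self) -> None:
        self._stored: Dict[str, List[float]] = {}
        self._udfs: Dict[str, Callable[["ToyFile"], List[float]]] = {}
        self.udf_calls = 0  # counts how many times any UDF has run

    def create_dataset(self, name: str, values: List[float]) -> None:
        # Materialized dataset: the values themselves are kept.
        self._stored[name] = list(values)

    def attach_udf(self, name: str, udf: Callable[["ToyFile"], List[float]]) -> None:
        # UDF-backed dataset: nothing is computed at attach time.
        self._udfs[name] = udf

    def read(self, name: str) -> List[float]:
        if name in self._udfs:
            # Values are populated on the fly, at read time.
            self.udf_calls += 1
            return self._udfs[name](self)
        return list(self._stored[name])


def to_fahrenheit(f: ToyFile) -> List[float]:
    # Example UDF (assumed for illustration): derive one dataset from another.
    return [c * 9 / 5 + 32 for c in f.read("temperature_c")]


f = ToyFile()
f.create_dataset("temperature_c", [0.0, 25.0, 100.0])
f.attach_udf("temperature_f", to_fahrenheit)

assert f.udf_calls == 0             # attaching did not run the UDF
values = f.read("temperature_f")    # the UDF runs here
print(values)
```

In this sketch the derived dataset takes no storage beyond the function itself, which mirrors the motivation given above: transformations such as normalization or format translation need not generate additional stored data.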
Taxonomy
Topics: Advanced Data Storage Technologies · Scientific Computing and Data Management · Parallel Computing and Optimization Techniques
