AwkwardForth: accelerating Uproot with an internal DSL
Jim Pivarski, Ianna Osborne, Pratyush Das, David Lange, and Peter Elmer

TL;DR
AwkwardForth is a portable, Forth-based virtual machine for deserializing complex data formats into Awkward Arrays. For record-oriented formats it is about as fast as C++ ROOT and 10-80x faster than fastavro.
Contribution
It presents a virtual machine-based approach to data deserialization that is more portable than just-in-time compilation while remaining fast.
Findings
Deserialization speeds are comparable to C++ ROOT for record formats.
AwkwardForth programs are 10-80x faster than fastavro.
Columnar formats need specialization only to interpret metadata, so they are faster with precompiled code.
Abstract
File formats for generic data structures, such as ROOT, Avro, and Parquet, pose a problem for deserialization: it must be fast, but its code depends on the type of the data structure, not known at compile-time. Just-in-time compilation can satisfy both constraints, but we propose a more portable solution: specialized virtual machines. AwkwardForth is a Forth-driven virtual machine for deserializing data into Awkward Arrays. As a language, it is not intended for humans to write, but it loosens the coupling between Uproot and Awkward Array. AwkwardForth programs for deserializing record-oriented formats (ROOT and Avro) are about as fast as C++ ROOT and 10-80x faster than fastavro. Columnar formats (simple TTrees, RNTuple, and Parquet) only require specialization to interpret metadata and are therefore faster with precompiled code.
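To make the idea of a specialized virtual machine concrete, here is a toy sketch in Python. It is not AwkwardForth and does not use its syntax. The instruction names (`read`, `dup`, `write`, `write_cumulative`, `repeat`) and the byte layout (a little-endian uint32 count followed by that many float64 values per record) are assumptions made purely for illustration. The point it shows is that the program is ordinary data built at runtime from the data type, and a fixed interpreter runs it to fill columnar outputs (list offsets plus flat content).

```python
import struct


def execute(program, data, pos, stack, outputs):
    """Run one pass of a tiny stack-based program; return the new read position."""
    for op, *args in program:
        if op == "read":
            # Parse one value with a struct format code and push it on the stack.
            (value,) = struct.unpack_from(args[0], data, pos)
            pos += struct.calcsize(args[0])
            stack.append(value)
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "write":
            # Pop a value and append it to a named output buffer.
            outputs.setdefault(args[0], []).append(stack.pop())
        elif op == "write_cumulative":
            # Pop a count and append the running total (list offsets).
            out = outputs.setdefault(args[0], [0])
            out.append(out[-1] + stack.pop())
        elif op == "repeat":
            # Pop a count and run the nested body that many times.
            for _ in range(stack.pop()):
                pos = execute(args[0], data, pos, stack, outputs)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return pos


def deserialize(program, data, num_records):
    """Apply the program once per record and collect the output buffers."""
    outputs, stack, pos = {}, [], 0
    for _ in range(num_records):
        pos = execute(program, data, pos, stack, outputs)
    return outputs


# Hypothetical program for the type "list of float64", built at runtime.
program = [
    ("read", "<I"),                   # number of items in this record
    ("dup",),
    ("write_cumulative", "offsets"),
    ("repeat", [("read", "<d"), ("write", "content")]),
]

records = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
data = b"".join(
    struct.pack("<I", len(r)) + struct.pack(f"<{len(r)}d", *r) for r in records
)

result = deserialize(program, data, len(records))
print(result["offsets"], result["content"])
```

Because only the program changes with the data type, the interpreter itself can be compiled once ahead of time, which is the portability advantage over just-in-time compilation that the abstract describes. A pure-Python interpreter like this one would not be fast; it only illustrates the structure.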
