Evaluation of Provenance Serialisations for Astronomical Provenance
Michael A. C. Johnson, Marcus Paradies, Hans-Rainer Klöckner, Albina Muzafarova, Kristen Lackeos, David J. Champion, Marta Dembska, and Sirko Schindler

TL;DR
This study compares the Turtle and JSON provenance serialisations for astronomical data, evaluating their efficiency in storage, upload, and querying within different database systems to inform best practices for large-scale astronomical surveys.
Contribution
It provides an empirical comparison of the Turtle and JSON provenance serialisations using representative database systems, highlighting their relative efficiencies for storage and complex querying.
Findings
Turtle serialisation is more efficient for storage and small, simple queries.
JSON serialisation performs better for complex pattern-matching queries.
Both serialisations have similar query accuracy.
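To make the two serialisations concrete, the sketch below writes one tiny provenance record both ways. It assumes the W3C PROV model (PROV-JSON layout and PROV-O terms in Turtle). The `ex:` namespace, the record contents, and the `to_turtle` helper are hypothetical and not taken from the paper. The byte counts it prints describe this toy record only and say nothing about the paper's measurements.

```python
import json

# Hypothetical provenance record in PROV-JSON layout: one activity that
# used a raw file and generated a calibrated image.
doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {"ex:raw": {}, "ex:image": {}},
    "activity": {"ex:calibrate": {}},
    "used": {"_:u1": {"prov:activity": "ex:calibrate", "prov:entity": "ex:raw"}},
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:image", "prov:activity": "ex:calibrate"}
    },
}


def to_turtle(d):
    """Render the small subset of PROV-JSON used above as PROV-O Turtle."""
    lines = ["@prefix prov: <http://www.w3.org/ns/prov#> ."]
    for name, iri in d.get("prefix", {}).items():
        lines.append(f"@prefix {name}: <{iri}> .")
    for ent in d.get("entity", {}):
        lines.append(f"{ent} a prov:Entity .")
    for act in d.get("activity", {}):
        lines.append(f"{act} a prov:Activity .")
    for rel in d.get("used", {}).values():
        lines.append(f"{rel['prov:activity']} prov:used {rel['prov:entity']} .")
    for rel in d.get("wasGeneratedBy", {}).values():
        lines.append(
            f"{rel['prov:entity']} prov:wasGeneratedBy {rel['prov:activity']} ."
        )
    return "\n".join(lines) + "\n"


json_text = json.dumps(doc)
turtle_text = to_turtle(doc)

# Serialised size in bytes for this toy record only.
sizes = {"json": len(json_text.encode()), "turtle": len(turtle_text.encode())}
print(turtle_text)
print(sizes)
```

In a setup like the one the paper describes, the Turtle text would be loaded into a triple store and the JSON into a graph database. The sketch stops at serialisation and does not touch either system.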
Abstract
Provenance data from astronomical pipelines are instrumental in establishing trust and reproducibility in the data processing and products. In addition, astronomers can query their provenance to answer questions rooted in areas such as anomaly detection, recommendation, and prediction. The next generation of astronomical survey telescopes, such as the Vera Rubin Observatory or Square Kilometre Array, are capable of producing peta- to exabyte-scale data, thereby amplifying the importance of even small improvements to the efficiency of provenance storage or querying. In order to determine how astronomers should store and query their provenance data, this paper reports on a comparison between the Turtle and JSON provenance serialisations. The triple store Apache Jena Fuseki and the graph database system Neo4j were selected as representative database management systems (DBMS) for turtle and…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices
