A pseudo-parallel Python environment for database curation
Eckhard Sutorius (1), Johann Bryant (1), Ross Collins (1), Nicholas Cross (1), Nigel Hambly (1), Mike Read (1) ((1) Scottish Universities Physics Alliance (SUPA), Institute for Astronomy, School of Physics, University of Edinburgh, UK)

TL;DR
This paper presents a hybrid Python/C++ environment that ingests pixel/image metadata and catalogue data into large astronomical archive databases in parallel, reducing processing times compared with sequential ingestion.
Contribution
It introduces a pseudo-parallel processing pipeline that efficiently ingests large datasets into databases using data splitting and multi-computer processing.
Findings
Parallel processing reduces ingestion time significantly.
Data splitting into daily chunks improves efficiency.
The approach maximizes CPU utilization during data ingestion.
Abstract
One of the major challenges in providing large databases like the WFCAM Science Archive (WSA) is minimizing ingest times for pixel/image metadata and catalogue data. In this article we describe how pipeline-processed data are ingested into the database as the first stage in building a release database, which is followed by advanced processing (source merging, seaming, detection quality flagging, etc.). To make ingestion as fast as possible, we use a mixed Python/C++ environment and run the required tasks in a simple parallel mode of operation in which the data are split into daily chunks and then processed on different computers. The resulting data files can be ingested into the database as soon as they become available. This flexible way of handling the data makes the most of the available CPUs, as a comparison with sequential processing shows.
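The chunk-and-ingest pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: a local thread pool stands in for the separate computers, and the names `split_by_night`, `build_ingest_file`, `ingest`, and `run_pipeline`, as well as the sample dates and frame names, are hypothetical. It only shows the general idea of grouping records by observation date, processing each daily chunk independently, and loading each result as soon as it is ready.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import date


def split_by_night(records):
    """Group (observation_date, payload) records into daily chunks."""
    chunks = defaultdict(list)
    for obs_date, payload in records:
        chunks[obs_date].append(payload)
    return dict(chunks)


def build_ingest_file(obs_date, payloads):
    """Hypothetical stand-in for the per-chunk work (e.g. metadata extraction)."""
    return obs_date, [f"{obs_date.isoformat()}:{p}" for p in payloads]


def ingest(rows, database):
    """Hypothetical stand-in for loading one finished chunk into the database."""
    database.extend(rows)


def run_pipeline(records, max_workers=4):
    """Process daily chunks concurrently and ingest each one once it is finished."""
    database = []
    chunks = split_by_night(records)
    # Assumption: a thread pool replaces the separate computers of the real setup.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(build_ingest_file, d, p) for d, p in chunks.items()]
        # as_completed yields chunks in finishing order, not submission order,
        # so ingestion never waits for a slower, earlier chunk.
        for future in as_completed(futures):
            _, rows = future.result()
            ingest(rows, database)
    return database


# Made-up sample records for illustration only.
records = [
    (date(2005, 4, 1), "frame_a"),
    (date(2005, 4, 1), "frame_b"),
    (date(2005, 4, 2), "frame_c"),
]
database = run_pipeline(records)
```

Because chunks are ingested in completion order, the order of rows in `database` can differ between runs; only the set of ingested rows is fixed.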
Taxonomy
Topics: Computational Physics and Python Applications · Scientific Computing and Data Management · Distributed and Parallel Computing Systems
