Exploiting Data Reduction Principles in Cloud-Based Data Management for Cryo-Image Data
Kashish Ara Shakil, Ari Ora, Mansaf Alam, Shabih Shakeel

TL;DR
This paper demonstrates that applying data reduction techniques like PCA in a cloud environment can significantly decrease storage costs for cryo-EM data, with potential scalability to other large datasets.
Contribution
The study introduces a scalable, cost-effective cloud-based data reduction method using PCA and MapReduce for cryo-EM data management.
Findings
Data reduction reduces storage costs by approximately 27%.
Efficient PCA implementation enables handling terabyte-scale cryo-EM data.
Method is scalable to various large-volume data types.
Abstract
Cloud computing is a cost-effective way for start-up life sciences laboratories to store and manage their data. However, in many instances the data stored in the cloud could be redundant, which makes cloud-based data management inefficient and costly because one has to pay for every byte stored. Here, we tested efficient management of data generated by an electron cryo-microscopy (cryoEM) lab in a cloud-based environment. The test data were obtained from the cryoEM repository EMPIAR. All images were subjected to an in-house parallelized version of principal component analysis (PCA), with an efficient cloud-based MapReduce modality used for parallelization. We showed that data on the order of terabytes could be efficiently reduced to its minimal essential self in a cost-effective, scalable manner. Furthermore, a spot instance on Amazon EC2 was shown to reduce costs by a…
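To make the idea of MapReduce-parallelized PCA concrete, here is a minimal single-machine sketch in pure Python. It is not the authors' in-house implementation, which this page does not describe. It assumes each image is a flattened list of pixel values, that the map step summarizes one chunk of images as partial sums, and that the reduce step merges those sums into a covariance matrix. All function names (map_partial, reduce_partial, covariance, top_component, project) are hypothetical, and power iteration stands in for a full eigendecomposition.

```python
from functools import reduce
import math

def map_partial(chunk):
    """Map step: summarize one chunk as (count, sum vector, sum of outer products)."""
    d = len(chunk[0])
    s = [0.0] * d
    ss = [[0.0] * d for _ in range(d)]
    for x in chunk:
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                ss[i][j] += x[i] * x[j]
    return len(chunk), s, ss

def reduce_partial(a, b):
    """Reduce step: merge two partial summaries by element-wise addition."""
    n_a, s_a, ss_a = a
    n_b, s_b, ss_b = b
    s = [u + v for u, v in zip(s_a, s_b)]
    ss = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(ss_a, ss_b)]
    return n_a + n_b, s, ss

def covariance(n, s, ss):
    """Turn the merged sums into the mean vector and sample covariance matrix."""
    d = len(s)
    mean = [v / n for v in s]
    cov = [[(ss[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov

def top_component(cov, iters=200):
    """Power iteration for the leading eigenvector (assumed stand-in for a full PCA)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # fails if the true component is orthogonal to this start
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project(x, mean, v):
    """Reduce one image to a single coefficient along the component v."""
    return sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))

# Toy data: two chunks of three-pixel "images" lying along the direction (1, 2, 0).
chunks = [
    [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0]],
    [[3.0, 6.0, 0.0], [4.0, 8.0, 0.0]],
]
n, s, ss = reduce(reduce_partial, (map_partial(c) for c in chunks))
mean, cov = covariance(n, s, ss)
v = top_component(cov)
coeffs = [project(x, mean, v) for chunk in chunks for x in chunk]
```

Because the partial sums are merged by plain addition, the map step can run independently on each chunk, which is what makes this formulation a natural fit for MapReduce. Storing one coefficient per image plus the shared mean and component, instead of every pixel, is the data reduction; a real pipeline would keep several components and use a numerically more stable covariance update.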
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
