Pseudonymization at Scale: OLCF's Summit Usage Data Case Study
Ketan Maheshwari, Sean R. Wilkinson, Alex May, Tyler Skluzacek, Olga A. Kuchar, Rafael Ferreira da Silva

TL;DR
This paper presents a scalable pseudonymization workflow for large HPC system log datasets, enabling privacy-preserving data sharing while maintaining data utility, demonstrated on the OLCF Summit supercomputer's user data.
Contribution
The paper introduces a parallelized pseudonymization workflow for large-scale HPC log data, improving processing efficiency and enabling open data sharing with privacy considerations.
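The summary does not say how the workflow maps identifiers or how it divides the work, so the following is only a minimal sketch of one common way to parallelize pseudonymization, not the authors' implementation. It assumes a keyed hash (HMAC-SHA256) as the pseudonym function and a process pool over fixed-size chunks of records. The key, the field name `user`, and every function name here are hypothetical.

```python
import hashlib
import hmac
from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Hypothetical secret key. A real deployment would generate it once and keep it
# private, since anyone holding it can recompute the mapping.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value, key=SECRET_KEY, prefix="user"):
    """Map an identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter digest raises the collision risk.
    return f"{prefix}_{digest[:12]}"


def pseudonymize_record(record, fields=("user",), key=SECRET_KEY):
    """Return a copy of the record with the sensitive fields replaced."""
    out = dict(record)
    for field in fields:
        if field in out:
            out[field] = pseudonymize(out[field], key, prefix=field)
    return out


def pseudonymize_chunk(chunk, fields=("user",), key=SECRET_KEY):
    """Process one chunk of records; this is the unit of parallel work."""
    return [pseudonymize_record(record, fields, key) for record in chunk]


def chunked(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def pseudonymize_parallel(records, fields=("user",), chunk_size=1000, workers=4):
    """Pseudonymize records across worker processes, preserving input order."""
    worker = partial(pseudonymize_chunk, fields=tuple(fields))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(worker, chunked(records, chunk_size))
        return [record for chunk in results for record in chunk]
```

Because the mapping depends only on the key and the input value, each chunk can be processed independently and the same identifier receives the same pseudonym in every chunk, which keeps per-user analysis possible on the shared data. In a script, `pseudonymize_parallel` would normally be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.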
Findings
Reduces processing time from more than 20 hours to about 2 hours.
Demonstrates scalability of pseudonymization on multiple HPC systems.
Publishes the pseudonymized dataset and workflow for community use.
Abstract
The analysis of vast amounts of data and the processing of complex computational jobs have traditionally relied upon high performance computing (HPC) systems. Understanding these analyses' needs is paramount for designing solutions that can lead to better science, and similarly, understanding the characteristics of the user behavior on those systems is important for improving user experiences on HPC systems. A common approach to gathering data about user behavior is to analyze system log data available only to system administrators. Recently at Oak Ridge Leadership Computing Facility (OLCF), however, we unveiled user behavior about the Summit supercomputer by collecting data from a user's point of view with ordinary Unix commands. Here, we discuss the process, challenges, and lessons learned while preparing this dataset for publication and submission to an open data challenge. The…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Scientific Computing and Data Management
