LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs
Baodi Shan, Aabid Shamji, Jiannan Tian, Guanpeng Li, Dingwen Tao

TL;DR
This paper introduces LCFI, a fault injection tool designed to systematically analyze how lossy compression errors propagate and impact HPC applications, addressing a gap in understanding error effects in high-performance computing.
Contribution
The work presents a novel fault injection approach for lossy compressors, a customizable tool for systematic analysis, and an evaluation on HPC benchmarks revealing error propagation insights.
Findings
Lossy compression errors can significantly affect HPC program outputs.
The fault injection approach accurately models error propagation.
LCFI provides a comprehensive understanding of lossy compression error impacts.
Abstract
Error-bounded lossy compression is becoming increasingly important to today's extreme-scale HPC applications because of the ever-increasing volume of data they generate. It has been widely used in in-situ visualization, data stream intensity reduction, storage reduction, I/O performance improvement, checkpoint/restart acceleration, memory footprint reduction, etc. Although many works have optimized ratio, quality, and performance for different error-bounded lossy compressors, no existing work attempts to systematically understand the impact of lossy compression errors on HPC applications due to error propagation. In this paper, we propose and develop a lossy compression fault injection tool, called LCFI. To the best of our knowledge, this is the first fault injection tool that helps both lossy compressor developers and users to systematically and…
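To make the idea of injecting error-bounded faults concrete, here is a minimal Python sketch. It is not LCFI's implementation, which this summary does not describe. The function names (`inject_bounded_error`, `prefix_sum`), the uniform noise model, and the toy running-sum "application" are all assumptions chosen for illustration. The sketch perturbs each input value by at most an absolute error bound, as an error-bounded lossy compressor would guarantee, and then compares how far the program output drifts.

```python
import random


def inject_bounded_error(values, error_bound, seed=0):
    """Perturb each value by uniform noise in [-error_bound, +error_bound].

    Assumption: uniform noise stands in for real compressor error, which
    may be distributed differently in practice.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-error_bound, error_bound) for v in values]


def prefix_sum(values):
    """Toy 'application': a running sum, where per-element errors can accumulate."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


data = [float(i) for i in range(1000)]
error_bound = 1e-3
faulty = inject_bounded_error(data, error_bound)

# Largest error on the inputs (bounded by construction) versus the outputs
# (can grow as the error propagates through the computation).
max_input_err = max(abs(a - b) for a, b in zip(data, faulty))
max_output_err = max(
    abs(a - b) for a, b in zip(prefix_sum(data), prefix_sum(faulty))
)
print(f"max input error:  {max_input_err:.3e}")
print(f"max output error: {max_output_err:.3e}")
```

In this toy setup the input error stays within the bound, while the output error can reach up to the number of elements times the bound in the worst case. That gap between the two is the kind of propagation effect a fault injection study sets out to measure.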
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
