Fault Oblivious Eigenvalue Solver
Jayanta Mukherjee, Xuejiao Kang, David F. Gleich, Ahmed Sameh, Ananth Grama

TL;DR
This paper introduces a fault-tolerant eigenvalue solver using erasure-coded computations, enabling reliable large-scale eigenvalue analysis on massively parallel systems despite hardware failures.
Contribution
It presents a novel erasure-coded approach that reformulates eigenvalue problems for fault-oblivious computation, ensuring robustness and efficiency in large-scale environments.
Findings
Fault-tolerant eigenvalue computation is demonstrated to be effective.
Overhead is minimal, and robust convergence is maintained.
The solver continues to scale efficiently as the number of faults increases.
Abstract
Eigenvalue problems serve as fundamental substrates for applications in large-scale scientific simulations and machine learning, often requiring computation on massively parallel platforms. As these platforms scale to hundreds of thousands of cores, hardware failures become a significant challenge to reliability and efficiency. In this paper, we propose and analyze a novel fault-tolerant eigenvalue solver based on erasure-coded computations -- a technique that enhances resilience by augmenting the system with redundant data a priori. This transformation reformulates the original eigenvalue problem as a generalized eigenvalue problem, enabling fault-oblivious computation while preserving numerical stability and convergence properties. We formulate the augmentation scheme, establish the necessary conditions for the encoded blocks, and prove the relationship between the original and…
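To make the reformulation concrete, here is a minimal NumPy sketch of one way an augmented generalized eigenvalue problem can tolerate erasures. It is an illustration under stated assumptions, not the authors' exact scheme: the encoding matrix `E`, the lifting matrix `Z = [I, E]`, the choice of a Vandermonde-style `E`, and the helper names `encode` and `lift_with_erasures` are all hypothetical. The idea is that if `A x = lam x`, then any `y` with `Z y = x` satisfies `(Z^T A Z) y = lam (Z^T Z) y`, and the redundant columns of `Z` leave enough freedom to choose such a `y` with the erased components set to zero.

```python
import numpy as np


def encode(A, E):
    """Build the augmented pair (Z^T A Z, Z^T Z) with Z = [I, E].

    E (n x k) is an assumed encoding matrix, not one taken from the paper.
    """
    n = A.shape[0]
    Z = np.hstack([np.eye(n), E])
    return Z.T @ A @ Z, Z.T @ Z, Z


def lift_with_erasures(Z, x, erased):
    """Find y with Z y = x and y[erased] = 0 (erased entries model lost data)."""
    keep = np.setdiff1d(np.arange(Z.shape[1]), erased)
    y = np.zeros(Z.shape[1])
    # Solve using only the surviving columns of Z.
    y[keep] = np.linalg.lstsq(Z[:, keep], x, rcond=None)[0]
    return y


n, k = 6, 2
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A = (M + M.T) / 2  # symmetric test matrix

# Assumed encoding: columns [1, t_i] with distinct t_i, so any two rows of E
# are linearly independent and any k = 2 erasures leave a solvable system.
E = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)

A_aug, B_aug, Z = encode(A, E)

# Take one eigenpair of the original problem.
eigvals, eigvecs = np.linalg.eigh(A)
lam, x = eigvals[-1], eigvecs[:, -1]

# Pretend components 1 and 4 of the augmented vector were lost to faults.
erased = np.array([1, 4])
y = lift_with_erasures(Z, x, erased)

# y solves the augmented generalized problem with the same eigenvalue.
residual = np.linalg.norm(A_aug @ y - lam * (B_aug @ y))
print("erased entries:", y[erased])
print("generalized residual:", residual)
print("recovered x matches:", np.allclose(Z @ y, x))
```

Note that the augmented pair in this sketch is singular by construction (both matrices have rank `n` in dimension `n + k`), so a practical solver would need to handle that null space; the sketch only checks the algebraic relationship between the original and augmented eigenpairs.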
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Interconnection Networks and Systems
