Oobleck: Low-Compromise Design for Fault Tolerant Accelerators
Guy Wilks, Brian Li, Jonathan Balkind

TL;DR
Oobleck introduces a modular, low-area fault-tolerant architecture for accelerators, supported by the Viscosity language, reducing data center costs and maintaining high performance under faults.
Contribution
The paper presents Oobleck, a novel modular architecture for fault-tolerant accelerators, and Viscosity, a language for hardware-software co-design, enabling efficient fault tolerance with minimal area overhead.
Findings
Maintains speedups of 1.7x to 5.16x under faults
Reduces data center costs by decreasing failure-induced chip replacements
Supports fault tolerance in FFT, AES, and DCT accelerators
Abstract
Data center hardware refresh cycles are lengthening. However, increasing processor complexity is raising the potential for faults. To achieve longevity in the face of increasingly fault-prone datapaths, fault tolerance is needed, especially in on-chip accelerator datapaths. Previously researched methods for adding fault tolerance to accelerator designs require high area, lowering chip utilisation. We propose a novel architecture for accelerator fault tolerance, Oobleck, which leverages modular acceleration to enable fault tolerance without burdensome area requirements. To streamline development and enforce modular conventions, we introduce the Viscosity language, an actor-based approach to hardware-software co-design. Viscosity uses a single description of the accelerator's function and produces both hardware and software descriptions. Our high-level models of data…
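The abstract is cut off before it describes the mechanism, so the following is only an illustrative sketch of how a modular, single-description design could tolerate faults. It assumes (this is not stated in the text above) that each accelerator stage has one functional description with both a hardware and a software realisation, and that a stage marked faulty is executed in software while the remaining stages stay in hardware. The `Stage` class, `run_pipeline` function, and the cost numbers are hypothetical and are not taken from the paper or from Viscosity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One pipeline stage (hypothetical model, not the paper's API)."""
    name: str
    fn: Callable[[int], int]  # single functional description of the stage
    hw_cost: int = 1          # made-up cost when run on the hardware module
    sw_cost: int = 10         # made-up cost when run in the software fallback
    faulty: bool = False      # True once the hardware module is known bad


def run_pipeline(stages: List[Stage], x: int) -> Tuple[int, int]:
    """Run every stage in order and return (result, total cost).

    The result is the same whether or not a stage is faulty, because both
    realisations come from the same description; only the cost changes.
    """
    total_cost = 0
    for stage in stages:
        x = stage.fn(x)
        total_cost += stage.sw_cost if stage.faulty else stage.hw_cost
    return x, total_cost


def make_demo_pipeline() -> List[Stage]:
    """Three toy stages standing in for accelerator modules."""
    return [
        Stage("double", lambda v: v * 2),
        Stage("increment", lambda v: v + 1),
        Stage("square", lambda v: v * v),
    ]
```

In this toy model, marking one stage as faulty leaves the output unchanged and raises the total cost only by the difference between that stage's software and hardware cost, so the pipeline still beats an all-software run. The real speedups reported for Oobleck depend on the actual accelerators and are not reproduced by these numbers.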
Taxonomy
Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
