Design of a Near-Ideal Fault-Tolerant Routing Algorithm for Network-on-Chip-Based Multicores
Costas Iordanou, Vassos Soteriou, Konstantinos Aisopos

TL;DR
This paper introduces Hermes, a fault-tolerant routing algorithm for Network-on-Chip-based multicores that maintains high throughput and robustness as the number of faulty links increases, keeping the chip operating reliably.
Contribution
Hermes is a novel hybrid routing algorithm that combines load-balancing and preconfigured escape paths to enhance fault tolerance and performance in NoC-based multicores.
Findings
Hermes improves network throughput by up to 3x over existing methods.
Hermes's performance degrades gracefully as the number of faulty links increases.
Hermes effectively detects network partitions caused by dense faults.
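
To make the last finding concrete, here is a minimal sketch of what "network partition" means when links fail. It is not the Hermes mechanism: the abstract says Hermes operates in a distributed mode, whereas this is a centralized breadth-first search used only for illustration. The 2D mesh topology, the bidirectional fault model, and the names `mesh_links` and `find_partitions` are all assumptions that do not come from the paper.

```python
from collections import deque


def mesh_links(width, height):
    """Return every bidirectional link of a width x height 2D mesh (assumed topology)."""
    links = set()
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < height:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def find_partitions(width, height, faulty_links):
    """Group routers into connected components using only healthy links.

    faulty_links is an iterable of (node_a, node_b) pairs; a fault is assumed
    to disable the link in both directions. More than one returned component
    means the faults have partitioned the network.
    """
    healthy = mesh_links(width, height) - {frozenset(link) for link in faulty_links}

    # Build an adjacency list from the surviving links.
    neighbors = {(x, y): [] for y in range(height) for x in range(width)}
    for link in healthy:
        a, b = tuple(link)
        neighbors[a].append(b)
        neighbors[b].append(a)

    unvisited = set(neighbors)
    partitions = []
    while unvisited:
        start = min(unvisited)  # deterministic choice of the next seed router
        unvisited.remove(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        partitions.append(component)
    return partitions


if __name__ == "__main__":
    # Cutting both links of corner router (0, 0) in a 3x3 mesh isolates it.
    faults = [((0, 0), (1, 0)), ((0, 0), (0, 1))]
    parts = find_partitions(3, 3, faults)
    print(len(parts), [len(p) for p in parts])
```

A single faulty link in a mesh never partitions it, because every router has an alternative path; partitions only appear once faults are dense enough to cut every link around a region, which is the case the finding refers to.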
Abstract
With relentless CMOS technology downsizing, Networks-on-Chip (NoCs) are inescapably experiencing escalating susceptibility to wearout and reduced reliability. While faults in processors and memories may be masked via redundancy, or mitigated via techniques such as task migration, NoCs are especially vulnerable to hardware faults, as a single link breakdown may cause inter-tile communication to halt indefinitely, rendering the whole multicore chip inoperable. As such, NoCs risk becoming the pivotal point of failure in the chip multicores that utilize them. Aiming towards seamless NoC operation in the presence of faulty links, we propose Hermes, a near-ideal fault-tolerant routing algorithm that meets the objectives of exhibiting high levels of robustness, operating in a distributed mode, guaranteeing freedom from deadlocks, and evening out traffic, among others. Hermes is a…
Taxonomy
Topics: Interconnection Networks and Systems · Advanced Memory and Neural Computing · Distributed systems and fault tolerance
