A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores

Fatemeh Hossein-Khani; Omid Akbari

arXiv:2412.19340·cs.LG·December 30, 2024

Open Access

TL;DR

This paper introduces a reinforcement learning-based task mapping method for manycore systems that enhances reliability by minimizing thermal variations, thereby increasing the mean time to failure without offline parameter tuning.

Contribution

It presents a novel RL-based approach for runtime task mapping considering aging effects, outperforming existing methods in reliability improvement.

Findings

1. Up to 27% increase in mean time to failure (MTTF).

2. Effective runtime reliability enhancement without offline parameter tuning.

3. Validated on systems with 16, 32, and 64 cores using benchmark applications.

Abstract

The increasing scale of manycore systems poses significant challenges in managing reliability while meeting performance demands. At the same time, these systems become more susceptible to aging mechanisms such as negative-bias temperature instability (NBTI), hot carrier injection (HCI), and thermal cycling (TC), as well as the electromigration (EM) phenomenon. In this paper, we propose a reinforcement learning (RL)-based task mapping method that improves the reliability of manycore systems under these aging mechanisms. The method consists of three steps: bin packing, task-to-bin mapping, and task-to-core mapping. In the first step, the density-based spatial clustering of applications with noise (DBSCAN) method is employed to form clusters (bins) based on the cores' temperatures. Then, the Q-learning algorithm is used for the latter two steps, to map the…
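The two building blocks named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `dbscan_1d` and `q_update`, the parameters `eps` and `min_pts`, the temperature values, and the state, action, and reward encodings are all assumptions made for the example, since the abstract does not specify them. The first function groups cores into bins by temperature using a plain DBSCAN over scalar values; the second applies the standard tabular Q-learning update.

```python
from collections import deque


def dbscan_1d(values, eps, min_pts):
    """Cluster scalar values with DBSCAN; returns one label per value.

    Labels are 0, 1, ... for clusters and -1 for noise. The parameters
    eps and min_pts are illustrative; the paper's settings are not given
    in the abstract.
    """
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, keep expanding
    return labels


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update on a dict keyed by (state, action).

    What a state, action, or reward means here (e.g. action = target bin)
    is a hypothetical encoding, not taken from the paper.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# Hypothetical per-core temperatures in degrees Celsius.
core_temps = [61.0, 61.5, 62.0, 70.0, 70.4, 71.0, 85.0, 62.3]
bins = dbscan_1d(core_temps, eps=1.0, min_pts=2)

# One illustrative learning step: map a task (state "t0") to bin 1.
q_table = {}
new_q = q_update(q_table, "t0", 1, reward=1.0, next_state="t1",
                 actions=sorted(set(b for b in bins if b >= 0)))
```

In this sketch an isolated hot core ends up labeled as noise rather than forced into a bin, which is one reason a density-based method can suit temperature grouping; how the paper treats such cores is not stated in the abstract.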

Taxonomy

Topics: Machine Learning and ELM · Elevator Systems and Control · Industrial Vision Systems and Defect Detection

Methods: Q-Learning