How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

Almond Kiruthu Murimi

arXiv:2511.11749·cs.DC·November 18, 2025

Open Access

TL;DR

This paper explores how machine learning-based data replication strategies can improve fault tolerance in large-scale distributed systems by enabling adaptive, predictive, and self-optimizing data management to reduce downtime and resource wastage.

Contribution

The paper proposes adaptive replication mechanisms that use predictive analytics and reinforcement learning to improve fault tolerance in distributed systems.

Findings

01 ML-driven strategies outperform traditional methods in fault prediction
02 Adaptive replication reduces system downtime significantly
03 The paper offers recommendations for deploying ML solutions in real-world systems

Abstract

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to adapt to dynamic workloads and unexpected failures, leading to inefficient resource utilization and prolonged downtime. By integrating machine learning techniques, specifically predictive analytics and reinforcement learning, the study proposes adaptive replication mechanisms capable of forecasting system failures and optimizing data placement in real time. Through an extensive literature review, qualitative analysis, and comparative evaluations with traditional approaches, the paper identifies key limitations in existing replication strategies and highlights the transformative potential of machine learning in creating more resilient, self-optimizing…
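The abstract contrasts static replication with replication that adapts to forecast failures, but this page does not give the mechanism. The following is only a minimal sketch of the general idea, not the paper's method. It assumes some predictor (not shown) supplies a per-node failure probability, that replica failures are independent, and that data is lost only if every replica fails. The function name `replicas_needed`, the availability target, and the replica bounds are all hypothetical.

```python
def replicas_needed(p_fail, target_availability=0.999,
                    min_replicas=2, max_replicas=6):
    """Return the smallest replica count r (within the given bounds) such that
    p_fail ** r <= 1 - target_availability.

    Assumption: each replica fails independently with probability p_fail,
    so the chance that all r replicas fail together is p_fail ** r.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be in (0, 1)")

    allowed_loss = 1.0 - target_availability
    r = min_replicas
    # Add replicas until the joint failure probability is acceptable
    # or the upper bound is reached.
    while r < max_replicas and p_fail ** r > allowed_loss:
        r += 1
    return r


if __name__ == "__main__":
    # Hypothetical predicted failure probabilities over successive intervals.
    for p in (0.01, 0.05, 0.2, 0.5):
        print(f"predicted p_fail={p:.2f} -> replicas={replicas_needed(p)}")
```

A static policy would hold the replica count fixed regardless of `p_fail`. In this sketch the count rises as predicted risk rises and falls back when it drops, which is one simple way to trade storage cost against availability.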

Taxonomy

Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Cloud Computing and Resource Management