How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems
Almond Kiruthu Murimi

TL;DR
This paper explores how machine learning-based data replication strategies can improve fault tolerance in large-scale distributed systems by enabling adaptive, predictive, and self-optimizing data management to reduce downtime and resource wastage.
Contribution
It introduces novel adaptive replication mechanisms utilizing predictive analytics and reinforcement learning to enhance fault tolerance in distributed systems.
Findings
ML-driven strategies outperform traditional methods in fault prediction
Adaptive replication reduces system downtime significantly
Recommendations for deploying ML solutions in real-world systems
Abstract
This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to adapt to dynamic workloads and unexpected failures, leading to inefficient resource utilization and prolonged downtime. By integrating machine learning techniques, specifically predictive analytics and reinforcement learning, the study proposes adaptive replication mechanisms capable of forecasting system failures and optimizing data placement in real time. Through an extensive literature review, qualitative analysis, and comparative evaluations with traditional approaches, the paper identifies key limitations in existing replication strategies and highlights the transformative potential of machine learning in creating more resilient, self-optimizing…
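As a rough illustration of what "adaptive replication" driven by failure forecasts could look like, the sketch below picks a replica count from a predicted per-node failure probability. It is not the paper's method: the function name `replicas_needed`, the independence assumption between node failures, and the `target_loss`, `min_r`, and `max_r` parameters are all hypothetical choices made for this example.

```python
def replicas_needed(p_fail: float, target_loss: float,
                    min_r: int = 1, max_r: int = 10) -> int:
    """Return the smallest replica count r in [min_r, max_r] such that
    p_fail ** r <= target_loss.

    Assumption (not from the paper): replicas fail independently, so the
    chance that all r replicas are lost together is p_fail ** r.
    p_fail stands in for the output of some failure-prediction model.
    """
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    if not 0.0 < target_loss < 1.0:
        raise ValueError("target_loss must be in (0, 1)")

    r = min_r
    # Add replicas until the joint-loss estimate meets the target
    # or the configured ceiling is reached.
    while r < max_r and p_fail ** r > target_loss:
        r += 1
    return r


if __name__ == "__main__":
    # A node predicted to be less reliable gets more replicas.
    for p in (0.01, 0.05, 0.5):
        print(p, replicas_needed(p, target_loss=1e-3))
```

A static scheme would fix the replica count for every object. Here the count rises as the predicted failure probability rises and is capped by `max_r`, which is one simple way to trade storage cost against availability.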
Taxonomy
Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Cloud Computing and Resource Management
