E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Minghua He, Chiming Duan, Zhaoyang Liu, Bolin Ding, Ying Li

TL;DR
This paper presents E2E-REME, an end-to-end model for microservice auto-remediation trained with reinforcement fine-tuning, which outperforms existing LLM-based approaches in accuracy and efficiency.
Contribution
It introduces a new task, E2E-MR, and a benchmark, MicroRemed, along with a novel auto-remediation model trained via experience-simulation reinforcement fine-tuning.
Findings
E2E-REME achieves higher accuracy than nine baseline LLMs.
E2E-REME demonstrates improved efficiency in microservice failure recovery.
The benchmark MicroRemed enables comprehensive evaluation of auto-remediation methods.
Abstract
Contemporary microservice systems continue to grow in scale and complexity, leading to increasingly frequent and costly failures. While recent LLM-based auto-remediation approaches have emerged, they primarily translate textual instructions into executable Ansible playbooks and rely on expert-crafted prompts, lacking runtime knowledge guidance and depending on large-scale general-purpose LLMs, which limits their accuracy and efficiency. We introduce End-to-End Microservice Remediation (E2E-MR), a new task that requires directly generating executable playbooks from diagnosis reports to autonomously restore faulty systems. To enable rigorous evaluation, we build MicroRemed, a benchmark that automates microservice deployment, failure injection, playbook execution, and post-repair verification. We further propose E2E-REME, an end-to-end auto-remediation model…
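The four benchmark stages named in the abstract (deployment, failure injection, playbook execution, post-repair verification) can be pictured as a simple evaluation loop. The sketch below is only an illustration under stated assumptions and is not the MicroRemed implementation: the system is an in-memory dictionary, a "playbook" is a list of `restart <service>` strings standing in for a real Ansible playbook, and every name (`Case`, `deploy`, `inject_failure`, `execute_playbook`, `verify`, `run_benchmark`, `toy_model`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a microservice system: service name -> healthy?
SystemState = Dict[str, bool]


@dataclass
class Case:
    faulty_service: str    # service the failure is injected into
    diagnosis_report: str  # text handed to the remediation model


def deploy(services: List[str]) -> SystemState:
    """Stage 1: bring up a fresh system with every service healthy."""
    return {name: True for name in services}


def inject_failure(state: SystemState, service: str) -> None:
    """Stage 2: mark one service as failed."""
    state[service] = False


def execute_playbook(state: SystemState, playbook: List[str]) -> None:
    """Stage 3: apply each step; a step here is just 'restart <service>'."""
    for step in playbook:
        action, _, target = step.partition(" ")
        if action == "restart" and target in state:
            state[target] = True


def verify(state: SystemState) -> bool:
    """Stage 4: the repair succeeds only if all services are healthy."""
    return all(state.values())


def run_benchmark(
    services: List[str],
    cases: List[Case],
    generate_playbook: Callable[[str], List[str]],
) -> float:
    """Run every case through the four stages and return the success rate."""
    successes = 0
    for case in cases:
        state = deploy(services)
        inject_failure(state, case.faulty_service)
        # End-to-end step: diagnosis report in, playbook out.
        playbook = generate_playbook(case.diagnosis_report)
        execute_playbook(state, playbook)
        successes += verify(state)
    return successes / len(cases)


def toy_model(report: str) -> List[str]:
    """Toy policy (assumption): restart whatever the report's last word names."""
    return ["restart " + report.split()[-1]]


if __name__ == "__main__":
    cases = [
        Case("cart", "failure detected in cart"),
        Case("payment", "payment is down"),
    ]
    print(run_benchmark(["cart", "payment"], cases, toy_model))
```

Each case starts from a fresh deployment so that one failed repair cannot contaminate the next, and success is judged by the verified system state rather than by inspecting the playbook text.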
