# Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

**Authors:** Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian

arXiv: 2508.21613 · 2026-04-21

## TL;DR

Chameleon is an adaptive fault-tolerance system for distributed training that selects a recovery strategy when a failure occurs, aiming to minimize the performance lost after recovery.

## Contribution

Chameleon combines a unified performance model, a fast execution-plan search, performance estimation, and communication optimizations to choose a fault-recovery strategy for large-scale training.

## Key findings

- Keeps post-recovery performance within 11.00% of failure-free training on a 32-card cluster.
- Achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
- Preserves model convergence and memory efficiency during fault recovery.

## Abstract

Training large language models faces frequent interruptions due to various faults, demanding robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
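
The idea of choosing a recovery strategy from a performance model can be illustrated with a minimal sketch. This is not Chameleon's actual model or search procedure, which this summary does not describe. It assumes, hypothetically, that each candidate strategy has an estimated one-off reconfiguration cost and an estimated per-iteration time after recovery, and that the selector minimizes total time over a fixed number of remaining iterations. The names `StrategyEstimate`, `total_cost`, and `select_strategy`, and all numbers, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyEstimate:
    """Hypothetical cost estimate for one recovery strategy (not from the paper)."""
    name: str
    reconfig_seconds: float   # assumed one-off cost to switch to this plan
    iteration_seconds: float  # assumed per-iteration time after recovery


def total_cost(est: StrategyEstimate, remaining_iterations: int) -> float:
    """One-off reconfiguration cost plus steady-state cost over the horizon."""
    return est.reconfig_seconds + est.iteration_seconds * remaining_iterations


def select_strategy(estimates, remaining_iterations: int) -> StrategyEstimate:
    """Return the candidate with the lowest estimated total cost."""
    if not estimates:
        raise ValueError("no candidate strategies")
    return min(estimates, key=lambda e: total_cost(e, remaining_iterations))


# Illustrative numbers only; these are not measurements from the paper.
candidates = [
    StrategyEstimate("redundant_computation", 0.0, 1.30),
    StrategyEstimate("dynamic_parallelism", 120.0, 1.05),
    StrategyEstimate("data_rerouting", 5.0, 1.20),
]

short_horizon = select_strategy(candidates, remaining_iterations=10)
long_horizon = select_strategy(candidates, remaining_iterations=10_000)
print(short_horizon.name, long_horizon.name)
```

Under these made-up numbers, a strategy with no reconfiguration cost wins over a short horizon, while a strategy with an expensive reconfiguration but faster iterations wins over a long one. That trade-off is the reason a single fixed policy can be suboptimal, which is the motivation the abstract gives for adaptive selection.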

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21613/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21613/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/2508.21613/full.md

---
Source: https://tomesphere.com/paper/2508.21613