Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian

TL;DR
Chameleon is an adaptive fault-tolerance system for distributed training that intelligently selects recovery strategies to minimize performance loss during failures.
Contribution
It introduces a unified performance model and efficient selection mechanism for optimal fault recovery strategies in large-scale training.
Findings
Maintains within 11% performance gap post-recovery compared to failure-free training.
Achieves up to 1.229x and 1.355x higher throughput than Oobleck and Recycle.
Preserves model convergence and memory efficiency during fault recovery.
Abstract
Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon maintains a performance gap of within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
Yuhang Zhou1, Zhibin Wang1,∗, Peng Jiang1, Haoran Xia1, Junhe Lu1,
Qianyu Jiang1, Rong Gu1, Hengxi Xu2, Xinjing Huang2, Guanghuan Fang2,
Zhiheng Hu2, Jingyi Zhang2, Yongjin Cai2, Jian He2, Chen Tian1
1 State Key Laboratory for Novel Software Technology, Nanjing University, China 2 Huawei, China
- Corresponding author: [email protected]
Abstract
Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods—redundant computation, dynamic parallelism, and data rerouting—each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, quick execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon has a performance gap of within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229 and 1.355 higher average throughput than Oobleck and Recycle, respectively.
I Introduction
In recent years, large language models have gained significant popularity because of their impressive capabilities in various tasks, such as natural language understanding, code generation, and dialogue systems [9, 1]. With the increasing complexity and scale of these models, long-term and large-scale training has become a common practice. For example, training the 405B Llama 3.1 model required roughly 30.8 million GPU hours [9]. The long-term, large-scale training process can be interrupted on average every few hours due to various faults [31], such as hardware failures, software bugs, or network issues. Thus, fault-tolerant training techniques are crucial for maintaining model performance and stability, especially in distributed and resource-constrained environments.
In practice, the entire fault-tolerant training process typically consists of three phases. (i) Fault-free phase: The system trains according to the given configuration when no faults occur. (ii) Fault-handling phase: When a fault occurs, training is interrupted, and the system takes measures to recover, such as restoring from checkpoints [5, 13], backup servers [33], or adjusting the configuration using only the available resources [28, 12, 7, 30]. (iii) Post-recovery phase: After recovery, due to the absence of backups and possible node loss, training efficiency may decrease compared to the fault-free phase. Given that many users operate under resource constraints, backup-free fault-tolerant solutions have become a focal point.
However, existing backup-free fault-tolerant training methods are often tailored to specific training phases, leading to inherent limitations in broader scenarios. First, redundant computation such as Bamboo [28], where each stage node in pipeline parallelism additionally stores the layers of its successor and performs gradient computation. When the successor fails, its predecessor can take over its tasks seamlessly. However, due to additional computation and memory overhead for each stage, the training throughput in the fault-free phase is reduced. Second, dynamic parallelism like Oobleck [12], which provides predefined pipeline templates. Upon failure, the system can switch to a template with fewer nodes, dynamically adjusting the parallelism to match the available training resources. Although it does not introduce additional overhead during the fault-free phase, it takes a long time to reconfigure the pipelines when a fault occurs, leading to a significant drop in throughput during the fault-handling phase. Finally, data rerouting is exemplified by Recycle [7], which reroutes the micro-batches assigned to failed nodes to peer nodes in the same pipeline stage but in different data parallel groups, without parameter reconfiguration. However, its training throughput in the post-recovery phase is related to pipeline bubbles, which can lead to inefficiencies if not managed properly. With an increasing number of failures, the degradation of performance becomes more pronounced.
Therefore, a comprehensive and general fault-tolerant system is urgently needed that is compatible with different training phases. However, achieving this goal requires overcoming several key challenges. (i) How to accurately model the performance of different fault-tolerant strategies across all phases to provide a theoretical basis for strategy selection? (ii) How to efficiently search for and determine the optimal fault-tolerant strategy and execution plan under various configurations and resource constraints? and (iii) How to further optimize communication or computation performance for different strategies and training phases to maximize overall training efficiency?
To address these challenges, Chameleon introduces a unified framework that integrates performance modeling, strategy selection, and communication optimization for fault-tolerant training. Specifically, Chameleon builds on a comprehensive performance model that covers all phases and strategies, leveraging both analytical modeling and profiling data to accurately estimate key metrics. Guided by these estimations, Chameleon efficiently searches for the optimal execution plan to maximize overall training throughput under given resource constraints. In addition, Chameleon incorporates targeted optimizations for model transfer during recovery and synchronization communication in data parallelism, further improving the efficiency of the selected strategy. With these enhancements, Chameleon can achieve efficient and stable fault-tolerant training.
The main contributions of Chameleon are as follows.
Performance Modeling. To determine the direction of performance optimization for fault-tolerant training, we first model the performance of different strategies across various training phases, thereby establishing the system’s optimization objective. In particular, the reconstruction overhead during the fault-handling phase and the training performance after recovery are key factors in our strategy selection. We also consider performance modeling in complex scenarios such as multiple failures and asymmetric parallelism, enabling performance evaluation under extreme conditions. Finally, we provide accurate estimations of per-step execution time and memory usage, offering a solid theoretical foundation for strategy search while effectively avoiding out-of-memory (OOM) issues.
Execution Plan Search. Before selecting a strategy, it is essential to ensure that each candidate strategy is configured for optimal performance. For different fault-tolerant strategies, factors such as parallelism degree, failed node distribution, model partitioning, and data allocation can significantly affect post-recovery training efficiency. To address this, Chameleon employs a heuristic search that comprehensively considers these factors while controlling search overhead, enabling rapid identification of feasible execution plans. Chameleon selects the optimal plan based on estimated performance metrics.
Communication Optimization. Changes of the execution plan not only affect pipeline computation but also inevitably impact communication. These impacts fall into two categories: (i) weight transfer communication during training reconstruction in dynamic parallelism; (ii) synchronization communication in data parallelism, which may be affected by asymmetric parallel execution. Both types of communication overhead are dynamic and, if not handled properly, can incur high costs. Therefore, we optimize both types of communication: the former is abstracted as a bipartite graph matching problem, and the latter as a graph coloring problem, achieving optimal communication performance with minimal overhead.
Chameleon was extensively evaluated in both simulated and real-world environments, utilizing a cluster of 32 Ascend 910B AI training accelerators (NPUs) [20]. Compared to fault-free training, Chameleon’s post-recovery performance maintained a gap of less than 11.00%. In simulations, Chameleon consistently outperformed baselines like Oobleck and Recycle, achieving an average throughput of 1.229 and 1.355 higher, respectively. A detailed evaluation of Chameleon’s key techniques shows that its estimator achieves an accuracy error within 8.02%. Optimized weight transfer reduces recovery time by up to 26.79%, while asymmetric communication optimization shortens step time by up to 15.44%. Memory, convergence, and scalability analyses show that Chameleon performs well in all three aspects.
II Background
In this section, we introduce the necessary background of distributed model training, state-of-the-art backup-free work for fault tolerance, and the limitations of these methods.
II-A Distributed Model Training
With the rapid growth of model sizes, distributed training leverages strategies including Data (DP) [8, 26, 34], Pipeline (PP) [27, 11, 16], Tensor (TP) [25], and Expert Parallelism (EP) [3, 24, 21]. DP involves splitting the dataset across multiple nodes, where each node processes a subset of the data and computes the gradients independently. PP divides the model into stages, with each stage running on a different node, while TP splits the model’s tensors across multiple nodes, allowing for parallel computation of large tensors. And EP, often used in the mixture of experts (MoE) models [6], assigns different experts across nodes, enabling the model to handle more complex tasks by activating only a subset of experts. These parallel strategies have different communication patterns, which determine their typical deployment scenarios. Specifically, TP and EP require frequent and fine-grained communication, making them more suitable for intra-node deployment [14]. In contrast, DP and PP are more suitable for inter-node scaling due to their lower communication overhead [19, 22]. In practice, these strategies can be organized properly to achieve hybrid parallelism [17, 18]. While TP and EP are crucial for scaling model capacity and throughput, their fine-grained communication and tight cross-device coupling make fault isolation and recovery particularly challenging. Consequently, most fault-tolerant systems [2, 28, 12, 7, 15] focus on DP and PP, utilizing their natural state partitioning for flexible fault isolation with minimal disruption.
II-B Fault Tolerance in Distributed Training
Long-running large-scale distributed training jobs are susceptible to various faults [10, 11, 32], such as hardware failures, network issues, and software bugs. For example, during the 54-day Llama 3.1 training [9], there were 419 unplanned interruptions caused by faults, averaging 3 hours between each interruption. Gemini [31] noted that during the OPT-175B model training on 992 A100 GPUs, approximately 110 faults occurred over two months. Minder [4] showed that when training tasks involved about 1,000 machines, approximately 2 faults occurred daily on average, with each fault potentially halting training for several hours. To mitigate the impact of faults, various fault-tolerance strategies have been proposed. A common approach is checkpointing [5, 13], which periodically saves the model state to disk, allowing recovery from the last checkpoint in case of interruptions. Another strategy to ensure rapid recovery is warm backup [33], which maintains a pool of idle nodes ready to take over tasks at any time. However, it incurs significant resource waste due to the low utilization of standby nodes. As a result, this paper focuses on the backup-free methods, which leverage inherent redundancy and dynamic reconfiguration within the active training cluster and are orthogonal to checkpoint- and backup-based solutions.
II-C Limitations of Backup-free Methods
In practice, state-of-the-art backup-free methods can be categorized into three main categories:
- •
Redundant computation. Bamboo [28] employs redundant computation within the training pipeline for fault tolerance. Each node maintains a replica of its successor and hides the computation overhead within the pipeline bubbles, enabling rapid recovery for a node failure.
- •
Dynamic parallelism means adjusting the parallelism to adapt to the available resources. Varuna [2] proposed adjusting the pipeline and data parallelism to minimize communication overhead and recover by continuously checkpointing the model states. Oobleck [12] introduced pipeline templates to run multiple replicas of the pipeline and reconstruct the lost stage from the redundant replicas.
- •
Data rerouting. Recycle [7] exploits the consistent parameters across different pipelines and reroutes the micro-batches on the failed node to its data-parallel peers. Further, it proposed a scheduling mechanism to allow rerouted micro-batches to execute within the inherent pipeline bubbles, minimizing the throughput degradation.
We analyze the performance impact of these backup-free methods throughout the training process, which includes three critical phases: (i) Fault-free phase. A fault-tolerance policy may also introduce additional overhead when no faults occur. (ii) Fault-handling phase. This phase typically includes searching for a new execution plan and restoring the training state. (iii) Post-recovery phase. After recovering from the failure, the model’s training efficiency may be degraded. The time and memory impact of different fault-tolerance policies varies. Redundant computation minimizes fault-handling time but introduces a prohibitive overhead (nearly double the computation and memory) during the fault-free phase. Dynamic parallelism methods avoid overhead during fault-free training. The primary drawback is the high reconstruction overhead during fault-handling, such as restarting from checkpoints or copying from live replicas, especially with frequent failures. Data rerouting avoids huge reconstruction overhead by rerouting micro-batches on the fly. However, its effectiveness relies on complex scheduling to hide additional computation within pipeline bubbles. If the bubbles are highly optimized (e.g., zero-bubble parallelism [23]), the performance loss after resuming training is not negligible. In summary, existing methods often only consider performance in certain phases and lack a comprehensive solution.
The challenge hinges on balancing immediate fault-handling overhead with post-recovery performance. Data rerouting is preferable for individual failures due to its low handling cost. However, as failures accumulate, training performance after recovery can become unsatisfactory. Conversely, while dynamic parallelism has a higher reconstruction cost, it may achieve better long-term throughput post-recovery by a new optimal parallel plan. Therefore, we consider selecting the optimal fault-tolerance strategy in real time based on specific failure scenarios to ensure efficient and stable training.
III Performance Modeling
In this section, we first model the performance of targeted fault-tolerance methods, including data rerouting and dynamic parallelism, and then determine the optimization goal. Based on these, we present our system Chameleon, which adaptively selects the most suitable fault-tolerance policy.
Definition 1** (State)**
The state of a training system, denoted as includes the following information:
- •
Cluster Status*: The number of nodes in the cluster.*
- •
Execution Plan*: (i) the fault-tolerance policy, which can be either dynamic parallelism or data rerouting; (ii) the parallel configuration, including the number of data parallel groups and the number of pipeline stages ; (iii) the data distribution among different data parallel groups, where the sum equals the global batch size; (iv) the model layer distribution across pipeline stages; (v) the distribution of failed nodes across the stages. Detailed information is provided in Section IV-A.*
Therefore, when a fault occurs, the system needs to switch from the current state to a new state . Generally, considering a series of faults or recovery operations, we can define the state transition as a sequence of states .
Considering the whole training process, the number of processed samples can be calculated as follows:
[TABLE]
where is the number of samples processed in state and is the time spent in state . Moreover, the execution time of the whole training process can be calculated as:
[TABLE]
where is the time spent in transitioning from state to .
Optimization Problem. Straightforwardly, we aim to maximize the throughput of the whole training process, which can be formulated as:
[TABLE]
However, it is impractical to optimize the whole training process, as it requires knowledge of the entire training process, including when the next fault will occur. Fortunately, we observe that the duration between two faults, i.e., , remains constant111Note that for different durations, e.g., , the duration can be different, but the duration between two given faults is constant. regardless of the execution plan, and changing the execution plan only affects the throughput of the current state and the transition time .
Therefore, the optimization goal can be simplified to maximizing the throughput of each duration between two faults, which can be expressed as:
[TABLE]
Note that for the first state , its throughput can be determined by the initial configuration, which has various work to discuss, such as Alpa [35], and we omit the discussion here.
For the number of processed samples , it can be calculated as follows:
[TABLE]
where is the throughput of state .
By substituting into the optimization goal, we can obtain:
[TABLE]
Considering the in state is calculated as:
[TABLE]
where is the given global batch size in state , and is the time for each step in state .
Therefore, we can further express the optimization goal as:
[TABLE]
which mainly consists of two parts: the throughput of the next state and the effective time ratio, which represents the proportion of time spent in state relative to the total time between two faults. Regarding the throughput, since the batch size is given by users, the key to maximizing the throughput is to minimize the step time , which represents the time for each training step (a.k.a., iteration). For the effective time ratio, since the total time between two faults is , which is constant, the key to maximizing the effective time ratio is to minimize the state transition time . We need to investigate the factors that affect the state transition time and the per-step execution time . First, the fault-tolerance strategy in the execution plan allows for a choice between data rerouting and dynamic parallelism. The transition time varies significantly between these strategies. For example, data rerouting does not require training reconstruction, so the transition time is negligible. In contrast, dynamic parallelism incurs considerable transition time due to weight transfer between nodes and restart overhead. The factors affecting are more complex, involving parallel configuration, as well as distribution of failed nodes, data, and model layers.
In summary, we can abstract the problem of fault-tolerant training as follows: Given the current state , how can we determine the next state (specifically, its execution plan ) such that both the state transition time and the per-step execution time are minimized? To address this, we design the Chameleon system, which adaptively finds the most suitable execution plan in response to failures. Specifically, Chameleon tackles the following challenges.
- •
Given , how do we determine the optimal ? (§IV-A) We employ a heuristic search that rapidly explores candidate execution plans considering multiple factors, including parallelism, data, and layer distribution. We select the optimal execution plan by evaluating the estimated and of each plan.
- •
How do we minimize the state transition time ? (§IV-B) During dynamic parallelism, includes the weight transfer overhead incurred by training reconstruction, which varies with the execution plan. We abstract this process as a bipartite matching problem to minimize transfer cost, thus reducing .
- •
How do we minimize the step time after recovery ? (§IV-B) The consists of both pipeline computation time and synchronization communication time. While the former can be estimated by the estimator, the latter, especially under asymmetric parallelism, can be optimized as a graph coloring problem.
- •
How do we estimate the step time after recovery ? (§IV-C) Chameleon estimates both step time and memory usage for each candidate plan. For the former, it predicts pipeline time by applying analytical formulas for symmetric pipelines (data rerouting) and dynamic programming for asymmetric pipelines (dynamic parallelism). For the latter, it identifies the peak memory of each pipeline stage based on layer-wise profiling, ensuring that no OOMs occur.
IV System Design
In this section, we will introduce the core techniques of Chameleon and explain how they enable fast recovery and efficient training. As shown in Figure 1, Chameleon consists of three main components: a profiler, a fault detector, and a decision center. The decision center integrates a planner, estimator, and restorer to manage the recovery process. The workflow is as follows: ① Monitoring: The profiler continuously collects runtime metrics such as execution time, HBM usage, and parallel configurations from the cluster. Meanwhile, the fault detector monitors the health of each node. ② Fault Trigger: Upon detecting a failure, the fault detector immediately triggers the decision center to initiate the fault-tolerance mechanism. ③ Decision Making: For each failure, the planner generates potential execution plans considering parallelism, layer, and batch configurations. Then it queries the estimator, which uses profiling data to predict the performance and memory usage of each plan. Meanwhile, the restorer is responsible for minimizing the overhead of transferring model weights and synchronization communication in asymmetric parallelism. Based on these, the planner selects the optimal execution plan. ④ Plan Execution: The planner sends the finalized execution plan to the cluster, which isolates the failed node and resumes training with minimal disruption.
IV-A Planner
In the planner, we need to determine the execution plan that achieves the best performance.
Execution Plan Search. If data rerouting is selected, considering that pipeline bubbles during training are usually well-optimized and the proportion is limited, the upper and lower bounds of performance do not differ significantly. Therefore, when determining the execution plan, we evenly distribute the micro-batches from the failed nodes to other data-parallel peers, aiming to overlap them with bubbles as much as possible. However, if dynamic parallelism is used, the performance differences between different execution plans can be substantial. As shown in Figure 2, we need to search for the optimal execution plan based on the following factors.
- •
DP/PP Parallelism. Unlike directly reducing one DP or PP degree, our goal is to utilize all available nodes as much as possible to maximize throughput. Thus, we provide a larger parallelism search space that supports asymmetric parallelism and is not restricted by predefined PP templates as in Oobleck. Based on our experience, the new DP degree often differs from the original by less than 2. We employ a heuristic search by limiting the parallelism range to reduce the search overhead.
- •
Batch Distribution. For load balancing of different DP groups and to avoid stragglers, we first pre-allocate micro-batches according to the proportion of nodes. If there are remaining unallocated micro-batches, we recursively assign them one by one to all pipelines. If some partitions have no data finally, we reassign one micro-batch from the largest partition to fill these. This check is repeated to ensure that no partition is left idle.
- •
Layer Distribution. For a given pipeline, special attention must be paid to load balancing between stages. Therefore, we first evenly split the model layers across all stages. For the remaining layers that cannot be evenly divided, we enumerate all possible allocation schemes. After filtering out options that may cause OOM via memory estimation, we select the splitting scheme with the lowest pipeline execution time by the estimator.
Algorithm 1 illustrates the steps to identify the optimal execution plan, with serving as entry. Lines 1-7 demonstrate the function to enumerate all candidate parallel strategies (), taking into account possible failure numbers and valid DP/PP combinations. Specifically, it recursively factorizes the current number of nodes to determine the possible degrees of DP/PP that meet the required range in . In the function, for each candidate strategy in , lines 13-14 determine the balanced distribution of micro-batches () and the optimal splitting of layers into stages (). Line 15 estimates the execution time of the candidate by . The function then tracks the plan () with the lowest estimated execution time and returns it in lines 16-19.
The Planner’s overhead involves three steps: (i) Parallel strategy search iterates through valid combinations based on the divisors of the node count . The complexity is , where denotes the divisor function. Even for large clusters (e.g., ), the number of divisors is small (e.g., ). (ii) The complexity of batch distribution is linear with respect to the number of data-parallel groups, . (iii) Due to the even-distribution-first policy, layer distribution is limited to allocating the remaining layers across stages. The complexity is bounded by the binomial coefficient . Since the pipeline depth is typically small (e.g., ) in practice, the number of candidates remains in the range of ten to hundreds. Furthermore, Chameleon pre-calculates and caches the optimal plans for potential failure scenarios (e.g., loss of 1 to nodes) during the fault-free training phase. When a fault occurs, the Restorer directly retrieves the pre-computed plan from the cache, making the decision-making latency negligible.
IV-B Restorer
There is significant room for optimization of and . For the former, it suffers from the communication overhead brought by weight transfer, while the latter may suffer from asymmetric synchronous communication.
Weight Transfer. To achieve higher training throughput, we prefer to use dynamic parallelism to rearrange the training nodes. However, the high overhead of model weight transmission can become a bottleneck in practical deployment. As shown in Figure 3, the original training uses a configuration of DP = 3 and PP = 3, with 9 layers of model weights evenly distributed in each stage. When a node failure occurs, the planner identifies an execution plan, i.e., DP = 2 and PP = 4, with the model layers distributed as (2,2,2,3). To migrate the layers from the original to the new configuration, how can we minimize the amount of data transferred? We can model this as a bipartite graph matching problem.
Taking the original group and the new group as an example, there may be different correspondences between the two, resulting in varying amounts of weight data that need to be transferred. Assuming the number of remaining nodes is , we can construct an cost matrix based on the different layer distributions, where represents the cost for node to migrate to the -th node under the new plan. For example, if the first node in corresponds to the first node in , the layers change from (1,2,3) to (1,2); layer 3 can be directly discarded without any additional data transfer, so the cost is 0. However, if this node corresponds to the second node in , the layers become (3,4); in addition to discarding layers 1 and 2, layer 4 must be transferred from another node, resulting in a cost of 1. In this way, we can further use the Kuhn-Munkres algorithm [Kuhn2012Kuhn–Munkres] to compute the migration scheme with the minimum total cost.
Asymmetric Communication. During dynamic parallelism, asymmetric parallelism may introduce additional communication overhead for . As shown in Figure 4, in symmetric parallel training, AllReduce communication between DP groups occurs for the same model layer. However, when a node failure causes the PP configuration in one DP group to change, the model layers are redistributed. To ensure communication between the same layers, we need to establish asymmetric communication domains, which mainly leads to two changes as shown in Figure 4: (i) the number of communications increases (e.g., from 4 AllReduce operations to 6); (ii) there are dependencies between multiple DP communications, causing originally parallel communications to be executed serially. For example, in the new comm2, layer 4 is distributed on the first node, so comm2 must wait for comm1 on that node to finish before it can proceed.
To minimize the total time for asymmetric communication, the key is to maximize the parallelism of DP communications that do not have dependencies. We model this problem as a graph coloring problem: each model layer is treated as a vertex, and if two layers are located on the same device, there is a communication dependency, represented by an edge. The goal is to assign different colors to adjacent vertices, using the minimum number of colors. Layers with the same color can perform DP communication in parallel, and the number of colors corresponds to the number of communication rounds required. We adopt a greedy algorithm to find the minimum number of communication rounds, with a time complexity of , where is the number of model layers.
Besides model weights, our Restorer also incorporates the transfer and reconstruction of optimizer states. For ZeRO parallelism, we only focus on scenarios where state redundancy exists; in such cases, lost optimizer states are recovered by directly transferring replicas from peer nodes in other healthy DP groups, thereby ensuring lossless precision.
IV-C Estimator
During the selection of fault-tolerance policy, especially given a specific execution plan, we need to estimate its performance in terms of time and memory overhead. In practice, its time should be minimized as much as possible, while its memory usage needs to stay within the hardware limit.
Time Estimator. Dynamic parallelism. When faced with failures, the system can adjust the data and pipeline parallelism sizes to better utilize available resources. Two scenarios exist: (i) Symmetric parallelism: In Varuna and Parcae, the new parallelism configuration requires the number of nodes to be a multiple of the new DP and PP sizes, ensuring an even distribution of data and model across nodes to maximize throughput. (ii) Asymmetric parallelism: Oobleck provides predefined pipeline templates, each specifying a different number of pipeline stages. In certain parallel configurations, multiple templates may coexist, resulting in asymmetric parallelism. For an irregular number of nodes, this approach can effectively utilize all available nodes.
Different parallelism strategies yield different . Take the 1F1B pipeline parallelism as an example, as shown in Figure 5(a), the training consists of pipelines, each pipelined into stages. In each iteration, every stage processes micro-batches, with forward and backward pass times per micro-batch denoted as and , respectively. The execution of pipelines shows different patterns under symmetric and asymmetric parallelism. For the former, the step time can be calculated as follows:
[TABLE]
However, for the latter, the step time is determined by the slowest pipeline due to the synchronous update across all pipelines. This relationship is formulated as:
[TABLE]
where represents the set of all pipelines.
In asymmetric parallelism, estimating the execution time of each pipeline is not straightforward. First, uneven data distribution across pipelines leads to different numbers of micro-batches for each pipeline. Meanwhile, unbalanced layer allocation among stages within the same pipeline directly affects the forward and backward times of each stage. As illustrated in Figure 5(b), asymmetric pipeline parallelism leads to varying forward and backward computation times across stages, resulting in numerous pipeline bubbles and making estimation challenging. We observe that the start time of the -th computation in the -th stage () depends on both the end time of the previous computation in the same stage () and the end time of its depended computation (often has the same micro-batch index) in the previous stage (). Therefore, we propose a dynamic programming algorithm to simulate the pipeline time . The general transition function during the ready phase is defined below.
[TABLE]
where denotes a mapping from the current stage index to the previous stage index depended by the current computation, for example, if the current computation is forward, , and if it is backward, ; and denotes a mapping from the current computation index to the depended computation index in the previous stage, which can be profiled from the actual execution order.
For dynamic parallelism, the state transition introduces non-negligible time overhead . This overhead consists of three parts: (i) the cost of searching for the optimal execution plan; (ii) the cost of transferring model weights; and (iii) the overhead of restarting training. It is important to note that the search can be performed in advance and overlapped with training time, while the restart overhead depends only on the training scale. However, the weight transfer cost varies with the execution plan and is difficult to predict. Detailed optimization of transmission is discussed in Section IV-B.
Data rerouting. For this policy, since no reconfiguration of the training is required, the state transition time can be considered negligible. The main time overhead occurs during post-recovery training, as some nodes take on additional computation tasks of the failed nodes, resulting in a longer per-step execution time . Recycle optimizes through techniques such as decoupled backward propagation and staggered optimization, reducing rerouting latency via sophisticated scheduling. To simplify the analysis, we assume that the computational tasks of the failed nodes are evenly distributed among the remaining functional nodes in the same DP group when a failure occurs. The per-step execution time of the 1F1B pipeline in the data rerouting under a single failure can be calculated as follows.
[TABLE]
However, it is worth noting that as the number of failed nodes increases, the per-step time after recovery also increases. Specifically, the per-step time with failed nodes is related to the distribution of failed nodes across the pipeline stages. And its calculation is shown as follows.
[TABLE]
where is the number of failed nodes in the -th stage, and the sum of is equal to the total number of failed nodes . But if any is larger than , the training cannot be recovered, and we must switch to dynamic parallelism.
Memory Estimator. To evaluate whether a given model layer splitting is appropriate, we need to determine whether OOM occurs during training. In practice, we observe that the peak memory of the -th stage appears during the steady-state phase of the pipeline. We can decompose the peak memory for the -th pipeline stage, , into two primary components: static memory, which includes memory for components whose size is relatively fixed during an iteration, such as model weights and optimizer states; and dynamic memory, which consists of the stored activations from the forward pass. Its peak depends on both the number of layers in the stage and the stage’s position in the pipeline. The final formula is as follows.
[TABLE]
where is the number of layers in the -th stage, is the average memory for the parameters of a single layer, is the average memory for the optimizer states of a single layer, is the average memory for the gradients of a single layer (), and is the average memory for the activations produced by a single micro-batch for a single layer. All of the last four parameters can be estimated by profiling.
Computational Complexity. (i) Time Estimator. For symmetric parallelism and Data Rerouting, we employ analytical formulas with complexity. For asymmetric parallelism, the complexity of the dynamic programming method is . (ii) Memory Estimator only calculates the peak memory usage from profiling, and its complexity is .
V Evaluation
In this section, we present the evaluation of our fault-tolerant system Chameleon. We will detail the experimental setup, including the cluster setup, baselines, and workloads. Following this, we will present the results obtained from our experiments, including real-world and simulation results, as well as an ablation study to understand the impacts of the core techniques of Chameleon. Finally, we will analyze the memory usage and convergence of Chameleon.
V-A Experimental Setup
We evaluated Chameleon with the following setup.
Cluster Setup. In real-world experiments, we used a cluster consisting of 32 Ascend 910B AI training accelerators [20, wróblewski2025parallelscanascendai]. The cluster is composed of 4 nodes, each equipped with 8 NPUs, and each NPU has 64 GB of memory. Note that Chameleon’s core mechanisms—execution plan search, weight transfer, and communication scheduling—are algorithmic abstractions independent of the underlying hardware. Thus, the effectiveness demonstrated on Ascend is generalizable to other platforms, such as NVIDIA GPUs. To compare the efficiency of Chameleon with other systems, we built an event-driven simulator, with details provided in Section V-B.
Baselines. Chameleon is implemented based on PyTorch 2.1.0 and Ascend chips. We compare the performance of Chameleon with the original training in the real-world cluster. Then, we include two state-of-the-art baselines, dynamic parallelism exemplified by Oobleck and data rerouting exemplified by ReCycle, in the simulator for comparison, as they are implemented on different hardware. We exclude redundant computation methods like Bamboo and traditional checkpointing as baselines, as recent literature has demonstrated their prohibitive overhead compared to these backup-free approaches [7].
Workloads. We used the 7B Llama-2 model [29] and the WikiText dataset for training. In real-world experiments, we conduct tests on 8, 16, and 32 NPUs, with parallel configurations (PP, DP, TP) including (4, 2, 1), (4, 4, 1), and (2, 2, 8). The batch size ranges from 16 to 64, while the micro-batch size is always set to 1. To ensure that Chameleon can maintain stable fault-tolerant training over long periods, the total training time lasted up to 9 hours.
V-B Experimental Results
The real-world results demonstrate the practical performance of Chameleon in the Ascend cluster, while the simulation results provide insights into its comparative performance against other systems.
Real-World Results. Figure 9 shows the performance comparison between the original training (8, 16, and 32 NPUs) and the training after failure recovery (7, 15, and 24 NPUs). After recovery, Chameleon achieves 96.22%, 89.00%, and 74.38% of the original training performance, respectively. Notably, for the 32-NPU case (TP=8), a single fault leads to the loss of 8 NPUs, resulting in a larger performance gap compared to the original. However, compared to the theoretical maximum for 24 NPUs, Chameleon reaches 99.17% of the performance, demonstrating its ability to fully utilize available resources.
Simulation Results. In simulation experiments, we evaluated Chameleon, ReCycle (data rerouting), and Oobleck (dynamic parallelism). By randomly injecting failures at a specified NPU failure rate (10% per hour), we simulated the operation of all three systems for 9 hours of training with 32 NPUs. Figure 9 shows the number of active nodes over time, while Figure 9 presents the training throughput over time. The results show that Chameleon consistently outperforms both Oobleck and ReCycle, achieving an average throughput that is 1.229 and 1.355 higher, respectively. While Oobleck exhibits significant throughput fluctuations, ReCycle’s performance steadily degrades as the number of failures increases.
V-C Performance Breakdown
To evaluate the efficiency and necessity of the core techniques in Chameleon, we performed the following tests.
Estimation Accuracy. To verify the accuracy of the estimator, we compared the estimated time with the actual training time before and after recovery in real-world experiments under different configurations. As shown in Figure 9, the estimation error is always kept within 8.02%, demonstrating the effectiveness of our estimation mechanism.
Weight Transfer. To understand the impact of weight transfer optimization on training performance, we conducted an ablation study comparing training performance with and without optimization. Taking single-node 8-card training as an example (DP=4, PP=2, TP=1), as shown in Figure 11, when the number of layers is relatively small (e.g., 4 or 8), the weight transfer optimization does not yield significant performance improvements. However, as the number of layers increases to 16, the transfer time during recovery is reduced by 32.35%. Correspondingly, the total recovery time decreases from 11.2 s to 8.2 s, representing a reduction of 26.79%, thereby enhancing overall training efficiency.
Asymmetric Communication. We conducted an ablation study to evaluate the impact of our asymmetric communication optimization on training performance. As shown in Figure 11, in the same 8-card training, the batch size varies from 16 to 64, asymmetric communication optimization significantly reduces the overhead of AllReduce and other synchronizations, maintaining a reduction rate of 35.35%. Although training performance decreases to some extent compared to fault-free scenarios, step time decreases by 11.44%, 15.44%, and 5.52%, respectively, compared to cases without optimization. Notably, when the batch size is small, the optimization effect is more pronounced due to the relatively high communication ratio.
V-D Memory Analysis
To ensure that Chameleon can handle large-scale model training without running into OOM issues, we conducted a memory usage analysis. Taking single-node 8-card training as an example, when an NPU failure occurs, the parallelism changes from symmetric parallelism (DP=4, PP=2) to asymmetric pipelines with length [2, 2, 3]. As shown in Figure 13, the peak memory per NPU in Chameleon does not increase; in fact, it decreases in some cases and remains well below the device memory limit (64GB). This is due to changes in DP and the possible increase in the number of nodes per pipeline, resulting in fewer layers assigned to each node. These results demonstrate that Chameleon can effectively manage memory usage during fault-tolerant training without OOMs.
V-E Convergence Analysis
. To verify whether Chameleon will affect the model convergence, we conducted a convergence analysis on 32 Ascend NPUs by comparing our training loss with the original training. As shown in Figure 13, during training for up to 3500 steps, the training loss of Chameleon remains almost identical to the original training, and both ultimately reach the convergence criterion ( 0.1). This indicates that Chameleon does not negatively impact model convergence.
V-F Scalability Analysis
To validate our theoretical scalability analysis, we measured the decision-making overhead across varying cluster sizes in simulation. For small to medium-scale clusters (up to 256 cards), the search latency is negligible (s). Even at large scales, the overhead remains acceptable (19.25s for 1024 cards and 40.48s for 2048 cards), while this search process can be pre-computed asynchronously.
VI Conclusion
This paper presents Chameleon, a novel fault-tolerance system for distributed training that adaptively selects recovery policies based on real-time system states. The core components of Chameleon —the Planner, Estimator, and Restorer—enable fast and comprehensive policy search, accurate performance modeling and estimation, and low-overhead communication optimization. These features allow Chameleon to employ the optimal execution plan, achieving more robust and efficient training compared to single-strategy approaches. Our evaluations in both simulated and real-world environments demonstrate that Chameleon significantly improves training efficiency and fault tolerance, while ensuring good memory management, model convergence, and scalability.
Acknowledgment
This work was supported by the Key Program of the National Natural Science Foundation of China under Grant Nos. 62325205 and 62502198, the Natural Science Foundation of Jiangsu Province under Grant Nos. BK20243053 and BK20251224, the Nanjing “U35” Talent Cultivation Program (No. U (2024) 001), and the Nanjing University-China Mobile Communications Group Co., Ltd. Joint Institute.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report . ar Xiv preprint ar Xiv:2303.08774 . Cited by: §I .
- 2[2] S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra (2022) Varuna: scalable, low-cost training of massive deep learning models . In Proceedings of the Seventeenth European Conference on Computer Systems , pp. 472–487 . Cited by: 2nd item , § II-A .
- 3[3] W. Cai, J. Jiang, L. Qin, et al. (2025) Shortcut-connected expert parallelism for accelerating mixture-of-experts . External Links: 2404.05019 , Link Cited by: § II-A .
- 4[4] Y. Deng, X. Shi, Z. Jiang, X. Zhang, et al. (2025-04) Minder: faulty machine detection for large-scale distributed model training . In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25) , Philadelphia, PA , pp. 505–521 . External Links: ISBN 978-1-939133-46-5 , Link Cited by: § II-B .
- 5[5] A. Eisenman, K. K. Matam, S. Ingram, et al. (2022-04) Check-N-Run: a checkpointing system for training deep learning recommendation models . In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) , Renton, WA , pp. 929–943 . External Links: ISBN 978-1-939133-27-4 , Link Cited by: §I , § II-B .
- 6[6] W. Fedus, B. Zoph, and N. Shazeer (2022-01) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity . J. Mach. Learn. Res. 23 ( 1 ). External Links: ISSN 1532-4435 Cited by: § II-A .
- 7[7] S. Gandhi, M. Zhao, A. Skiadopoulos, and C. Kozyrakis (2024) Re Cycle: resilient training of large dnns using pipeline adaptation . In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles , pp. 211–228 . Cited by: §I , §I , 3rd item , § II-A , § V-A .
- 8[8] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, et al. (2018) Accurate, large minibatch sgd: training imagenet in 1 hour . External Links: 1706.02677 , Link Cited by: § II-A .
