Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-Relative Perturbation
Hongyu Cao, Jinghan Zhang, Kunpeng Liu, Dongjie Wang, Feng Xia, Haifeng Chen, Xiaohua Hu, Yanjie Fu

TL;DR
Sim2Act is a novel framework that enhances simulation-to-decision learning by calibrating simulator errors and stabilizing policies against uncertainties, leading to more reliable decision-making in critical applications.
Contribution
It introduces adversarial calibration and group-relative perturbation strategies to improve robustness in simulation-to-decision learning, addressing both simulator and policy uncertainties.
Findings
Improved robustness in supply chain benchmarks.
More stable decision performance under perturbations.
Enhanced alignment of simulation errors with decision impact.
Abstract
Simulation-to-decision learning enables safe policy training in digital environments without risking real-world deployment, and has become essential in mission-critical domains such as supply chains and industrial systems. However, simulators learned from noisy or biased real-world data often exhibit prediction errors in decision-critical regions, leading to unstable action ranking and unreliable policies. Existing approaches either focus on improving average simulation fidelity or adopt conservative regularization, which may cause policy collapse by discarding high-risk high-reward actions. We propose Sim2Act, a robust simulation-to-decision framework that addresses both simulator and policy robustness. First, we introduce an adversarial calibration mechanism that re-weights simulation errors in decision-critical state-action pairs to align surrogate fidelity with downstream decision…
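To make the re-weighting idea concrete, here is a minimal sketch of a simulation loss in which errors on decision-critical samples count for more than errors elsewhere. It is an illustration under stated assumptions, not the paper's calibration procedure: the function name `decision_weighted_error`, the per-sample `criticality` scores, and the softmax weighting with a `temperature` parameter are all hypothetical choices, and the adversarial part of the mechanism (how the weights are learned or chosen) is not modeled.

```python
import math


def decision_weighted_error(pred, target, criticality, temperature=1.0):
    """Weighted squared error where weights grow with a criticality score.

    Assumption: `criticality` is a hypothetical per-sample score saying how
    much a prediction error at that state-action pair would affect the
    downstream decision. Softmax weighting is an illustrative choice only.
    """
    if not pred or not (len(pred) == len(target) == len(criticality)):
        raise ValueError("inputs must be non-empty and of equal length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Numerically stable softmax over the criticality scores.
    top = max(criticality)
    exps = [math.exp((c - top) / temperature) for c in criticality]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of squared errors; the weights sum to 1.
    return sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))


# Uniform criticality reduces to the ordinary mean squared error.
uniform = decision_weighted_error([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
# Marking the second sample as critical pulls the loss toward its error.
focused = decision_weighted_error([1.0, 2.0], [0.0, 0.0], [0.0, 5.0])
```

With uniform scores the function behaves like an average-fidelity objective; raising the score of one sample shifts the loss toward that sample's error, which is the contrast the abstract draws between average simulation fidelity and decision-aligned fidelity.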
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- Decision-focused calibration is well motivated: it aims at reducing errors that matter for downstream decisions rather than global fit.
- The group-relative objective is simple and critic-free, echoing GRPO ideas where a group average serves as the baseline.
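The "group average as baseline" idea mentioned by the reviewer can be sketched in a few lines. This is a generic illustration of a group-relative advantage, assuming each entry of `returns` is the return of one rollout in a sampled group (for example, one per perturbed simulator); the function name and the optional standard-deviation normalization are assumptions for illustration, and the paper's exact objective is not reproduced on this page.

```python
def group_relative_advantages(returns, normalize=False, eps=1e-8):
    """Advantage of each rollout relative to the mean return of its group.

    No learned critic is needed: the group mean plays the role of the
    baseline. `normalize=True` additionally divides by the group standard
    deviation (an optional, assumed variant).
    """
    n = len(returns)
    if n == 0:
        raise ValueError("returns must be non-empty")

    baseline = sum(returns) / n
    advantages = [r - baseline for r in returns]

    if normalize:
        std = (sum(a * a for a in advantages) / n) ** 0.5
        advantages = [a / (std + eps) for a in advantages]
    return advantages


# Three rollouts in one group: below, at, and above the group mean.
adv = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the advantages in a group sum to zero, a policy-gradient update built on them rewards rollouts only for doing better than their peers under the same set of perturbations.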
- Positioning vs robust RL is loose. The method is not a two-player min–max robust MDP or RARL; it optimizes relative performance over a sampled ensemble. Direct comparisons to robust MDPs and RARL/EPOpt are missing.
- Uncertainty modeling is narrow: latent Gaussian noise from the encoder. There is no analysis of distributional misspecification (heavy tails, multimodality) or sensitivity to the number/scale of perturbations.
- Robustness baselines and metrics under true worst-case or tail risk (C
The paper addresses an important problem of robustness in Sim2Dec learning, which is crucial for digital twin applications and model-based RL under noisy or biased environments. The idea of emphasizing simulator errors in decision-critical regions is intuitively meaningful and could inspire future work in coupling model fidelity and policy robustness. The paper includes both synthetic and real-world experiments (DataCo, GlobalStore, and OAS), which help demonstrate practical applicability.
Bridging the gap between simulation accuracy and decision robustness is important for digital twin applications. However, the novelty and empirical strength of the proposed approach appear limited. The adversarial calibration component, while conceptually sound, resembles the mechanism used in Sim2Dec, and the distinction between the two frameworks is not clearly articulated. Section 3.2 in particular reads as a close variant of Sim2Dec’s adversarial training procedure. From the experimental res
1. The paper grounds the Sim2Act framework in practical, high-stakes domains such as supply chains, power grids, and robotics, where inherent noise, uncertainty, and the cost/risk of real-world interaction pose significant challenges. This clear and realistic context enhances the work's credibility and applied relevance.
2. The experimental validation is robust and well-structured, utilizing three distinct real-world supply chain datasets: DataCo, GlobalStore, and OAS. The results consistently
1. Limited Experimental Scope: The paper claims broad applicability to high-stakes domains like robotics and power grids, but all experiments are confined to discrete-action, logistics-focused supply chain datasets. The generalizability of the method to complex, high-dimensional continuous control problems or systems with non-stationary dynamics remains unproven.
2. Lack of Qualitative Policy Analysis: While the quantitative results are strong, the evaluation lacks depth in analyzing policy beh
- The paper provides an interesting insight into group-relative advantage and its role in robustness and stabilizing policy gradient updates.
- The latent-space perturbations used in the adversarial simulator and group-relative perturbations are an interesting idea. It seems reasonable that this way of viewing perturbations would produce more semantically meaningful perturbations with respect to the task.
#### Notation
- The paper appears to use reinforcement learning (RL) terminology and ideas (i.e., advantage, state/action distributions), but does not define or discuss them explicitly. Doing so (for instance, by discussing the underlying MDP of the simulator and dataset) would likely yield more thorough theoretical results and greatly aid in mapping this work to prior work.
#### Experiments
- The table captions should be more descriptive. Particularly, the metrics in Table 1 are not explained.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Simulation Techniques and Applications
