Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
Yanning Dai, Yuhui Wang, Dylan R. Ashley, Jürgen Schmidhuber

TL;DR
This paper introduces Stackelberg PPO, a novel co-design method that models morphology-control coupling as a Stackelberg game, leading to more stable and efficient optimization in robotics design.
Contribution
It presents a new game-theoretic approach for morphology-control co-design, explicitly modeling control adaptation dynamics to improve optimization stability and efficiency.
Findings
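Stackelberg PPO optimizes more stably than standard PPO-based co-design.
The method achieves higher final performance; one review quotes the authors' reported gains as "+22.1% average, +31.9% on 3D".
It makes co-design optimization more efficient by aligning morphology updates with the control policy's adaptation.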
Abstract
Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control's adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control's adaptation dynamics into morphology optimization. By modeling this intrinsic…
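The bi-level point in the abstract can be illustrated with a minimal sketch on a toy quadratic game. This is not the paper's algorithm: the scalar `m` stands in for morphology, `c` for control, and the best response `c = a * m`, the leader objective, and all function names are assumptions made up for illustration. The single-level update treats `c` as fixed when differentiating with respect to `m`, while the Stackelberg update adds the chain-rule term through the follower's best response.

```python
def follower_best_response(m, a=1.0):
    # Hypothetical follower: maximizes -(c - a*m)**2, so c*(m) = a*m.
    return a * m


def leader_objective(m, c):
    # Hypothetical leader objective (to be maximized); depends on both m and c.
    return -(m - 1.0) ** 2 - (c - 2.0) ** 2


def run(use_total_derivative, steps=200, lr=0.1, a=1.0):
    """Gradient ascent on m, with the follower re-solved at every step."""
    m = 0.0
    for _ in range(steps):
        c = follower_best_response(m, a)
        dJ_dm = -2.0 * (m - 1.0)   # partial derivative, c held fixed
        dJ_dc = -2.0 * (c - 2.0)   # sensitivity of the leader objective to c
        grad = dJ_dm
        if use_total_derivative:
            # Stackelberg (total) derivative: add dJ/dc * dc*/dm, with dc*/dm = a.
            grad += dJ_dc * a
        m += lr * grad
    c = follower_best_response(m, a)
    return m, c, leader_objective(m, c)


if __name__ == "__main__":
    for label, flag in [("single-level", False), ("stackelberg", True)]:
        m, c, J = run(flag)
        print(f"{label}: m={m:.3f} c={c:.3f} J={J:.3f}")
```

In this toy problem the single-level update settles at `m = 1` with leader objective `-1`, while the update that accounts for the follower's response settles at `m = 1.5` with objective `-0.5`. The gap exists only because the follower's response depends on `m`, which is the misalignment the abstract describes.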
Peer Reviews
Decision: ICLR 2026 Poster
1. The formulation of morphology–control co-design as a phase-separated Stackelberg game is intuitive and matches the causal structure between design and control.
2. The paper gives a clear algorithmic pipeline that can be implemented with existing PPO infrastructure.
3. Experimental results are consistent across several benchmarks, and the ablations are reasonably detailed.

1. The technical novelty is modest, since the likelihood-ratio surrogate, natural gradient, and PPO clipping are all existing techniques; the contribution lies mostly in integrating them coherently.
2. The cross-derivative estimator in Eq. (6) seems to have very high variance.
3. The baselines do not include prior Stackelberg RL algorithms (e.g., Stackelberg Actor-Critic), so it is unclear whether the improvement arises from the Stackelberg formulation itself.
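For readers unfamiliar with the PPO clipping this review refers to, the standard clipped surrogate can be sketched as follows. This is the generic PPO objective, not the paper's leader update; the function name and the sample numbers are illustrative assumptions.

```python
def ppo_clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios:     new-policy / old-policy probability ratios, one per sample
    advantages: advantage estimates, one per sample
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push r outside the clip range.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # Illustrative values only: one ratio above the range, one below, one inside.
    value = ppo_clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
    print(f"clipped surrogate: {value:.3f}")
```

With a positive advantage the ratio's contribution is capped at `1 + clip_eps`, and with a negative advantage it is floored at `1 - clip_eps`, which is what keeps each policy update close to the data-collecting policy.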
- This paper proposes a novel formulation of the morphology co-design problem, where the gradient from the morphology design phase is not differentiable. The Stackelberg game formulation is interesting and well-motivated. This avoids ad-hoc joint updates and gives a clean bilevel control–morphology structure.
- The derivation for a Phase-Separated Stackelberg Markov Game’s **likelihood-ratio surrogate** with a **Fisher preconditioner** creates a tractable gradient signal for the leader witho

- Most evaluations reduce to forward-velocity rewards on stylized locomotion creatures; even the new stair tasks keep the same simple progress reward. This makes it hard to assess whether Stackelberg coupling helps with *real* co-design constraints (payload, torque limits, manufacturability, sensor placement, power, robustness).
- The λ/Fisher ablations are informative, but I’d like visibility into **data efficiency**: how many follower steps are actually saved vs PPO co-design? Since the thesis

1. **Clear problem framing for non-differentiable interfaces.** The phase-separated SMG formalization precisely matches co-design realities (discrete edits then control), addressing why prior Stackelberg methods fail here. Treating co-design as leader–follower nicely explains why joint training can wobble or fail.
2. **Technically neat gradient path.** The trajectory likelihood-ratio surrogate for the cross-derivative (Theorem 1) avoids differentiating through morphology transitions and yields

1. There are concerns about the experiment outlines:
   - Prior Stackelberg RL (e.g., Stackelberg actor-critic / policy-gradient) is acknowledged, but there’s no controlled ablation that swaps in those estimators under the same phase-separated setting to isolate what PPO-clipping contributes vs. the SID term itself.
   - Only four seeds are used “due to cost”; yet the authors claim “+22.1% average, +31.9% on 3D”, which needs more statistical significance. These environments are not that complicated to
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · 3D Shape Modeling and Analysis
