When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling
Yuanhang Li

TL;DR
This paper investigates the effects of adaptive reward design in deep reinforcement learning for satellite scheduling, revealing a stability dilemma and introducing causal probing to understand reward influence.
Contribution
It uncovers the switching-stability dilemma in reward adaptation and introduces a causal probing method to analyze reward term impacts in LLM-guided DRL.
Findings
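Near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3 ± 96.8 Mbps) because PPO needs a quasi-stationary reward signal for its value function to converge.
Probing reveals that a +20% increase in the switching penalty yields +157 Mbps for polar handover.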
MLP-based models outperform LLM fine-tuning in known and novel regimes.
Abstract
Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3 ± 96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by ±20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for…
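The single-variable probing described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's code. The reward term names, the `make_probes` and `probe_effects` helpers, and the `toy_evaluate` function are all hypothetical; in the paper's setting the evaluation step would presumably be a 50k-step PPO run reporting throughput in Mbps, which is replaced here by a cheap stand-in so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]


def make_probes(base: Weights, delta: float = 0.2) -> List[Tuple[str, float, Weights]]:
    """Build one probe per (term, direction): scale a single weight by (1 +/- delta)."""
    probes = []
    for term in base:
        for sign in (1.0, -1.0):
            perturbed = dict(base)  # copy so every other term keeps its baseline value
            perturbed[term] = base[term] * (1.0 + sign * delta)
            probes.append((term, sign * delta, perturbed))
    return probes


def probe_effects(
    base: Weights,
    evaluate: Callable[[Weights], float],
    delta: float = 0.2,
) -> Dict[Tuple[str, float], float]:
    """Return the change in the evaluation metric relative to the unperturbed baseline."""
    baseline = evaluate(base)
    return {
        (term, change): evaluate(weights) - baseline
        for term, change, weights in make_probes(base, delta)
    }


def toy_evaluate(weights: Weights) -> float:
    """Hypothetical stand-in for 'train PPO for 50k steps, report Mbps'.

    The linear form and its coefficients are invented purely for illustration.
    """
    return 100.0 + 50.0 * weights["switching_penalty"] - 10.0 * weights["fairness"]


# Hypothetical baseline weights; the paper's actual reward terms are not listed here.
BASE_WEIGHTS: Weights = {"throughput": 1.0, "switching_penalty": 0.5, "fairness": 0.2}
EFFECTS = probe_effects(BASE_WEIGHTS, toy_evaluate)
```

Because each probe changes exactly one weight, any difference from the baseline can be attributed to that term alone, which is what makes the comparison interpretable. With a real PPO run in place of `toy_evaluate`, each probe would need repeated seeds before differences of this kind could be trusted.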
