From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge
Xiefeng Wu

TL;DR
This paper introduces Q-shaping, a method that uses large language models to initialize Q-values, significantly improving sample efficiency and outperforming reward shaping methods across various environments.
Contribution
The paper proposes Q-shaping as a novel, unbiased approach to incorporate domain knowledge via LLM-guided Q-value initialization, enhancing reinforcement learning performance.
Findings
Q-shaping improves sample efficiency by 16.87% over the best baseline in each environment.
Q-shaping achieves a 253.80% improvement over LLM-based reward shaping methods.
Q-shaping is general and robust across tasks, and it guarantees optimality.
Abstract
Q-shaping is an extension of Q-value initialization and serves as an alternative to reward shaping for incorporating domain knowledge to accelerate agent training, thereby improving sample efficiency by directly shaping Q-values. This approach is both general and robust across diverse tasks, allowing for immediate impact assessment while guaranteeing optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results demonstrate that Q-shaping significantly enhances sample efficiency, achieving a 16.87% improvement over the best baseline in each environment and a 253.80% improvement compared to LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.
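To make the idea of shaping Q-values directly more concrete, here is a minimal sketch of the building block the abstract extends: tabular Q-value initialization from a heuristic, followed by the ordinary Q-learning update. It is not the paper's algorithm, whose details are not given on this page. The function names, the toy four-state setup, and the hand-written prefer_right heuristic (standing in for an LLM-provided heuristic) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def init_q_table(
    states: Iterable[Hashable],
    actions: Tuple[Hashable, ...],
    heuristic: Optional[Callable[[Hashable, Hashable], float]] = None,
) -> QTable:
    """Build a Q-table; entries come from the heuristic if given, else 0.0."""
    return {
        (s, a): (heuristic(s, a) if heuristic is not None else 0.0)
        for s in states
        for a in actions
    }


def q_update(q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9, terminal=False):
    """Standard Q-learning update; the environment reward is used unmodified."""
    bootstrap = 0.0 if terminal else max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (reward + gamma * bootstrap - q[(s, a)])
    return q[(s, a)]


def greedy_action(q, s, actions):
    """Return the action with the highest Q-value (first one wins on ties)."""
    return max(actions, key=lambda a: q[(s, a)])


# Hypothetical heuristic standing in for LLM guidance: "moving right looks good".
def prefer_right(state, action):
    return 0.5 if action == "right" else 0.0


actions = ("left", "right")
q_plain = init_q_table(range(4), actions)                 # zero initialization
q_guided = init_q_table(range(4), actions, prefer_right)  # heuristic initialization

# The heuristic changes behavior before any experience is collected.
first_plain = greedy_action(q_plain, 0, actions)    # tie, so the first action: "left"
first_guided = greedy_action(q_guided, 0, actions)  # "right"

# One update with zero reward: experience starts correcting the initial guess.
new_value = q_update(q_guided, 0, "right", reward=0.0, s_next=1, actions=actions)
```

In this sketch the heuristic only sets the starting point of the table, and the update rule and the reward are left untouched. That is the general reason initialization-based guidance can leave the learning target unchanged, whereas reward shaping alters the reward signal itself.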
Taxonomy
Topics: Artificial Intelligence in Law
