VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study
Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

TL;DR
This paper introduces Verbalized Action Masking (VAM), a method for improving exploration in reinforcement learning (RL) post-training of large language models. VAM states the action mask in the prompt and requires the model to output an action from that mask; it is demonstrated through a chess case study.
Contribution
VAM enables controllable exploration by verbalizing action masks, and iterative pruning enhances learning efficiency and performance in chess RL tasks.
Findings
VAM improves learning efficiency in chess RL.
VAM enhances final performance over strong baselines.
Verbalized masking is practical for controllable exploration.
Abstract
Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across…
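The iterative action-space pruning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording in `verbalize_mask`, the `sample_fn` callable (a stand-in for sampling a move from the LLM), the `samples_per_round` and `max_rounds` parameters, and the `collapsed_policy` toy sampler are all hypothetical names introduced here. The abstract does not say how invalid outputs are handled; this sketch simply ignores them when pruning.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a sampler that takes a prompt and the current
# candidate set and returns one action string (stand-in for the LLM).
SampleFn = Callable[[str, Sequence[str]], str]


def verbalize_mask(state: str, candidates: Sequence[str]) -> str:
    """Render the action mask as plain text in the prompt (assumed wording)."""
    return (
        f"Position: {state}\n"
        f"Choose exactly one move from: {', '.join(candidates)}"
    )


def iterative_pruning(
    state: str,
    legal_actions: Sequence[str],
    target: str,
    sample_fn: SampleFn,
    samples_per_round: int = 4,
    max_rounds: int = 3,
) -> Tuple[bool, List[List[str]], List[str]]:
    """Resample under a shrinking mask until the target is sampled
    or the round budget is exhausted.

    Returns (target_found, samples_per_round_history, final_candidates).
    """
    candidates = list(legal_actions)
    history: List[List[str]] = []
    for _ in range(max_rounds):
        prompt = verbalize_mask(state, candidates)
        sampled = [sample_fn(prompt, candidates) for _ in range(samples_per_round)]
        history.append(sampled)
        if target in sampled:
            return True, history, candidates
        # Remove only the valid sampled actions from the mask; outputs
        # outside the mask are ignored here (an assumption of this sketch).
        tried = {a for a in sampled if a in candidates}
        candidates = [a for a in candidates if a not in tried]
        if not candidates:
            break
    return False, history, candidates


def collapsed_policy(prompt: str, candidates: Sequence[str]) -> str:
    """Toy sampler that always picks the first listed move, mimicking a
    policy that has collapsed onto one repetitive behavior."""
    return candidates[0]


if __name__ == "__main__":
    found, history, remaining = iterative_pruning(
        state="startpos",
        legal_actions=["e4", "d4", "Nf3", "c4"],
        target="Nf3",
        sample_fn=collapsed_policy,
        samples_per_round=1,
        max_rounds=3,
    )
    print(found, history, remaining)
```

With the toy sampler, each round removes the move the collapsed policy keeps repeating, so the mask shrinks until the target is the first remaining candidate. A real setup would replace `collapsed_policy` with stochastic sampling from the model and would pass the collected samples on to the RL update.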
Taxonomy
Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Robot Manipulation and Learning
