Reinforcing Language Agents via Policy Optimization with Action Decomposition
Muning Wen, Ziyu Wan, Weinan Zhang, Jun Wang, Ying Wen

TL;DR
This paper introduces Policy Optimization with Action Decomposition (POAD), a method that decomposes language agent optimization from the action level to the token level, improving credit assignment, learning efficiency, and generalization in complex environments.
Contribution
It proposes a token-level decomposition of language agent optimization built on the BAD Bellman backup, which assigns credit to individual intra-action tokens rather than to whole actions, as prior action-level methods do.
Findings
POAD outperforms existing methods in diverse testbeds.
Finer credit assignment improves learning efficiency.
Theoretical analysis confirms the effectiveness of token-level optimization.
Abstract
Language models as intelligent agents push the boundaries of sequential decision-making agents but struggle with limited knowledge of environmental dynamics and exponentially large action spaces. Recent efforts like GLAM and TWOSOME manually constrain the action space to a restricted subset and employ reinforcement learning to align agents' knowledge with specific environments. However, they overlook fine-grained credit assignment for intra-action tokens, which is essential for efficient language agent optimization, and rely on human prior knowledge to restrict the action space. This paper proposes decomposing language agent optimization from the action level to the token level, offering finer supervision for each intra-action token and manageable optimization complexity in environments with unrestricted action spaces. Beginning with the simplification of flattening all actions, we…
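To make the action-level versus token-level distinction concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes a toy episode in which each action is a sequence of tokens, the environment reward arrives on the last token of each action, and return targets are plain Monte Carlo returns. The function name `token_level_returns` and the flag `discount_inside_action` are hypothetical. The sketch contrasts naively flattening all tokens into one sequence (discounting at every token) with discounting only at action boundaries.

```python
from typing import List


def token_level_returns(
    action_lengths: List[int],
    rewards: List[float],
    gamma: float,
    discount_inside_action: bool,
) -> List[float]:
    """Return one Monte Carlo return target per token of a flattened episode.

    Assumption (illustrative only): action i consists of action_lengths[i]
    tokens, and rewards[i] is received on the last token of that action.
    """
    token_rewards: List[float] = []
    ends_action: List[bool] = []
    for n_tokens, reward in zip(action_lengths, rewards):
        # Intra-action tokens get zero reward; the final token gets the reward.
        token_rewards += [0.0] * (n_tokens - 1) + [reward]
        ends_action += [False] * (n_tokens - 1) + [True]

    returns = [0.0] * len(token_rewards)
    future = 0.0  # return accumulated from the next token onward
    for t in reversed(range(len(token_rewards))):
        # Naive flattening discounts at every token; the boundary-aware
        # variant discounts only when crossing into the next action.
        if discount_inside_action or ends_action[t]:
            step_gamma = gamma
        else:
            step_gamma = 1.0
        future = token_rewards[t] + step_gamma * future
        returns[t] = future
    return returns


# Two actions: a 3-token action with reward 0, then a 2-token action with reward 1.
naive = token_level_returns([3, 2], [0.0, 1.0], gamma=0.5, discount_inside_action=True)
aware = token_level_returns([3, 2], [0.0, 1.0], gamma=0.5, discount_inside_action=False)
print(naive)  # [0.0625, 0.125, 0.25, 0.5, 1.0]
print(aware)  # [0.5, 0.5, 0.5, 1.0, 1.0]
```

In this toy setting the boundary-aware targets match the action-level returns (0.5 for the first action, 1.0 for the second), while the naively flattened targets shrink with the number of tokens in an action, so they depend on how long an action happens to be.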
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
