LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization
Yang Zhao, Zihao Li, Zhiyu Jiang, Dandan Ma, Ganchao Liu, Wenzhe Zhao

TL;DR
This paper introduces NAR-CP (Normalized Action Reward guided Consistency Policy Optimization), a method for high-frequency decision-making with LLMs that uses reward normalization and a consistency loss to improve policy alignment and performance in UAV pursuit tasks.
Contribution
The paper proposes NAR-CP, combining reward normalization and consistency loss, to enhance LLMs' performance in high-frequency decision tasks, addressing policy misalignment issues.
Findings
Superior performance on UAV pursuit tasks
Effective generalization to unseen tasks
Improved policy alignment in composite tasks
Abstract
While Large Language Models (LLMs) form the cornerstone of sequential decision-making agent development, they have inherent limitations in high-frequency decision tasks. Existing research mainly focuses on discrete embodied decision scenarios with low decision frequency and significant semantic differences in the state space (e.g., household planning). These methods perform poorly in high-frequency decision-making tasks, because the high-precision numerical state information in such tasks is updated frequently with minimal fluctuations, and they exhibit policy misalignment between the learned sub-tasks and composite tasks. To address these issues, this paper proposes Normalized Action Reward guided Consistency Policy Optimization (NAR-CP). 1) Our method first acquires predefined dense rewards from environmental feedback of candidate actions via reward functions, then completes reward…
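The abstract is cut off before it says how the rewards of candidate actions are normalized, so the sketch below is only an illustration under stated assumptions, not the paper's procedure. It assumes a z-score normalization across the candidate actions of a single state, which is one common way to keep small numerical differences between rewards from vanishing. The function name `normalize_action_rewards`, the `eps` parameter, and the example reward values are all hypothetical.

```python
from statistics import mean, pstdev


def normalize_action_rewards(rewards, eps=1e-8):
    """Z-score normalize rewards across the candidate actions of one state.

    Assumption: z-score normalization is used here purely for illustration;
    the source text does not specify the normalization scheme.
    """
    mu = mean(rewards)          # average reward over candidate actions
    sigma = pstdev(rewards)     # population standard deviation
    # eps avoids division by zero when all candidates share one reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical dense rewards for four candidate actions. The raw values
# differ only in the third decimal place, as in a high-frequency task
# where the state changes very little between steps.
raw = [0.501, 0.502, 0.503, 0.504]
normalized = normalize_action_rewards(raw)
```

After normalization the candidates keep their ranking but are spread over a range of roughly a standard deviation each, so the relative preference between actions is no longer hidden in the low-order digits.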
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
