Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control

Hazim Alzorgan; Abolfazl Razi

arXiv:2505.09029·cs.AI·May 15, 2025

Open Access

TL;DR

This paper introduces Monte Carlo Beam Search (MCBS), a hybrid exploration method combining beam search and Monte Carlo rollouts with actor-critic algorithms to improve policy convergence and sample efficiency in continuous control tasks.

Contribution

The paper presents MCBS, a novel hybrid exploration technique that enhances actor-critic reinforcement learning by integrating structured look-ahead search with existing algorithms like TD3.

Findings

01. MCBS outperforms TD3, SAC, PPO, and A2C in sample efficiency and convergence speed.

02. MCBS reaches 90% of the maximum reward in fewer timesteps than the baseline methods.

03. Hyperparameter analysis and adaptive strategies improve MCBS performance in complex environments.
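
As an illustration of the threshold metric in the second finding, the sketch below finds the first logged timestep at which a learning curve reaches a given fraction of its maximum reward. It is only a minimal sketch: the function name `timesteps_to_fraction_of_max` is hypothetical, the curve values are made up for the example, and taking the maximum over the curve itself is an assumption (the paper may define the maximum reward differently).

```python
def timesteps_to_fraction_of_max(timesteps, rewards, fraction=0.9):
    """Return the first timestep whose reward is >= fraction * max(rewards).

    Assumes the maximum reward is positive; with a non-positive maximum the
    scaled target is not meaningful and the function may return None.
    """
    if len(timesteps) != len(rewards) or not rewards:
        raise ValueError("timesteps and rewards must be non-empty and equal in length")
    target = fraction * max(rewards)
    for t, r in zip(timesteps, rewards):
        if r >= target:
            return t
    return None


# Illustrative (made-up) evaluation curve, not data from the paper.
steps = [0, 10_000, 20_000, 30_000, 40_000]
curve = [100.0, 2500.0, 4300.0, 4700.0, 5000.0]
print(timesteps_to_fraction_of_max(steps, curve))  # prints 30000
```

Comparing this value across algorithms on the same benchmark gives one simple way to rank them by convergence speed.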

Abstract

Actor-critic methods, like Twin Delayed Deep Deterministic Policy Gradient (TD3), depend on basic noise-based exploration, which can result in less than optimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS produces several candidate actions around the policy's output and assesses them through short-horizon rollouts, enabling the agent to make better-informed choices. We test MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency and performance compared to standard TD3 and other baseline methods like SAC, PPO, and A2C. Our findings emphasize MCBS's capability to enhance policy learning through structured look-ahead search while ensuring…
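
The action-selection idea described in the abstract (candidates around the policy's output, scored by short-horizon rollouts) can be sketched roughly as follows. This is not the authors' implementation: the function `mcbs_select_action`, its parameters, the one-step pruning rule, and the toy 1-D environment are all assumptions made for illustration, and the learned TD3 actor and critic are replaced by plain Python callables (`policy`, `value_fn`).

```python
import random


def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def perturb(action, std, rng):
    # Add Gaussian noise to each action dimension and clip to the action bounds.
    return [clip(a + rng.gauss(0.0, std)) for a in action]


def mcbs_select_action(state, policy, step_fn, value_fn, rng,
                       n_candidates=8, beam_width=3, horizon=4,
                       n_rollouts=4, noise_std=0.3, rollout_noise_std=0.1,
                       gamma=0.99):
    """Hypothetical sketch: choose an action near policy(state) by look-ahead."""
    base = [clip(a) for a in policy(state)]
    # Candidates: the policy's own action plus noisy copies of it.
    candidates = [base] + [perturb(base, noise_std, rng)
                           for _ in range(n_candidates - 1)]

    # Beam step (assumed rule): rank by one-step reward plus bootstrapped
    # value and keep only the top beam_width candidates.
    def one_step_score(action):
        next_state, reward = step_fn(state, action)
        return reward + gamma * value_fn(next_state)

    beam = sorted(candidates, key=one_step_score, reverse=True)[:beam_width]

    # Monte Carlo step: discounted return of a short rollout that follows the
    # (noisy) policy after the first action, bootstrapped with value_fn.
    def rollout_return(first_action):
        s, total = step_fn(state, first_action)
        discount = gamma
        for _ in range(horizon - 1):
            s, reward = step_fn(s, perturb(policy(s), rollout_noise_std, rng))
            total += discount * reward
            discount *= gamma
        return total + discount * value_fn(s)

    def mean_return(action):
        return sum(rollout_return(action) for _ in range(n_rollouts)) / n_rollouts

    return max(beam, key=mean_return)


# Toy 1-D problem (an assumption, not a benchmark from the paper):
# the reward is highest when the state is driven to zero.
def toy_step(state, action):
    next_state = [s + a for s, a in zip(state, action)]
    return next_state, -sum(x * x for x in next_state)


def toy_policy(state):
    return [0.0 for _ in state]  # a deliberately poor policy: never moves


def toy_value(state):
    return -sum(x * x for x in state)


action = mcbs_select_action([1.0], toy_policy, toy_step, toy_value,
                            random.Random(0))
print(action)
```

In a real setting `step_fn` would have to come from a simulator that can be copied or reset to the current state, which is itself an assumption about the environment; the number of candidates, beam width, and horizon trade extra simulation cost for better-informed action choices.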

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Clipped Double Q-learning · Adam · Experience Replay · Target Policy Smoothing · Average Pooling · Dense Connections · A2C · Twin Delayed Deep Deterministic · Global Average Pooling