Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Devdhar Patel; Hava Siegelmann

arXiv:2410.08979·cs.LG·July 29, 2025


TL;DR

This paper introduces Sequence Reinforcement Learning (SRL), a novel algorithm that enables effective control at lower decision frequencies by learning action sequences with a model-based approach, reducing sample complexity and maintaining high performance.

Contribution

SRL is the first to combine model-based and model-free methods with a temporal recall mechanism for low-frequency decision control in continuous tasks.

Findings

1. SRL achieves comparable performance to state-of-the-art algorithms.
2. SRL significantly reduces actor sample complexity.
3. SRL outperforms traditional RL in Frequency-Averaged Score (FAS).
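The findings name the Frequency-Averaged Score without defining it here. As a rough illustration only, the sketch below assumes FAS can be read as the mean of reference-normalized returns across the decision periods at which a policy is evaluated. That reading, the function name `frequency_averaged_score`, and the numbers in the example are all assumptions made for illustration, not the paper's definition or its results.

```python
from typing import Mapping


def frequency_averaged_score(
    scores_by_period: Mapping[int, float],
    reference_score: float,
) -> float:
    """Mean of normalized returns over the evaluated decision periods.

    ASSUMPTION: this averaging rule is a plausible reading of the metric's
    name, not a definition taken from the paper.
    scores_by_period maps "primitive steps between decisions" to the
    return obtained when the policy is queried at that period.
    """
    if not scores_by_period:
        raise ValueError("need at least one evaluated decision period")
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    normalized = [score / reference_score for score in scores_by_period.values()]
    return sum(normalized) / len(normalized)


# Hypothetical returns: performance degrades as decisions become less frequent.
example_scores = {1: 100.0, 2: 50.0, 4: 25.0, 8: 25.0}
fas = frequency_averaged_score(example_scores, reference_score=100.0)
```

Under this reading, a policy that keeps its return as the decision period grows scores close to 1, while a policy that only works at the fastest frequency scores much lower.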

Abstract

Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is…
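The temporal recall idea described in the abstract can be sketched in a few lines. The example assumes a one-dimensional state, a toy function `toy_model` standing in for the learned dynamics model, and a toy function `toy_critic` standing in for the learned critic. These names and functions are hypothetical and are not taken from the paper or its repository. The sketch only shows the data flow: the actor emits a whole action sequence from one observed state, the model fills in the intermediate states, and the critic scores each primitive action at its estimated state.

```python
from typing import Callable, List, Sequence


def temporal_recall(
    s0: float,
    actions: Sequence[float],
    model: Callable[[float, float], float],
    critic: Callable[[float, float], float],
) -> List[float]:
    """Return one critic value per primitive action in the sequence.

    Only s0 is observed. Every later state is an estimate produced by
    rolling the model forward through the actions chosen so far.
    """
    values = []
    state = s0
    for action in actions:
        values.append(critic(state, action))  # learning signal for this action
        state = model(state, action)          # estimated intermediate state
    return values


def toy_model(state: float, action: float) -> float:
    # Hypothetical dynamics: the action shifts the state.
    return state + action


def toy_critic(state: float, action: float) -> float:
    # Hypothetical critic: prefers actions that bring the next state to zero.
    return -((state + action) ** 2)


# One observed state, a three-step action sequence from a hypothetical actor.
values = temporal_recall(1.0, [-0.5, -0.25, -0.25], toy_model, toy_critic)
sequence_objective = sum(values) / len(values)
```

In an actual training loop the per-action values would be differentiated with respect to the actor's parameters; here they are plain floats so the flow of states and actions stays easy to follow.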

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The research motivation of the paper, specifically the comparison of decision patterns and frequencies between RL and humans, is compelling.
- The paper establishes a strong connection to biological fundamentals, providing relevant examples and insights throughout.
- The main idea (designing an RL algorithm inspired by biological principles, where each component operates at different frequencies) is novel, and the storyline leading up to the experiment section is smooth and easy to follow.

Weaknesses

## Lack of related works

- This paper primarily focuses on its connection to biological insights and motivation but overlooks relevant efforts in the RL literature addressing frame-skipping, action repetition, long-horizon exploration, and action correlations. Several studies, such as [1,2,3,4], have explored similar topics from different perspectives. Although these works are not biologically motivated, their contributions are highly relevant to this paper and should not be ignored.
- The way

Reviewer 02 · Rating 8 · Confidence 2

Strengths

The introduced SRL method is explained well and is easy to understand while also showing significant improvement in the introduced FAS score. Experimental evaluation is good and the usefulness of the FAS score is adequately demonstrated. The proposed SRL framework is well motivated using recent neuroscientific discoveries.

Weaknesses

Demonstrating the improvement in sim-to-real transfer would add to the quality of the paper. Likewise, including methods that use action repetition and macro-actions could be an interesting addition.

Minor mistakes:

- Line 104: demonstrate**s**
- Line 376: performance of the policy **in** when the frequency is not constant
- Line 535: we introduce**s** the Frequency-Averaged-Score (FAS) metric

Finally, I can't say that I fully agree with the statement "simple tasks like walking can be performe

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper investigates an interesting problem, considering learning of temporally consistent open-loop action sequences that require fewer queries to the policy during deployment. While the approach is reliant on the quality of the underlying learned model, this enables learned MPC-like control at one end (deterministic 1-step) while also accommodating stochastic sampling periods by using open-loop actions until the next query is possible. The authors evaluate the method across several environme

Weaknesses

- The selection of baselines and/or environments should be expanded, as the paper makes a quantitative argument of performance. Other model-free agents (e.g. D4PG, etc.) as well as model-based agents would help to better put the results into perspective. It could furthermore be interesting to run these experiments on the DeepMind Control suite as well, as Gym and DMC tasks have interesting differences in their underlying dynamics and resulting agent performance.
- It would be very informative to

Code & Models

Repositories

dee0512/Temporally-Layered-Architecture · PyTorch · Official


Taxonomy

Topics: Advanced Control Systems Optimization · Iterative Learning Control Systems