Bootstrap Policy Iteration for Stochastic LQ Tracking with Multiplicative Noise
Jiayu Chen, Zhenhui Xu, Xinghu Wang

TL;DR
This paper introduces a bootstrap policy iteration algorithm for optimal tracking control of stochastic linear systems with multiplicative noise, enabling model-free reinforcement learning and data-driven computation of control gains.
Contribution
It develops a two-phase bootstrap policy iteration method and a data-driven off-policy RL approach for stochastic LQ tracking, including a novel method for systems with state-dependent noise.
Findings
Converges to the optimal feedback gain under interval excitation.
Ensures stability and optimality in stochastic LQ tracking.
Validated through numerical examples demonstrating effectiveness.
Affiliations: Institute of Science Tokyo; University of Science and Technology of China. (e-mail: [email protected])
Abstract
This paper studies the optimal tracking control problem for continuous-time stochastic linear systems with multiplicative noise. The solution framework involves solving a stochastic algebraic Riccati equation for the feedback gain and a Sylvester equation for the feedforward gain. To enable model-free optimal tracking, we first develop a two-phase bootstrap policy iteration (B-PI) algorithm, which bootstraps a stabilizing control gain from the trivially initialized zero-value start and proceeds with standard policy iteration. Building on this algorithm, we propose a data-driven, off-policy reinforcement learning approach that ensures convergence to the optimal feedback gain under the interval excitation condition. We further introduce a data-driven method to compute the feedforward using the obtained feedback gain. Additionally, for systems with state-dependent noise, we propose a shadow system-based optimal tracking method to eliminate the need for probing noise. The effectiveness of the proposed methods is demonstrated through numerical examples.
Keywords
Optimal tracking control, stochastic system, policy iteration, reinforcement learning
1 Introduction
Reinforcement learning (RL), a prominent subfield of machine learning, has become increasingly influential for tackling complex optimization and decision-making problems in uncertain environments. RL enables agents to learn optimal strategies through iterative interactions with their environment, guided by feedback in the form of rewards or penalties. This capability has driven extensive research in the optimal control field, resulting in notable practical applications and theoretical progress [1, 2]. Two fundamental iterative methods within RL are value iteration (VI) [3, 4, 5] and policy iteration (PI) [6, 7, 8]. VI iteratively updates the value function based on the Bellman optimality principle, ensuring convergence but often slowly. PI alternates between policy evaluation and improvement phases, typically achieving quadratic convergence, but requires an initial stabilizing policy. Enhanced algorithms, such as the λ-PI algorithm [9, 10], the homotopy-based PI algorithm [11, 12], and hybrid or composite methods [13, 14], have been developed to mitigate these limitations and improve convergence rates. Moreover, efficient exploration is essential in the development of RL-based optimal control to ensure the uniqueness of solutions to data-based iterative equations. This requirement is related to the persistence of excitation (PE) condition [8, 15]. Without the PE condition, RL algorithms may fail to converge to the optimal control policy. Recent research has introduced a more relaxed alternative, the interval excitation (IE) condition, implemented through filter-based PI and VI algorithms [16, 17]. Unlike PE, IE only requires sufficient excitation over certain finite intervals. In practice, both PE and IE are usually achieved by injecting probing noise into the input channel during the learning process. Designing this excitation requires carefully balancing the need for exploration with the goal of maintaining system stability and performance. This complex trade-off lies at the core of the exploration-exploitation dilemma (see [18]) and remains an open research topic in RL-based control design.
Among the various optimal control problems, the linear quadratic tracking problem (LQTP) for continuous-time (CT) systems has recently attracted renewed interest as a benchmark for evaluating RL-based control techniques. The objective of the LQTP is to guide a linear system along a predetermined trajectory while minimizing a quadratic cost function. Traditional RL-based optimal tracking solutions reformulate LQTPs into equivalent regulation problems by augmenting the state space and redefining the cost function, often in discounted forms. This key transformation, initially proposed in foundational works [19, 20], enables the straightforward application of existing RL methodologies but significantly increases system dimensionality and computational complexity. Subsequent research has developed diverse RL-based optimal tracking methods, including model-free approaches without discount factors [21], output-feedback controllers for incomplete state information [22, 23], off-policy algorithms for external disturbances [24, 25], and exponential tracking controllers using output regulation theory [26, 27].
While most of the existing literature on RL-based optimal control focuses on deterministic systems, stochastic systems, particularly those involving multiplicative noise, offer a richer and more realistic modeling framework. This framework is relevant in diverse application areas, including financial systems [28] and networked control systems [29]. However, RL methods tailored for this type of system are scarce, largely due to the analytical complexities introduced by multiplicative noise. These complexities lead to the lack of closed-form solutions and require non-trivial extensions of classical analysis tools such as the Popov–Belevitch–Hautus test. Recent foundational advances have addressed these challenges [30, 31, 32, 33, 34], enabling initial developments of on-policy RL algorithms [35, 36] and off-policy RL algorithms [37, 38] for stochastic linear quadratic problems. Further advancements integrating state augmentation techniques with RL methods have also been introduced, providing model-free solutions to the stochastic linear quadratic tracking problem (SLQTP) [39].
Despite the notable theoretical progress brought about by recent advancements in RL for stochastic control, several significant challenges remain. This paper addresses these issues by investigating the CT-SLQTP with multiplicative noise. Firstly, the drawbacks of PI-based methods (i.e., the need for initial stabilizing policies) and VI-based approaches (i.e., slow convergence rates) still exist, and the enhanced methods developed for deterministic systems (e.g., [11, 12, 14]) are difficult to extend directly to stochastic scenarios due to the complexities introduced by multiplicative noise. Secondly, in data-driven methodologies, traditional state augmentation approaches (see [19, 39]) to solving the SLQTP significantly increase system dimensionality, making iterative computations more challenging and complicating the fulfillment of the necessary IE conditions. Thirdly, probing noise is typically required in data-driven methods to ensure PE/IE conditions. However, the introduction of probing noise may cause unnecessary oscillations or even risk destabilization. Thus, we explore whether full-rank conditions can still be established without probing noise.
To address these challenges, we formulate the SLQTP under an average cost criterion. Utilizing the calculus of variations, we derive an optimal tracking solution that includes a feedback (FB) gain determined by solving a stochastic algebraic Riccati equation (SARE) matching the original system’s dimension and a feedforward (FF) gain computed through a Sylvester equation. Next, we introduce a specially designed parameterized system to remove the need for an initial stabilizing control gain. Based on this system, we propose a B-PI method that allows convergence from a zero initialization. By leveraging the inherent mean-square stability of the parameterized system, we subsequently develop a novel off-policy RL method incorporating IE conditions, which uses trajectories of the parameterized system. Furthermore, for the scenario where the system volatility depends only on the state, we introduce shadow systems that enable the data-driven matrices to satisfy the IE conditions without the need for probing noise.
Within this proposed SLQTP framework, our main contributions are summarized as follows.
(1) A novel bootstrap PI algorithm is proposed to eliminate the requirement for the initial stabilizing condition. We demonstrate analytically that the stabilizing control gain can be obtained within finite steps, subsequently guaranteeing convergence to the optimal solution.
(2) We further propose an innovative data-driven optimal tracking control algorithm that completely eliminates the need for explicit system dynamics. This method integrates an off-policy RL strategy for FB learning and a data-driven solution for FF computation.
(3) A new shadow system-based optimal tracking method specifically for systems with state-dependent noise is introduced. This methodology ensures necessary full-rank conditions even in the absence of control inputs, thereby avoiding the need for probing noise.
This paper is organized as follows. Section 2 describes the problem formulation and presents basic results for the SLQTP. In Section 3, a B-PI framework for solving the SLQTP is introduced, along with its convergence analysis. A data-driven optimal tracking design method is then developed, which eliminates the need for prior knowledge of the system dynamics. To avoid using probing noise while still satisfying the required IE condition, a shadow system-based learning approach is also proposed. Section 4 presents simulation examples to demonstrate the effectiveness of the proposed approach. Finally, Section 5 concludes the paper.
The notations used in this paper are introduced as follows. Let and be a symmetric matrix. We define the following operators: ; ; . Let and be families of random variables. For a given time interval , we define the following terms: ; ; ; . Let denote the Moore-Penrose inverse of a full column-rank matrix . For any random vector , we denote its expectation by .
2 Problem formulation and preliminaries
2.1 Problem formulation
Consider a CT stochastic linear system governed by the Itô-type stochastic differential equation (SDE)
[TABLE]
where , , and represent the system state, control input, and system output, respectively. denotes a scalar standard Brownian motion defined on the filtered probability space with . The constant matrices have compatible dimensions and are assumed to be unknown.
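For intuition, a scalar Itô SDE with multiplicative noise can be simulated with the Euler–Maruyama scheme. The sketch below assumes the common form dx = (a x + b u) dt + (c x + d u) dw, which is an assumption here because the displayed equation is not reproduced above. The function name `simulate_sde`, the scalar coefficients, and the step size are illustrative choices, not values from the paper.

```python
import math
import random

def simulate_sde(a, b, c, d, x0, u_fn, dt, n_steps, seed=0):
    """Euler-Maruyama path of dx = (a*x + b*u) dt + (c*x + d*u) dw (assumed form)."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for k in range(n_steps):
        u = u_fn(k * dt, x)                 # input evaluated at the current time/state
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + (a * x + b * u) * dt + (c * x + d * u) * dw
        path.append(x)
    return path

# Hypothetical example: open-loop path (u = 0) with state-dependent noise only.
path = simulate_sde(a=-1.0, b=1.0, c=0.5, d=0.0, x0=1.0,
                    u_fn=lambda t, x: 0.0, dt=1e-3, n_steps=1000)
```

Because the diffusion term scales with the state and the input, the noise intensity shrinks as the state approaches the origin, which is the feature that distinguishes multiplicative from additive noise.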
We introduce the standard assumption for system (1).
Assumption 1
System (1) or is mean-square stabilizable and the pair is exactly detectable.
Remark 1
Mean-square stabilizability guarantees the existence of an FB gain such that the closed-loop system is asymptotically mean-square stable (see, e.g., [36, 40]). Exact detectability, a weaker concept than stochastic detectability, was introduced in [34, Definition 3.1], along with a necessary and sufficient condition known as the stochastic Popov–Belevitch–Hautus criterion [34, Theorem 3.1]. These properties are fundamental to the design and analysis of optimal control strategies for stochastic linear systems [30, 41]. They also play a critical role in the subsequent theoretical developments.
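In the scalar case, mean-square stability can be checked in closed form, which gives a quick illustration of what a stabilizing gain means. The sketch assumes the scalar closed loop dx = (a - b k) x dt + (c - d k) x dw, for which Itô's formula gives d E[x^2]/dt = (2(a - b k) + (c - d k)^2) E[x^2]. The function names and numerical values are hypothetical.

```python
def ms_growth_rate(a, b, c, d, k):
    """Exponent of E[x(t)^2] for dx = (a - b*k) x dt + (c - d*k) x dw (assumed form)."""
    a_k = a - b * k
    c_k = c - d * k
    return 2.0 * a_k + c_k ** 2

def is_ms_stabilizing(a, b, c, d, k):
    """True when the gain k makes the scalar closed loop mean-square stable."""
    return ms_growth_rate(a, b, c, d, k) < 0.0

# Hypothetical plant: unstable drift a = 1 with state-dependent noise c = 0.5.
print(is_ms_stabilizing(1.0, 1.0, 0.5, 0.0, k=0.0))  # False
print(is_ms_stabilizing(1.0, 1.0, 0.5, 0.0, k=2.0))  # True
```

With strong input-dependent noise the rate can stay positive for every gain (for instance a = 1, b = 1, c = 0, d = 2 gives 4k^2 - 2k + 2 > 0), so stabilizability in the mean-square sense is a genuine restriction rather than a consequence of controllability of the drift.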
The control objective is to design a control policy such that the system output tracks a desired reference trajectory , which is generated by an autonomous linear system
[TABLE]
Here, and are also unknown.
Remark 2
The reference system can be designed to be either asymptotically stable or marginally stable. In this work, we consider the more general case of marginal stability, which generates bounded trajectories. This formulation accommodates a broad class of reference signals, including unit steps, sinusoids, and their combinations. Furthermore, varying allows for the synthesis of a diverse family of trajectories from the same underlying ODE, as demonstrated in the simulation section.
To formalize this structural property, we introduce the following assumption for the reference generator.
Assumption 2
All eigenvalues of the matrix have zero real parts i.e., , , and their algebraic and geometric multiplicities are equal.
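For a 2x2 reference generator, the condition in Assumption 2 can be checked from the trace and determinant alone. The sketch below is restricted to real 2x2 matrices; the function name `satisfies_assumption_2` and the tolerance are illustrative assumptions.

```python
def satisfies_assumption_2(S, tol=1e-12):
    """Check a real 2x2 matrix S: eigenvalues on the imaginary axis and semisimple."""
    (s11, s12), (s21, s22) = S
    trace = s11 + s22
    det = s11 * s22 - s12 * s21
    # Eigenvalues solve lambda^2 - trace*lambda + det = 0.
    if abs(trace) > tol or det < -tol:
        return False  # some eigenvalue has a nonzero real part
    if det > tol:
        return True   # distinct pair +/- i*sqrt(det): automatically semisimple
    # Double eigenvalue at 0: semisimple only if S is the zero matrix.
    return all(abs(v) <= tol for row in S for v in row)

print(satisfies_assumption_2([[0.0, 1.0], [-1.0, 0.0]]))  # True  (sinusoid)
print(satisfies_assumption_2([[0.0, 0.0], [0.0, 0.0]]))   # True  (constant steps)
print(satisfies_assumption_2([[0.0, 1.0], [0.0, 0.0]]))   # False (ramp, unbounded)
```

The third case has both eigenvalues at zero but a nontrivial Jordan block, so it generates an unbounded ramp and is excluded by the multiplicity requirement.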
To quantify the tracking performance and penalize control effort, we introduce the infinite-horizon average cost functional
[TABLE]
where
[TABLE]
, and .
Definition 1 (Admissible control policy)
A control policy is said to be admissible, denoted by , if it satisfies the following properties: i). ; ii). system (1) with is asymptotically mean-square stable; iii). .
For the controlled system (1) and the reference system (2), the aim is to determine a control policy that minimizes the infinite-horizon average cost functional defined in (4).
2.2 Optimality analysis
We derive the optimal solution of the above SLQTP by applying the calculus of variations directly. First, given matrices from and matrix , we introduce the generalized Lyapunov operator defined as
[TABLE]
with its spectrum given by
[TABLE]
Let denote the optimal control with associated state trajectory . We examine perturbed trajectories around this optimal pair , taking the form and , with small . Here, and evolves according to with . The cost functional (3) expands as
[TABLE]
where and . Given and , it follows immediately that .
To derive the first-order necessary conditions, introduce a costate process satisfying , where and will be specified later. Consider the term , which equals due to . Applying Itô’s formula, we rewrite this term as . Thus, we have the following equality . Substituting this into yields
[TABLE]
where $f_{1}=\big(H^{\top}Q(Hx^{*}-H_{d}x_{d})+\boldsymbol{f}+A^{\top}\boldsymbol{p}+C^{\top}\boldsymbol{q}\big)$ and . Following the sweep method [30, 42], the costate is expressed as
[TABLE]
with and , leading to
[TABLE]
Additionally, the terminal term of (6) is bounded by
[TABLE]
Since and is bounded, this gives rise to .
Setting the first variation to zero for all admissible perturbations implies the optimality conditions and . By substituting Eqs. (7) and (9) into , we obtain . Since the matrix is positive definite, we can uniquely solve for the optimal control policy as
[TABLE]
where
[TABLE]
Further substituting Eqs. (7)-(9) and the obtained optimal control policy into the condition leads to the equality , which holds for all . Consequently, we deduce the following equations
[TABLE]
[TABLE]
where Eq. (14) is a SARE and Eq. (15) is a Sylvester equation. Next, we establish sufficient conditions for the optimal solution based on these two equations.
Theorem 1
Suppose Assumptions 1 and 2 hold. Let and be the solutions of (14) and (15), respectively. Then, we have .
Proof: Under Assumption 1, by [34, Theorem 4.1] there exists a unique solution to Eq. (14) such that . It follows that the matrix is Hurwitz. Moreover, Assumption 2 ensures that all the eigenvalues of have zero real parts. Therefore, for all , , which implies that Eq. (15) admits a unique solution. As a result, we can conclude that . Combining this with Eqs. (6) and (10) and noting the boundedness of , we obtain . Finally, substituting this result into Eq. (5) yields .
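A scalar worked case may help make the SARE concrete. The sketch below assumes the standard scalar form (2a + c^2) P + q - ((b + c d) P)^2 / (r + d^2 P) = 0 for the dynamics dx = (a x + b u) dt + (c x + d u) dw with running cost q x^2 + r u^2; this form, the function names, and the numbers are assumptions for illustration and are not copied from Eq. (14).

```python
import math

def solve_scalar_sare(a, b, c, d, q, r):
    """Positive root P of the assumed scalar SARE and the associated gain K."""
    m = 2.0 * a + c ** 2
    g = b + c * d
    # Clearing the denominator gives a2*P^2 + a1*P + a0 = 0.
    a2 = m * d ** 2 - g ** 2
    a1 = m * r + q * d ** 2
    a0 = q * r
    if a2 >= 0.0:
        raise ValueError("not mean-square stabilizable by feedback, or no control authority")
    P = (-a1 - math.sqrt(a1 ** 2 - 4.0 * a2 * a0)) / (2.0 * a2)
    K = g * P / (r + d ** 2 * P)
    return P, K

def sare_residual(P, a, b, c, d, q, r):
    """Left-hand side of the assumed scalar SARE evaluated at P."""
    return (2 * a + c ** 2) * P + q - ((b + c * d) * P) ** 2 / (r + d ** 2 * P)

P, K = solve_scalar_sare(a=1.0, b=1.0, c=0.5, d=0.0, q=1.0, r=1.0)
```

Since a2 < 0 and a0 > 0, the quadratic has exactly one positive root, which mirrors the uniqueness statement used in the proof.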
Remark 3
This approach differs from the augmented formulations found in [19, 20, 39], which reformulate the tracking problem as a regulation problem on an augmented state of dimension . While those methods lead to an optimal control policy structurally similar to (11), they require solving a Riccati equation of size that includes a redundant block for the construction of , increasing the computational burden. In contrast, our framework separates the computation: the feedback gain is derived from a nonlinear SARE (14), while the feedforward gain comes from a linear Eq. (15). The off-policy RL algorithm we propose later is tailored to solve this SARE associated with the FB component.
3 Main results
We propose a B-PI framework to compute the optimal pair , incorporating both model-based and data-driven algorithms. Unlike conventional PI methods [36, 43], which require an initial stabilizing gain, the B-PI method introduces a novel two-phase iterative mechanism that removes this dependency.
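For comparison, conventional policy iteration can be sketched in the scalar case, where each policy evaluation reduces to one division. The dynamics dx = (a x + b u) dt + (c x + d u) dw, the cost q x^2 + r u^2, and all names below are assumptions for illustration; the sketch refuses to start from a gain that is not mean-square stabilizing, which is exactly the dependency that B-PI is designed to remove.

```python
def policy_iteration(a, b, c, d, q, r, k0, iters=30):
    """Scalar Kleinman-type policy iteration from a stabilizing initial gain k0."""
    k = k0
    P = None
    for _ in range(iters):
        rate = 2.0 * (a - b * k) + (c - d * k) ** 2
        if rate >= 0.0:
            raise ValueError("gain is not mean-square stabilizing")
        # Policy evaluation: rate * P + q + r*k^2 = 0 (scalar Lyapunov equation).
        P = (q + r * k ** 2) / (-rate)
        # Policy improvement.
        k = (b + c * d) * P / (r + d ** 2 * P)
    return P, k

# Hypothetical plant; k0 = 2 is stabilizing here, k0 = 0 is not.
P, k = policy_iteration(a=1.0, b=1.0, c=0.5, d=0.0, q=1.0, r=1.0, k0=2.0)
```

Starting from k0 = 0 on this plant raises an error at the first evaluation step, because the open loop is not mean-square stable.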
In the first phase, B-PI bootstraps a stabilizing control gain from the zero-value start by iteratively solving a parameterized Lyapunov equation. Once stabilization is achieved, the second phase continues iterating to optimize performance with respect to the specified cost criterion. After obtaining , these are then used to design a data-based equation for computing , in place of solving model-based Eq. (15).
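The two-phase idea can be illustrated on a scalar plant. In the sketch below, the parameterized system is assumed to be the original system with its drift shifted to a - alpha, and the rule used to shrink alpha is a hypothetical placeholder chosen for simplicity; it is not the update rule of the paper, whose Eqs. (16)-(18) are not reproduced here. All names and values are illustrative.

```python
def growth_rate(a, b, c, d, k):
    """Mean-square growth exponent of dx = (a - b*k) x dt + (c - d*k) x dw."""
    return 2.0 * (a - b * k) + (c - d * k) ** 2

def pi_step(a, b, c, d, q, r, k):
    """One evaluation/improvement step; requires growth_rate(a, b, c, d, k) < 0."""
    P = (q + r * k ** 2) / (-growth_rate(a, b, c, d, k))
    return P, (b + c * d) * P / (r + d ** 2 * P)

def bootstrap_pi(a, b, c, d, q, r, alpha0, margin=0.1, max_phase1=100, iters=50):
    k, alpha = 0.0, alpha0  # zero initial gain
    if growth_rate(a - alpha, b, c, d, k) >= 0.0:
        raise ValueError("alpha0 too small: zero gain does not stabilize the shifted system")
    # Phase I (hypothetical schedule): iterate on the shifted system and pick the
    # next shift so that the current gain stabilizes it with a fixed margin.
    steps = 0
    while alpha > 0.0:
        if steps == max_phase1:
            raise RuntimeError("Phase I did not reach alpha = 0")
        _, k = pi_step(a - alpha, b, c, d, q, r, k)
        alpha = max(0.0, 0.5 * growth_rate(a, b, c, d, k) + margin)
        steps += 1
    # Phase II: standard policy iteration on the original system.
    P = None
    for _ in range(iters):
        P, k = pi_step(a, b, c, d, q, r, k)
    return P, k

P, k = bootstrap_pi(a=1.0, b=1.0, c=0.5, d=0.0, q=1.0, r=1.0, alpha0=2.0)
```

On this example the shift drops from 2 to about 0.65 and then to 0 in two Phase I steps, after which Phase II converges from the resulting stabilizing gain.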
3.1 Useful lemmas for parameterized system
To develop these iterative schemes, both model-based and data-driven, we first establish several key results.
Let be a given constant, and consider the parameterized system with . For this system, we define the stabilizing gain set as
[TABLE]
Moreover, given , we introduce the parameterized Lyapunov equation
[TABLE]
where and are design parameters that provide flexibility in shaping the iterative process. Based on the solution , we define the following operators for feedback gain update and the update as
[TABLE]
We next present an important result, whose proof is provided in Appendix .1. In the remainder of the paper, denotes a positive definite matrix.
Lemma 1
Let for an arbitrary , and suppose that Assumption 1 holds. Let be the unique solution of Eq. (16), and let and . Then the following statements hold:
1) If and , then , , and .
2) If and , then and , where is the solution of Eq. (14).
Remark 4
Lemma 1 forms the analytical cornerstone for the convergence result of the B-PI algorithm along with the closed-loop asymptotic mean-square stability of the parameterized system. Specifically, when , Statement 1 of Lemma 1 is crucial for ensuring that the search for a stabilizing control gain can be completed within a finite number of steps. Furthermore, note that when , the parameterized system coincides with the original system . In this setting, choosing , corresponding to the cost functional (3), allows the iterative procedure governed by (16)-(17) to approximate the optimal solution. Statement 2 of Lemma 1 is thus key for establishing convergence.
In the subsequent two-phase scheme, the parameter will be assigned different symmetric matrices depending on the respective phase. Meanwhile, the parameter will be iteratively updated during Phase I according to (18), starting from the following initial condition.
Lemma 2
If there exist such that , then .
Proof: Given , we obtain . Thus, it follows directly that .
Remark 5
Note that the scalar can be chosen sufficiently large and sufficiently small to meet the required initial condition. Consequently, the parameterized system is inherently asymptotically mean-square stable. Compared to the restrictive assumption of the existence of a stabilizing gain matrix, this condition is easy to achieve. It plays an important role not only in the development of the model-based B-PI approach but also in the establishment of the data-driven method. In particular, within the data-driven framework, this inherent stability allows for designing the input solely based on probing noise, which is typically time-varying and sufficiently rich [8, 18]. Such a design ensures the collected data remains informative for learning, without requiring an explicit feedback control component.
To facilitate this, we define , and introduce the discounted state and output variables
[TABLE]
Then the system dynamics given in (1) is transformed into the parameterized system :
[TABLE]
Thus, the trajectories of can be interpreted as a discounted transformation (DT) of those of the original system , with the input unchanged.
Given parameters and , we can get the expected integrals , and , along the trajectories (i.e., ) over the interval , where , and construct
[TABLE]
Let be a sampling interval, and define the discrete time instants , where is the initial time for data collection. Denote the matrices
[TABLE]
We have the following result, whose proof is provided in Appendix .2.
Assumption 3
There exists an integer such that
[TABLE]
Lemma 3
Suppose Assumption 3 holds. If , then the matrix has full column rank.
Remark 6
Lemma 3 provides theoretical support for the convergence of the data-driven method. The rank condition in Assumption 3 corresponds to the IE condition. Specifically, the full rank of implies positive definiteness of , which is a Riemann sum approximation to the integral , where . Unlike the strict PE requirement, which demands sufficient energy in every sliding window, IE only asks that the signal be rich enough over the finite interval. Moreover, it is worth emphasizing that the integration interval and the sampling period are fundamentally different: specifies the time span over which data is integrated, while represents the sampling frequency and can be chosen arbitrarily small depending on the capabilities of the hardware.
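The link between a rank condition and positive definiteness of a Gram-type sum can be illustrated with two-dimensional regressors. The sketch below is a generic check and does not use the paper's data matrices; the function names, the threshold, and the sample signals are assumptions.

```python
import math

def gram_matrix(samples):
    """Entries of the sum of outer products phi_k phi_k^T for 2-D regressors."""
    g11 = sum(p[0] * p[0] for p in samples)
    g12 = sum(p[0] * p[1] for p in samples)
    g22 = sum(p[1] * p[1] for p in samples)
    return g11, g12, g22

def min_eigenvalue(g11, g12, g22):
    """Smallest eigenvalue of the symmetric matrix [[g11, g12], [g12, g22]]."""
    mean = 0.5 * (g11 + g22)
    radius = math.hypot(0.5 * (g11 - g22), g12)
    return mean - radius

def interval_excited(samples, threshold=1e-8):
    """True when the samples collected over the interval span both directions."""
    return min_eigenvalue(*gram_matrix(samples)) > threshold

# Hypothetical data: two independent components versus two collinear ones.
rich = [(math.sin(0.1 * k), math.cos(0.1 * k)) for k in range(100)]
poor = [(math.sin(0.1 * k), 2.0 * math.sin(0.1 * k)) for k in range(100)]
print(interval_excited(rich), interval_excited(poor))  # True False
```

The check is applied once to the data gathered over a finite window, in contrast to a PE-style test that would have to hold on every sliding window.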
3.2 Bootstrap policy iteration algorithm
With the theoretical foundation established by Lemma 1, we are now ready to formalize the B-PI algorithm, summarized in Algorithm 1. The algorithm operates in two phases: Phase I is outlined in lines 2-7, while Phase II is given in lines 8-12. Define the index . Theoretical results of this algorithm are established in Theorem 2, with detailed proof given in Appendix .3.
Throughout the subsequent analyses, we consistently initialize (or ) as the zero matrix and denote by and the unique solutions to Eqs. (14) and (12), respectively.
Theorem 2
Suppose Assumption 1 holds, and there exist constants such that . Let the sequence be generated by iteratively solving (23)-(25), and let the sequence be generated by iteratively solving (24) and (26). Then the following properties are satisfied:
1) and .
2) and .
Remark 7
Classical PI methods (e.g., [35, 36, 44]) for CT stochastic linear systems require a stabilizing initial feedback gain. In contrast, our B-PI framework removes this restrictive requirement by introducing a parameterized system and a parameterized Lyapunov equation (16). Phase I automatically drives upward and returns a stabilizing gain in a finite number of iterations even from ; Phase II continues on the original system to reach the optimal solution. Unlike [14], which enforces to obtain a monotonic sequence , our design employs a tunable weighting matrix in (23) that decouples the “stabilize‑first” stage from the task cost, thereby relaxing the requirement on . As a result, the method accommodates and extends to indefinite when combined with the result in [45].
Note that in practical implementations, the calculated control gains may be biased due to model uncertainties. To address this issue, we can directly extend the B-PI framework to explicitly account for these disturbances, and analyze the robustness of the algorithm under perturbed conditions by using [45, Theorem 6]. It demonstrates that if the perturbations remain small, the algorithm ensures convergence of the sequence to a neighborhood around the optimal solution.
3.3 Data-driven optimal tracking control
We now develop a data-driven method to sequentially determine the gain matrices and , without requiring knowledge of the system parameters, based on Algorithm 1 and Eq. (15).
To derive a data-based iterative equation, we first recall system (20) and apply Itô’s formula to to obtain
[TABLE]
Let , , and . We then substitute (24) into the above equation, and use to rearrange the expression, which gives
[TABLE]
where
[TABLE]
By substituting for in (27) and applying the iterative Eq. (23) from Phase I of Algorithm 1 to the right-hand side of Eq. (27), we obtain
[TABLE]
Similarly, replacing with and applying Eq. (26) from Phase II of Algorithm 1, the expression becomes
[TABLE]
Next, we integrate both sides of Eqs. (28) and (29) along (20) over , take expectations, and use Kronecker product representation, to obtain
[TABLE]
where , , , and .
Based on the sampling points, we construct the matrices , , and . This leads to the matrix-form representations of Eqs. (30) and (31) as
[TABLE]
Since and , and based on the data-driven matrices and the theoretical foundations established in Lemma 3, we are now in a position to develop an off-policy RL algorithm as an extension of Algorithm 1. In particular, lines 4-5 of the model-based algorithm are replaced by lines 5-6 of Algorithm 2.
In addition, lines 10-11 of Algorithm 1 are modified as lines 11-12 of Algorithm 2.
We have the following result, whose proof is provided in Appendix .4.
Theorem 3
Suppose the conditions of Theorem 2 and Assumption 3 are satisfied. Let the sequence be generated by iteratively solving Eqs. (39),(40), and (25), and let the sequence be generated by iteratively solving Eqs. (42) and (40). Then the following properties hold:
1) and .
2) and .
Remark 8
From Eq. (30), the unknowns also satisfy the integral closed form
[TABLE]
The update (39) is precisely the discrete normal equation obtained by stacking samples of and . It can therefore be viewed as a Riemann‑sum or least‑squares approximation of (34). Importantly, Eq. (39) is derived directly from the expected identity (30), rather than by numerically integrating trajectories to approximate (34). This avoids bias, which can be non‑negligible when the integral window is short or sampling is coarse.
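The role of the normal equation can be seen in a generic two-unknown least-squares problem, where stacked sample rows determine the parameters uniquely once the regressor matrix has full column rank. The sketch below is not Eq. (39); the function name, the tolerance, and the data are hypothetical.

```python
def solve_normal_equations(Phi, y):
    """Least-squares theta minimizing ||Phi theta - y||^2 for two unknowns."""
    # Entries of Phi^T Phi and Phi^T y accumulated over the stacked samples.
    g11 = sum(row[0] * row[0] for row in Phi)
    g12 = sum(row[0] * row[1] for row in Phi)
    g22 = sum(row[1] * row[1] for row in Phi)
    h1 = sum(row[0] * v for row, v in zip(Phi, y))
    h2 = sum(row[1] * v for row, v in zip(Phi, y))
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("regressor matrix lacks full column rank")
    # Closed-form inverse of the 2x2 Gram matrix applied to Phi^T y.
    return ((g22 * h1 - g12 * h2) / det, (g11 * h2 - g12 * h1) / det)

# Hypothetical noise-free data generated by theta = (3, -2).
Phi = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
y = [3.0 * p[0] - 2.0 * p[1] for p in Phi]
theta = solve_normal_equations(Phi, y)
```

When the rows are collinear, the determinant vanishes and the solve is rejected, which is the failure mode that the rank condition rules out.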
After obtaining the optimal FB gain and matrix , we proceed to develop a data-based equation for computing the optimal FF gain . To this end, define the expected values . The dynamics of satisfy .
Using Eqs. (13) and (15), the time derivative of the term can be expressed as
[TABLE]
By integrating Eq. (35) over along the dynamics of and , and rearranging terms, we obtain
[TABLE]
where , and is given by
[TABLE]
To facilitate a matrix-based formulation, we define the data-based matrices, , , , and . We then construct the matrix , which has the following structure
[TABLE]
Therefore, the linear matrix form of Eq. (36) becomes
[TABLE]
Lemma 4
If there exists such that for all the following rank condition holds
[TABLE]
then has full column rank.
The proof is similar to that of Lemma 3 and is omitted for brevity. As a result, under this condition, the unknown parameters can be uniquely determined by solving Eq. (43).
Remark 9
As summarized in Algorithm 2, the method consists of two main components: the FB off-policy RL (lines 3–13) and the FF computation (line 15). These components are executed sequentially while utilizing the same dataset. In contrast to most existing methods that reformulate tracking problems into regulation problems of dimension [19, 20, 39], our approach decouples the FB and FF control design, significantly reducing the computational burden. Moreover, the convergence analysis of the iterative procedure relies on the asymptotic mean-square stability of the parameterized system’s closed-loop dynamics, rather than that of the original system.
3.4 Shadow system-based RL (case )
In this subsection, we consider the special case where the matrix vanishes (). Under this condition, the term consistently vanishes, simplifying the iterative procedure by removing its calculation. Simultaneously, the term reduces to .
As previously noted in Remark 9, although the transformed dynamics are asymptotically mean-square stable, the original system may still be inherently unstable. Therefore, injecting probing noise could amplify unbounded trajectories, posing a practical risk in real-world control applications. Conversely, as emphasized in Remark 5, the input is specifically designed to include only probing noise due to this mean-square stability. In this regard, an important question arises: What would happen if no probing noise were injected into the input channel, i.e., if ?
To analyze this scenario, we rewrite Eqs. (39), (42), and (43) under conditions and , yielding
[TABLE]
Here, becomes , and the columns of , , and are given by
[TABLE]
Since and , the previously stated full-rank conditions (Lemmas 3 and 4) no longer hold. Consequently, Algorithm 2 fails in the absence of probing noise.
However, certain partially model-free, on-policy methods ([19, 46]) circumvent this limitation by resetting initial states. This observation suggests a possible solution: would knowing matrix solve this issue?
Inspired by this insight, we propose a new data-driven algorithm employing shadow systems to maintain an off-policy framework. Specifically, given a known , we define a shadow system, , for FB learning as
[TABLE]
and a shadow system, , for FF computing as
[TABLE]
From these systems, we form matrices and , with components given by
[TABLE]
, . These matrices are embedded into the original data-based matrices:
[TABLE]
The next result confirms the effectiveness of the shadow system approach. The proof follows the same structure as that of Lemma 3 and is omitted here for brevity. Our focus is on discussing the advantages of introducing shadow systems.
Corollary 1
Let and . If there exists such that for all ,
[TABLE]
then the matrices , , and have full column ranks.
As a result, unknown parameters can be uniquely determined by
[TABLE]
Combining the uniqueness result with the derivations of Eqs. (52), (53), and (54), we can conclude that the shadow system-based solutions are equivalent to their model-based counterparts. Therefore, using expectation approximations, the estimated version of shadow system-based equations can be substituted into Algorithm 2, which results in a shadow system-based optimal tracking algorithm. The structure of this algorithm is illustrated in Fig. 1.
Remark 10
The key advantages of the proposed approach are summarized below:
- 1). The matrices and depend on the shadow systems, offering flexibility in generating excitation signals. The matrix can be chosen to make the pair controllable. This allows to be designed to excite the full system modes and achieve the required rank conditions. Additionally, the rank conditions of and can be readily achieved by resetting the system’s initial states.
- 2). In contrast to partially model-free methods (e.g. [19, 46]), which require applying an updated control policy to the physical system to collect data at each iteration, the proposed method is off-policy. It only requires a single data collection phase prior to learning. Furthermore, the method enables the simultaneous update of both the value function and the control gain.
- 3). During learning, neither control input/probing noise nor shadow trajectories are injected into the physical system. Shadow trajectories enter only algebraically in the iterative equations to satisfy IE, avoiding any amplification of potentially unstable on-plant behavior.
4 Numerical Simulations
This section presents two numerical examples to validate the effectiveness, accuracy, and robustness of the proposed data-driven algorithms for SLQTP. In both cases, the system dynamics incorporate stochastic diffusion, modeled as multiplicative noise, while the drift components are represented by two well-known benchmark systems: a spring-mass-damper system (see [20]) and a two-mass-spring system (see [27]). The simulation results are organized into three parts:
- 1). Demonstration of the algorithm’s capability to learn a stabilizing FB gain directly from data, without requiring explicit system dynamics. Furthermore, the convergence and robustness properties are verified.
- 2). Evaluation of the flexibility and accuracy of the FF computation by applying the same learned FB gain to track multiple desired trajectories. Moreover, comparative analysis against the LQTP model-based algorithm () highlights the importance of considering multiplicative noise to ensure accurate tracking.
- 3). Examination of the shadow system-based learning method. Results confirm that the rank conditions necessary for FB learning and FF computation are satisfied by generating appropriate trajectories using shadow systems, without injecting probing noise into the original input channel.
4.1 Numerical example I
The spring-mass-damper system with multiplicative noise is characterized by the following parameters:
[TABLE]
The quadratic cost function employs weight matrices and . A collection of desired output trajectories is generated by a time-invariant linear system
[TABLE]
Eight different output maps for each reference case are detailed in Table 1.
Algorithm 2 parameters include , , , , and . The control input includes probing noise given by [8, 37] as , where are randomly selected. Data is collected over the interval , with sampling period and integration window . The rank conditions (22) and (38) are satisfied at the end of this phase. In order to distinguish the results of data-driven algorithms from those of model-based algorithms, the results of data-driven algorithms are represented by in the following text.
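For illustration, a probing signal of the sum-of-sinusoids type, a common choice in the adaptive dynamic programming literature (e.g., [8]), can be generated as in the sketch below. The number of terms, the frequency range, and the amplitude are assumptions made for this example only, since the values used in the experiment are not reproduced here; `make_probing_noise` is a hypothetical helper name.

```python
import numpy as np

def make_probing_noise(n_terms=100, freq_range=(-50.0, 50.0), amplitude=1.0, seed=0):
    """Build e(t) = amplitude * sum_i sin(w_i * t) with random frequencies w_i.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(freq_range[0], freq_range[1], size=n_terms)

    def e(t):
        # Works for a scalar time or an array of times.
        t = np.asarray(t, dtype=float)
        return amplitude * np.sin(np.multiply.outer(t, freqs)).sum(axis=-1)

    return e

e = make_probing_noise()
t_grid = np.linspace(0.0, 2.0, 201)   # sample times (assumed horizon)
samples = e(t_grid)                   # one probing-noise value per sample time
```

Fixing the seed keeps the randomly selected frequencies reproducible across data-collection runs.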
The iterative procedure starts with . During Phase I, the iterations continue until the parameter reaches the threshold . As shown in Fig. 2 (c), this condition is first met at iteration , resulting in a stabilizing FB gain matrix such that . The iterative process converges at iteration , with the final gain . The evolution of and is illustrated in Fig. 2 (a)-(b). The accuracy of the learned gains is assessed by comparing the data-driven results with those of Algorithm 1 (used as ground truth). The error norms at each iteration are provided in Fig. 2 (d). The final errors are small; the minor discrepancies across iterations may result from numerical approximation and data noise. It can be concluded that, as long as these disturbances remain small, the parameters converge to a small neighborhood of the optimal solution.
Using the converged FB gain , FF gains are computed for all eight reference trajectories via equation (43). The results are summarized in Table 2. All FF gains are computed based on the same dataset from . To assess tracking performance, two reference tracking scenarios are constructed:
- Scenario 1: , with the last segment lasting seconds and the others seconds each.
- Scenario 2: , each lasting 5 seconds.
Here, the superscript represents case index. Fig. 3 shows that in both scenarios, the system achieves fast and accurate tracking over a 25-second horizon.
Additionally, to benchmark the SLQTP tracker’s effectiveness, we compare it with the LQTP tracker (which assumes no stochastic diffusion, i.e., and ). For the tracking task of case 8, Fig. 4 (a) shows the SLQTP result, while Fig. 4 (b) shows the LQTP result. Although both trackers stabilize the system, the SLQTP tracker significantly outperforms the LQTP tracker, keeping the output trajectories close to the desired reference. Thus, it is important to account for multiplicative noise to ensure accurate tracking.
4.2 Numerical example II
In this example, we evaluate the performance of the shadow system-based learning method in a setting without probing noise. The controlled system under consideration is given by
[TABLE]
The weight matrices of the quadratic cost function are and , and the desired trajectories are the same as those in Example I. Set .
To satisfy the rank conditions (50) and (51) in the absence of external excitation, auxiliary systems are introduced. Initial states are reset once and state trajectories are collected over the interval , with each trajectory spanning . The auxiliary systems are designed as
[TABLE]
The matrix is Hurwitz and controllable with , and is marginally stable. All other algorithm parameters remain the same as in Example I.
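The properties required of the auxiliary matrices (Hurwitz, controllable, marginally stable) can be verified numerically. The sketch below is illustrative only: the matrices `A_s`, `B_s`, and `S` are hypothetical placeholders, not the auxiliary systems used in this example, and the helper names are likewise assumptions.

```python
import numpy as np

def is_hurwitz(A, tol=1e-9):
    """True if every eigenvalue of A has real part below -tol."""
    return bool(np.max(np.linalg.eigvals(A).real) < -tol)

def is_controllable(A, B):
    """Kalman rank test: rank [B, AB, ..., A^(n-1) B] == n."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return bool(np.linalg.matrix_rank(np.hstack(blocks)) == n)

# Placeholder matrices (assumptions), chosen only to exercise the checks.
A_s = np.array([[0.0, 1.0], [-2.0, -3.0]])   # eigenvalues -1 and -2
B_s = np.array([[0.0], [1.0]])
S = np.array([[0.0, 1.0], [-1.0, 0.0]])      # eigenvalues +1j and -1j

hurwitz_ok = is_hurwitz(A_s)
controllable_ok = is_controllable(A_s, B_s)
# Simple eigenvalues on the imaginary axis indicate marginal stability.
marginal_ok = bool(np.allclose(np.linalg.eigvals(S).real, 0.0))
```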
Starting from and , the parameter meets the threshold at iteration , resulting in a stabilizing gain . The algorithm proceeds and converges by iteration , as shown in Fig. 5 (a)-(b). To evaluate the accuracy of the learning process, we compare the parameters and against model-based results. The error norms, provided in Fig. 5 (c), remain low throughout, demonstrating the algorithm’s robustness. Furthermore, tracking performance is validated using Scenario 2 from Example I. The tracking results, shown in Fig. 5 (d), indicate that the learned control policy achieves accurate and rapid tracking.
5 Conclusion
This paper introduced a B-PI-based optimal tracking framework for the SLQTP in the presence of multiplicative noise. The proposed data-driven method removes the traditional requirement for an initial stabilizing control gain and explicit knowledge of the dynamical model. Furthermore, a shadow system-based optimal tracking algorithm was developed to eliminate the requirement for probing noise by assuming that volatility depends only on the system state and that the input-to-state matrix is given. The off-policy RL algorithm in this methodology can be directly applied to the stochastic linear quadratic regulation problem.
.1 Proof of Lemma 1
1). Given , we first consider Eq. (16) with . According to [47, Theorem 3.2.3], this equation admits a unique solution given by
[TABLE]
where satisfies the matrix SDE
[TABLE]
Since also implies exponential mean-square stability of the above system due to time invariance (see [48]), and since , it follows from Eq. (59) that
[TABLE]
To establish that and , we substitute the definition of from (17) into Eq. (16) and complete the squares to obtain
[TABLE]
where and . Rewriting with yields
[TABLE]
Since , , and , it clearly implies that . Moreover, we get
[TABLE]
[TABLE]
By [31, Theorem 1], we conclude that , and thus .
2). Since , we subtract Eq. (14) from Eq. (16) with and to obtain
[TABLE]
where . Together with , this equation implies that .
Next, we aim to show that . Rewriting Eq. (62) using gives
[TABLE]
where . According to [34, Theorem 3.2], since and hold, it is sufficient to prove the exact detectability of the pair to conclude that .
Suppose, for contradiction, that the pair is not exactly detectable. Then, by the stochastic Popov–Belevitch–Hautus criterion ([34, Theorem 3.1]), there exists a nonzero symmetric matrix such that
[TABLE]
From , it follows , which leads to . But since , we have , which contradicts . Thus, the exact detectability of holds. This completes the proof.
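The exponential mean-square stability used in the proof above can be tested numerically for a given closed-loop pair. The sketch below is a simplified illustration that assumes a single scalar Wiener process and closed-loop matrices `A_cl` and `C_cl` (hypothetical names) in `dx = A_cl x dt + C_cl x dw`. Under that assumption the second moment `X = E[x x^T]` obeys `dX/dt = A_cl X + X A_cl^T + C_cl X C_cl^T`, and mean-square stability is equivalent to the vectorized operator having all eigenvalues in the open left half-plane.

```python
import numpy as np

def is_mean_square_stable(A_cl, C_cl, tol=1e-9):
    """Check mean-square stability of dx = A_cl x dt + C_cl x dw.

    Assumes one scalar Wiener process w. The test is whether the Kronecker
    form of the second-moment operator has a negative spectral abscissa.
    """
    n = A_cl.shape[0]
    I = np.eye(n)
    L = np.kron(I, A_cl) + np.kron(A_cl, I) + np.kron(C_cl, C_cl)
    return bool(np.max(np.linalg.eigvals(L).real) < -tol)

# Scalar illustration: the operator reduces to 2*a + c**2.
a = np.array([[-1.0]])
stable_small_noise = is_mean_square_stable(a, np.array([[1.0]]))   # 2*(-1) + 1 = -1
stable_large_noise = is_mean_square_stable(a, np.array([[2.0]]))   # 2*(-1) + 4 = 2
```

The scalar case shows that a Hurwitz drift alone does not guarantee mean-square stability once the multiplicative noise is strong enough.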
.2 Proof of Lemma 3
We prove the result by contradiction. Suppose there exists a nonzero vector , such that . Let , with , and .
Using , the definition of , and Eq. (21), we obtain
[TABLE]
Next, applying Itô’s formula to along with (20), substituting , and rearranging the terms yields
[TABLE]
where , , and
We then integrate over multiple time intervals , take expectations, use the Kronecker product representation, and construct the matrix form of the obtained linear equations. This gives
[TABLE]
Adding Eqs. (64) and (65) results in
[TABLE]
where , , .
By the rank condition (22) of Assumption 3, the matrix . It follows from Eq. (66) that , , and . As a result, , together with , which implies that the only solution is . Substituting into and yields and . Hence, , contradicting the assumption . Therefore, the matrix has full column rank.
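The step from quadratic forms to a linear system via the Kronecker-type representation can be illustrated in a static setting. The sketch below is an assumption-laden toy: it uses plain sample states instead of expectations of integrals over time intervals, and the helpers `quad_features` and `sym_to_vec` are hypothetical names. It shows that `x^T P x` is linear in the upper-triangular entries of a symmetric `P`, so that a regressor with full column rank determines `P` uniquely.

```python
import numpy as np

def quad_features(x):
    """Monomials x_i * x_j for i <= j, with off-diagonal terms doubled."""
    iu = np.triu_indices(x.size)
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)
    return scale * np.outer(x, x)[iu]

def sym_to_vec(P):
    """Stack the upper-triangular entries of a symmetric matrix."""
    return P[np.triu_indices(P.shape[0])]

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)                          # symmetric positive definite test matrix
X = rng.standard_normal((10, n))                 # 10 synthetic sample states
Phi = np.vstack([quad_features(x) for x in X])   # 10 x 6 regressor matrix
y = np.array([x @ P @ x for x in X])             # quadratic forms x^T P x
p_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # unique when rank(Phi) = n(n+1)/2
```

Here `n(n+1)/2 = 6` unknowns are recovered from 10 samples; with fewer than 6 independent rows the rank condition fails and `P` is no longer identifiable from the data.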
.3 Proof of Theorem 2
1). We first show that , , and hold for all by mathematical induction.
- i) Suppose . Given , we have (Lemma 2). By Lemma 1 (Statement 1), there exists a unique positive definite solution to Eq. (23), and the solutions and of Eqs. (24) and (25), respectively, satisfy and .
- ii) Assume that, for some , holds. Applying Lemma 1 (Statement 1) again at step , Eq. (23) admits a unique solution , and the updated feedback gain belongs to with .
Therefore, the sequence is strictly increasing and bounded above by . By construction, the stopping index satisfies and . Also note that since each is bounded, there exist positive constants such that and , . Applying these bounds to Eq. (25), we obtain . A summation from to gives
[TABLE]
which implies that
[TABLE]
This proves the finiteness of . Then, at iteration , Eq. (23) becomes
[TABLE]
Express it in terms of the original system and as
[TABLE]
where . The positive definiteness of directly follows from Eq. (25) and the facts that and . Finally, since , by [31, Theorem 1], it can be immediately concluded that .
2). Next, we verify that for all , and by mathematical induction.
- i) Suppose . Since , based on Lemma 1 (Statement 2), we know that Eq. (26) admits a unique solution and Eq. (24) gives .
- ii) Assume that, for some , holds. Applying Lemma 1 (Statement 2) again at step , the solutions to Eqs. (26) and (24) satisfy and .
To establish the convergence result, we derive a recursive identity from Eqs. (26) and (24) that
[TABLE]
Since has been established, this equation implies for all . As a result, the sequence is monotonically decreasing and bounded below by . Therefore, the limit exists, i.e., there exists such that . Under Assumption 1, the solution of (14) is unique and satisfies (26) along with satisfying (24). Thus, it can be concluded that and .
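The monotone convergence argued above can be observed numerically with a model-based policy iteration. The sketch below is a hedged illustration, not the paper's algorithm: it assumes the standard stochastic LQ setting `dx = (A x + B u) dt + (C x + D u) dw` with one scalar Wiener process, cost weights `Q` and `R`, and feedback `u = -K x`; it covers only the standard iteration from a stabilizing gain, not the bootstrap phase. All matrices and function names are hypothetical, and the notation may differ from Eqs. (24) and (26).

```python
import numpy as np

def evaluate_policy(A, B, C, D, Q, R, K):
    """Solve Ak^T P + P Ak + Ck^T P Ck + Q + K^T R K = 0 for P (row-major vec)."""
    n = A.shape[0]
    Ak, Ck = A - B @ K, C - D @ K
    I = np.eye(n)
    L = np.kron(Ak.T, I) + np.kron(I, Ak.T) + np.kron(Ck.T, Ck.T)
    rhs = -(Q + K.T @ R @ K).reshape(-1)
    P = np.linalg.solve(L, rhs).reshape(n, n)
    return 0.5 * (P + P.T)                       # remove round-off asymmetry

def improve_policy(B, C, D, R, P):
    """Gain update K = (R + D^T P D)^{-1} (B^T P + D^T P C)."""
    return np.linalg.solve(R + D.T @ P @ D, B.T @ P + D.T @ P @ C)

def policy_iteration(A, B, C, D, Q, R, K0, iters=30):
    K = K0
    for _ in range(iters):
        P = evaluate_policy(A, B, C, D, Q, R, K)
        K = improve_policy(B, C, D, R, P)
    return P, K

def sare_residual(A, B, C, D, Q, R, P):
    """Residual of the stochastic algebraic Riccati equation at P."""
    G = B.T @ P + D.T @ P @ C
    return A.T @ P + P @ A + C.T @ P @ C + Q - G.T @ np.linalg.solve(R + D.T @ P @ D, G)

# Placeholder data (assumptions); K0 = 0 is mean-square stabilizing for this pair.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = 0.1 * np.eye(2)
D = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
P_star, K_star = policy_iteration(A, B, C, D, Q, R, K0=np.zeros((1, 2)))
res = np.max(np.abs(sare_residual(A, B, C, D, Q, R, P_star)))
```

The residual `res` measures how closely the limit satisfies the Riccati equation; the Kronecker solve is adequate for small state dimensions, while larger problems would call for a dedicated generalized Lyapunov solver.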
.4 Proof of Theorem 3
1). Assume . Then the unique solution triplet to Eqs. (23)-(25) also satisfies (39), (40), and (25). On the other hand, Lemma 3 ensures that Eq. (39) has a unique solution under Assumption 3. Therefore, the iterative procedure defined by Eqs. (39), (40), and (25) is equivalent to solving (23)-(25). From the proof of Theorem 2 (Statement 1), the condition is verified for all , starting from . Hence, the conclusion of property 1 follows directly from Theorem 2 (Statement 1).
2). We now assume that holds. Under this assumption, Lemma 3 ensures that Eq. (42) admits a unique solution under Assumption 3. Moreover, the unique solution pair obtained from Eqs. (26) and (24) also satisfies (42) and (40), which implies that the policy iteration defined by (42) and (40) is equivalent to solving (26) and (24). Using this result and Theorem 2 (Statement 2), we immediately conclude that for all , and thus the convergence results are established.
References
- [1] R. S. Sutton, A. G. Barto et al., Reinforcement learning: An introduction. MIT Press Cambridge, 1998, vol. 1, no. 1.
- [2] D. Bertsekas, Reinforcement learning and optimal control. Athena Scientific, 2019, vol. 1.
- [3] P. Lancaster and L. Rodman, Algebraic Riccati equations. Clarendon Press, 1995.
- [4] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, “Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming,” IEEE Transactions on Automation Science and Engineering, vol. 9, no. 3, pp. 628–634, 2012.
- [5] T. Bian and Z.-P. Jiang, “Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design,” Automatica, vol. 71, pp. 348–360, 2016.
- [6] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Transactions on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968.
- [7] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
- [8] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
