Sumit Kumar Jha and Shubhendu Bhasin
Abstract—In this paper, an adaptive linear quadratic regulator (LQR) is proposed for continuous-time systems with uncertain dynamics. The dynamic state-feedback controller uses input-output data along the system trajectory to continuously adapt and converge to the optimal controller. The result differs from previous results in that the adaptive optimal controller is designed without the knowledge of the system dynamics and an initial stabilizing policy. Further, the controller is updated continuously using input-output data, as opposed to the commonly used switched/intermittent updates which can potentially lead to stability issues. An online state derivative estimator facilitates the design of a model-free controller. Gradient-based update laws are developed for online estimation of the optimal gain. Uniform exponential stability of the closed-loop system is established using a Lyapunov-based analysis, and a simulation example is provided to validate the theoretical contribution.
I. INTRODUCTION

THE development of the infinite-horizon linear quadratic regulator (LQR) [1] has been one of the most important contributions in linear optimal control theory. The optimal control law for the LQR problem is expressed in state-feedback form, where the optimal gain is obtained from the solution of a nonlinear matrix equation, the algebraic Riccati equation (ARE). The solution of the ARE requires exact knowledge of the system matrices and is typically found offline, a major impediment to online real-time control.
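For context, the classical model-based route just described can be summarized in a few lines. The sketch below (using SciPy, with a small hypothetical system chosen only for illustration) solves the ARE offline and forms the optimal gain K* = R^{-1}B^T P*:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Classical model-based LQR: solve the ARE offline and form K* = R^{-1} B^T P*."""
    P = solve_continuous_are(A, B, Q, R)   # solves A'P + PA + Q - P B R^{-1} B' P = 0
    K = np.linalg.solve(R, B.T @ P)        # K* = R^{-1} B^T P*
    return K, P

# Illustrative (hypothetical) second-order system; any controllable (A, B) works here.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
K_star, P_star = lqr_gain(A, B, Q=np.eye(2), R=np.eye(1))
```

This computation requires exact knowledge of A and B, which is precisely what the present paper seeks to avoid.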
Recent research has focused on solving the optimal control problem using iterative, data-driven algorithms which can be implemented online and require minimal knowledge of the system dynamics [2]–[15]. In [2], Kleinman proposed a computationally efficient procedure for solving the ARE by iterating on the solution of the linear Lyapunov equation, with proven convergence to the optimal policy for any initial condition. The Newton-Kleinman algorithm [2], although offline and model-based, paved the way for a class of reinforcement learning (RL)/approximate dynamic programming (ADP)-based algorithms which utilize data along the system trajectory to learn the optimal policy [4], [7], [10], [16]–[18]. Strong connections between RL/ADP and optimal control have been established [19]–[23], and several RL algorithms including policy iteration (PI), value iteration (VI) and Q-learning have been adapted for optimal control problems [4], [7]–[9], [13], [22], [24]. Initial research on adaptive optimal control was mostly concentrated in the discrete-time domain due to the recursive nature of RL/ADP algorithms. An important contribution in [4] is the development of a model-free PI algorithm using Q-functions for discrete-time adaptive linear quadratic control. The iterative RL/ADP algorithms have since been applied to various discrete-time optimal control problems [25]–[27].
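The Newton-Kleinman iteration mentioned above reduces the nonlinear ARE to a sequence of linear Lyapunov equations. A minimal sketch, assuming a stabilizing initial gain K0 is available (which is exactly the assumption the present paper removes):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman_iteration(A, B, Q, R, K0, iters=20):
    """Newton-Kleinman policy iteration: each step solves a linear Lyapunov equation
    for the cost matrix of the current policy, then improves the policy."""
    K = K0  # must be stabilizing: eigenvalues of (A - B K0) in the open left half-plane
    for _ in range(iters):
        Ak = A - B @ K
        # Policy evaluation: Ak' P + P Ak + Q + K' R K = 0
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Policy improvement: K <- R^{-1} B' P
        K = np.linalg.solve(R, B.T @ P)
    return K, P
```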
Extension to continuous-time systems entails challenges in controller development and convergence/stability proofs. One of the first adaptive optimal controllers for continuous-time systems is proposed in [17], where a model-based algorithm is designed using a continuous-time version of the temporal difference (TD) error. Model-free RL algorithms for continuous-time systems are proposed in [22], which require measurement of the state derivatives. In Chapter 7 of [3], an indirect adaptive optimal linear quadratic (ALQ) controller is proposed, where the unknown system parameters are identified using an online adaptive update law, and the ARE is solved at every time instant using the current parameter estimates. However, the algorithm may become computationally prohibitive for higher dimensional systems, owing to the need for solving the ARE at every time instant. More recently, partially model-free PI algorithms are developed in [7], [24] for linear systems with unknown internal dynamics. In [9], [10], the idea in [7] is extended to adaptive optimal control of linear systems with completely unknown dynamics. In another significant contribution [6], the connections between Q-learning and Pontryagin's minimum principle are established, based on which an off-policy control algorithm is proposed.
A common feature of RL algorithms adapted for continuous-time systems is the requirement of an initial stabilizing policy [7], [9], [10], [18], [24], and a batch least squares estimation algorithm leading to intermittent updates of the control policy [7], [9]. Finding an initial stabilizing policy for systems with unknown dynamics may not always be possible. Further, the intermittent control policy updates in [7], [9], [18] render the control law discontinuous, potentially leading to challenges in proving stability. Moreover, many adaptive optimal control algorithms require the implementation of delayed-window integrals to construct the regressor/design the update laws [5], [7], [9], [14], and an "intelligent" data storage mechanism (a procedure for populating an independent set of data) [5], [7], [9], [10] to satisfy an underlying full-rank condition. The computation of delayed-window integrals of functions of the states requires past data storage for the time interval [t − T, t], ∀t > 0, where t and T are the current time instant and the window length, respectively, which demands significant memory consumption, especially for large scale systems.
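For a rough sense of the storage cost of such delayed-window integrals (the numbers are illustrative assumptions, not values taken from the cited works): a regressor with p entries sampled at rate f_s over a window of length T requires on the order of

\[
p \cdot f_s \cdot T \ \text{stored values}, \qquad \text{e.g., } p = 100,\ f_s = 1\ \text{kHz},\ T = 2\ \text{s} \;\Rightarrow\; 2\times 10^{5}\ \text{values}
\]

to be retained in memory at every time instant.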
Recent works in [8], [11], [13] have cast the continuous-time RL problem in an adaptive control framework with continuous policy updates, without the need for an initial stabilizing policy. However, for continuous-time RL, it is not straightforward to develop a fixed-point equation for the parameter updates which is independent of the knowledge of the system dynamics and state derivatives. A synchronous PI algorithm for known system dynamics is developed in [8], which is extended to a partially model-free method using a novel actor-critic-identifier architecture [11]. For input-constrained systems with completely unknown dynamics, a PI and neural network (NN) based adaptive control algorithm is proposed in [13]. However, the work in [13] utilizes past stored data along with the current data for the identifier design, while guaranteeing bounded convergence of the critic weight estimation error for bounded NN reconstruction error.
The contribution of this paper is the design of a continuous-time adaptive LQR with a time-varying state-feedback gain, which is shown to exponentially converge to the optimal gain. The novelty of the proposed result lies in the computation- and memory-efficient algorithm used to solve the optimal control problem for uncertain dynamics, without requiring an initial stabilizing control policy, unlike previous results which either use an initial stabilizing control policy and a switched policy update [5], [7], [9], [10], or past data storage [5], [7], [9], [10], [28], [29], or memory-intensive delayed-window integrals [5], [7], [9], [14]. The result in this paper is facilitated by the development of a fixed point equation which is independent of the system matrices, and the design of a state derivative estimator. A gradient-based update law is devised for online adaptation of the state-feedback gain, and convergence to the optimal gain is shown, provided a uniform persistence of excitation (u-PE) condition [30], [31] on the state-dependent regressor is satisfied. The u-PE condition, although restrictive in its verification and implementation, establishes the theoretical requirements for convergence of the adaptive linear quadratic controller proposed in the paper. A Lyapunov analysis is used to prove uniform exponential stability of the overall system.
This paper is organized as follows. Section II discusses the primary concepts of linear optimal control, the problem formulation, and subsequently the general methodology. The proposed model-free adaptive optimal control design along with the state derivative estimator is described in Section III. Convergence and exponential stability of the proposed result are shown in Section IV. Finally, an illustrative example is given in Section V.
Notations: Throughout this paper, R is used to denote the set of real numbers. The operator ||·|| designates the Euclidean norm for vectors and the induced norm for matrices. The symbol ⊗ denotes the Kronecker product operator, and vec(Z) ∈ R^{qr} denotes the vectorization of the argument matrix Z ∈ R^{q×r}, obtained by stacking the columns of the argument matrix on top of one another. The operators λmin(·) and λmax(·) denote the minimum and maximum eigenvalues of the argument matrix, respectively. The symbol B_d denotes the open ball B_d = {z ∈ R^{n(n+m)} : ||z|| < d}.

II. PRELIMINARIES AND PROBLEM FORMULATION

The following properties of the Kronecker product and the vec operator are used throughout the paper:
1) vec(DEF) = (F^T ⊗ D) vec(E), where the matrix multiplication (DEF) is defined.
2) vec(D + E + F) = vec(D) + vec(E) + vec(F), where the matrix summation (D + E + F) is defined.
The property a^T D b = (b^T ⊗ a^T) vec(D), where a, b are vectors, D is a matrix and the multiplication (a^T D b) is defined, has also been used.

Consider a continuous-time deterministic LTI system given as
ẋ(t) = Ax(t) + Bu(t)    (1)
where x(t) ∈ R^n denotes the state and u(t) ∈ R^m denotes the control input. A ∈ R^{n×n} and B ∈ R^{n×m} are constant unknown matrices, and the pair (A, B) is assumed to be controllable.

The infinite-horizon quadratic value function can be defined as the total cost starting from state x(t) and following a fixed control action u(t) from time t onwards as
V(x(t)) = ∫_t^∞ (x^T(τ)Qx(τ) + u^T(τ)Ru(τ)) dτ    (2)
where Q ∈ R^{n×n} is symmetric positive semi-definite with (Q, A) being observable, and R ∈ R^{m×m} is a positive definite matrix.

When A and B are accurately known, the standard LQR problem is to find the optimal policy by minimizing the value function (2) with respect to the policy u, which yields the state-feedback law
u*(t) = −K*x(t)    (3)
where K* = R^{-1}B^T P* ∈ R^{m×n} is the optimal control gain matrix and P* ∈ R^{n×n} is the constant positive definite matrix solution of the ARE [32]
A^T P* + P*A + Q − P*BR^{-1}B^T P* = 0.    (4)

Remark 1: Solving the ARE for P* requires knowledge of the system matrices A and B; in the case where information about A and B is unavailable, it is challenging to determine P* and K* online.

The following assumptions are required to facilitate the subsequent design.

Assumption 1: The optimal Riccati matrix P* is upper bounded as ∥P*∥ ≤ α1, where α1 is a known positive scalar constant.

Assumption 2: The optimal gain matrix K* is upper bounded as ∥K*∥ ≤ α2, where α2 is a known positive scalar constant.

III. OPTIMAL CONTROL DESIGN FOR COMPLETELY UNKNOWN LTI SYSTEMS

For the linear system in (1), the optimal value function can be written as a quadratic function [33]
V*(x) = x^T P* x.    (5)
To facilitate the development of the model-free LQR, differentiate (5) with respect to time and use the system dynamics (1) to obtain (6). Using (4), (6) reduces to (7). By considering (5), the LHS of (7) can be rewritten and substituted into (7) to obtain (8). The expression in (8) acts as the fixed point equation used to define D ∈ R as the difference between the LHS and the RHS of (8), as given in (9).

Remark 2: The motivation behind the formulation of (9) is to represent the fixed point equation in a model-free way, without using memory-intensive delayed-window integrals, and to subsequently design a parameter estimation algorithm to learn P* and K* without knowledge of the system matrices A and B.

In (9), P* and K* are unknown parameter matrices, and the objective is to estimate these parameters using gradient-based update laws.
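The chain of steps (5)–(9) can be sketched using only the standard LQR relations stated above; the exact displayed forms and symbols of the original derivation may differ from this reconstruction:

\[
\dot V^{*} = \dot x^{T}P^{*}x + x^{T}P^{*}\dot x = x^{T}\big(A^{T}P^{*} + P^{*}A\big)x + 2u^{T}B^{T}P^{*}x \qquad (6)
\]

Substituting the ARE (4), i.e., A^T P* + P*A = P*BR^{-1}B^T P* − Q = K*^T R K* − Q, together with B^T P* = RK*, gives

\[
\dot V^{*} = x^{T}\big(K^{*T}RK^{*} - Q\big)x + 2u^{T}RK^{*}x. \qquad (7)
\]

Writing the left-hand side as 2x^T P* ẋ, from (5), yields the fixed point relation

\[
2x^{T}P^{*}\dot x = x^{T}K^{*T}RK^{*}x - x^{T}Qx + 2u^{T}RK^{*}x, \qquad (8)
\]

which involves only the measured signals (x, u), the state derivative ẋ, and the unknown matrices P* and K*; the scalar error D in (9) is then the difference between the two sides of (8) evaluated at the current parameter estimates and the estimated state derivative.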
The gradient-based update laws are developed to minimize the squared error Ξ ∈ R defined as Ξ = E²/2, where E ∈ R is the Bellman error in (11). The update laws for the parameters to be estimated involve the adaptation gains ν ∈ R+ and νk ∈ R+. Substituting the gradients of Ξ with respect to the parameter estimates, the normalized update laws are given in (12) and (13). The continuous policy update is given as
u(t) = −K̂(t)x(t).    (14)

The design of the state derivative estimator, mentioned in (11) and (12), is facilitated by expressing the system dynamics (1) in the linear-in-the-parameters (LIP) form
ẋ = Y(x, u)θ    (15)
where Y(x, u) ∈ R^{n×n(n+m)} is the regressor matrix and θ ∈ R^{n(n+m)} is the unknown parameter vector defined in (16).

Assumption 3: The system parameter vector θ in (16) is upper bounded as ∥θ∥ ≤ a1, where a1 is a known positive constant.

The state derivative estimator is designed as given in (17) and (19), where Γ ∈ R^{n(n+m)×n(n+m)} is the constant positive definite gain matrix.

Lemma 1: The update laws in (17) and (19) ensure that the state estimation and the system parameter estimation error dynamics are Lyapunov stable ∀t ≥ 0.

Proof: Consider a positive-definite Lyapunov function candidate as in (20). Taking the time derivative of (20) and substituting the error dynamics from (18), the expression in (21) is obtained. Since the resulting derivative is negative semi-definite, the Lyapunov function is bounded, which implies that the state estimation and system parameter estimation errors are bounded ∀t ≥ 0. ■

Remark 3: Assumptions 1 and 2 are standard assumptions required for projection-based adaptive algorithms, frequently used in the robust adaptive control literature ([3], Chapter 11 of [36], Chapter 3 of [37], [38]). In fact, in the context of adaptive optimal control, analogous to Assumptions 1 and 2, many existing results [8], [11], [13], [14], [29] assume a known upper bound on the unknown parameters associated with the value function, an essential requirement for proving stability of the closed-loop system. Although the true system parameters (A and B) are unknown, a range of operating values (a compact set containing the true values of the elements of A and B) may be known in many cases from domain knowledge of the plant. By uniformly sampling the known compact set and solving the ARE offline for those samples, a set of Riccati matrices can be obtained, and hence the upper bounds (α1 and α2) assumed in Assumptions 1 and 2 can be conservatively estimated from this set. Moreover, the proposed algorithm serves as an effective approach for cases where it is hard to obtain an initial stabilizing policy for uncertain systems.
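A minimal sketch of one way the LIP form (15) and a gradient-type identifier of the kind used in (17)–(19) can be realized; the regressor ordering, variable names, and gains below are illustrative assumptions and not necessarily those of (16)–(19):

```python
import numpy as np

def lip_regressor(x, u):
    """One standard LIP factorization of A x + B u:
    Y(x, u) = [x^T kron I_n, u^T kron I_n] and theta = [vec(A); vec(B)]
    (column-stacking vec), so that Y(x, u) @ theta = A x + B u."""
    n = x.size
    return np.hstack([np.kron(x.reshape(1, -1), np.eye(n)),
                      np.kron(u.reshape(1, -1), np.eye(n))])

def identifier_step(xhat, theta_hat, x, u, dt, gamma=1.0, k_e=5.0):
    """Gradient-type identifier (illustrative form): the state estimate xhat tracks x,
    supplying an estimate of xdot, while theta_hat is driven by the regressor times
    the state estimation error."""
    Y = lip_regressor(x, u)
    e = x - xhat                          # state estimation error
    xhat_dot = Y @ theta_hat + k_e * e    # estimated state derivative
    theta_hat_dot = gamma * Y.T @ e       # gradient update for theta_hat
    return xhat + dt * xhat_dot, theta_hat + dt * theta_hat_dot
```

The controller parameter updates in (12) and (13) would analogously perform gradient descent on Ξ = E²/2, with the estimated state derivative supplied by the identifier entering the Bellman error.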
IV. CONVERGENCE AND STABILITY

A. Development of Controller Parameter Estimation Error Dynamics

The controller parameter estimation error dynamics can be obtained using (11) and (13) as given in (22). Applying the vec operator to (22) yields (23). Using (15) and (23), the system dynamics in terms of the error state z(t) defined in (24) can be expressed as in (25), where F ∈ R^{n(1+m)} is a vector-valued function containing the right-hand sides of (15) and (23).

Assumption 4: The pair (φk, F) is u-PE, i.e., PE uniformly in the initial conditions (z0, t0), if for each d > 0 there exist ε, δ > 0 such that, ∀(z0, t0) ∈ B_d × [0,∞), all corresponding solutions satisfy the condition in (26) ∀t ≥ t0 [30].

Remark 4: Since the regressor φk(z, t) in (23) is state dependent, the u-PE condition in (26), which is uniform in the initial conditions, is used instead of the classical PE condition, where the regressor is only a function of time and not of the states, e.g., where the objective is identification (Section 2.5 of [39]).

Remark 5: In adaptive control, convergence of the system and control parameter error vectors depends on the excitation of the system regressors. This excitation property, typically known as persistence of excitation (PE), is necessary to achieve perfect identification and adaptation. The PE condition, although restrictive in its verification and implementation, is typically imposed by using a reference input with as many spectral lines as the number of unknown parameters [40]. The u-PE condition mentioned in Assumption 4 may be satisfied by adding a probing exploratory signal to the control input [4], [8], [11], [13], [41]. This signal can be removed once the parameter estimates converge to the optimal control policy, and subsequently exact regulation of the system states will be achieved. Exact regulation of the system states in the presence of a persistently exciting signal can also be achieved by following the method given in [42], in which the PE property is generated in a finite time interval by an asymptotically decaying "rich" feedback law.

The expression in (23) can be represented as the perturbed system in (27). For each d > 0, the dynamics of the nominal system (28) can be shown to be uniformly exponentially stable ∀(z0, t0) ∈ B_d × [0,∞) by using Assumption 4, (25) and Lemma 5 of [31]. Since the nominal system (28) is continuously differentiable and its Jacobian is bounded, it can be shown, by referring to the converse Lyapunov Theorem 4.14 in [43] and the definitions and results in [31], [44], that there exists a Lyapunov function Vc which satisfies the inequalities in (29) for some positive constants d1, d2, d3, d4 ∈ R.

B. Lyapunov Stability Analysis

Theorem 1: If Assumption 4 holds, the adaptive optimal controller (14), along with the parameter update laws (12) and (13) and the state derivative estimators (17) and (19), guarantees that the system states and the controller parameter estimation errors z(t) are uniformly exponentially stable ∀t ≥ 0, provided z(0) ∈ Ω, where the set Ω is defined in the subsequent analysis.¹

¹The initial condition region Ω can be increased by appropriately choosing the user-defined matrices Q, R, and by tuning the design parameters ν, νk and ηk.

Proof: A positive-definite, continuously differentiable Lyapunov function candidate VL : B_d × [0,∞) → R is defined for each d > 0 in terms of V*(x), the optimal value function defined in (5), which is positive definite and continuously differentiable, and Vc, which is defined in (29). Taking the time derivative of VL along the trajectories of (1) and (27), and using (6), (29) and the Rayleigh-Ritz theorem, V̇L can be upper bounded in terms of the known function ρ2(∥z∥) : R → R, defined as ρ2(∥z∥) = 2l2∥z∥²/d3, which is positive, globally invertible and non-decreasing, and ν̄ = 1/νk ∈ R. By using (24), the bound in (34) can be further expressed as (35). Using (5), (24) and (29), the Lyapunov function candidate VL can be bounded as in (36), where σ1 and σ2 are positive constants. Using (36), (35) can be expressed as (37). The expression in (37) can be further upper bounded as in (38), where the set Ω is defined accordingly. If z(0) ∈ Ω, then, by examining the solution of (38), the system states and the parameter estimation errors uniformly exponentially converge to the origin. ■

Remark 6: The positive constants d1, d2, d4 in (29) do not appear in the design of the control law (14) or the parameter update law (13) and are only utilized for the stability analysis. As a result, knowing the exact values of these constants is not required in general. However, the quantity d3, which appears in Theorem 1, can be determined by following the procedure given in [43] (for details, see the proof of Theorem 4.14 in [43]).
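As noted in Remark 5 (and used in the simulation study of Section V), the excitation needed for convergence is typically injected by adding a probing signal to the control input. A minimal sketch, with illustrative amplitudes and frequencies that are assumptions rather than values from the paper:

```python
import numpy as np

def exploration_signal(t, t_off=4.0, amp=0.5):
    """Sum of sinusoids with incommensurate frequencies, added to the control input
    and switched off at t_off once the parameter estimates have converged."""
    if t >= t_off:
        return 0.0
    freqs = np.array([1.0, np.sqrt(2.0), np.e, np.pi, 7.3])  # rad/s, illustrative
    return amp * float(np.sum(np.sin(freqs * t)))
```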
Remark 7: Traditionally, the parameter update laws in adaptive control involve user-defined design parameters termed adaptation gains (in this paper, ν and νk defined in (12) and (13), respectively). Typically, these gains determine the convergence rate of the estimation of the unknown parameters. Hence, a careful selection of the gains governs the performance of the designed estimators. However, a large value of the adaptation gain may result in an unstable adaptive system, which can be overcome by introducing "normalization" in the update laws [45]. The normalized estimator in the update law (13) involves the constant tunable gain ηk, which can be chosen in such a way that system stability is maintained in the presence of a high adaptation gain νk.

Remark 8: The estimates of the system matrices A and B, given by (19), are not guaranteed to converge to their true values, since Lemma 1 only proves that the parameter estimation error is bounded. Therefore, solving the ARE in (4) using the estimates of A and B may not yield the optimal parameters P* and K*. Moreover, solving for P* directly from the ARE, which is nonlinear in P*, can be challenging, especially for large scale systems. However, the proposed method utilizes the estimates of A and B in the estimator design for the controller parameters P* and K*. The adaptive update laws in (12) and (13) include the identifier designed in (17), which uses the estimates of A and B. The proposed design is architecturally analogous to [11], [13], [29], where a system identifier is utilized in controller parameter estimation. Also, note that although the system parameter estimates are only guaranteed to be bounded, the controller parameter estimates are proved to be exponentially convergent to the optimal parameters, as shown in Theorem 1.

C. Comparison With Existing Literature

One of the main contributions of the result is that the initial stabilizing policy assumption is not required, unlike the iterative algorithms in [5], [7], [9], [10], where an initial stabilizing policy is assumed to ensure that the subsequent policies remain stabilizing. In contrast, an adaptive control framework is considered in the proposed approach, where the control policies are continuously updated until convergence to the optimal policy. The design of the controller, the parameter update laws and the state derivative estimator ensures exponential stability of the closed-loop system, which is proved using a rigorous Lyapunov-based stability analysis, irrespective of the initial control policy (stabilizing or destabilizing) chosen. Other significant contributions of this paper with respect to the existing literature are highlighted as follows.

The algorithms proposed in [5], [7], [9], [10] require the computation of delayed-window integrals to construct the regressor, and/or an "intelligent" data storage mechanism to satisfy an underlying full-rank condition. Computation of delayed-window integrals requires past data storage for the time interval [t − T, t], ∀t > 0, where t and T are the current time instant and the window length, respectively, which demands significant memory, especially for large scale systems. Unlike [5], [7], [9], [10], the proposed work obviates the requirement of memory-intensive delayed-window integrals and "intelligent" data storage, a definite advantage for large scale systems implemented on embedded hardware.
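As a generic illustration of the normalization discussed in Remark 7, in the standard form used in adaptive control (the regressor φ, error e, and gain names below are generic placeholders, not the exact quantities in (13)):

```python
import numpy as np

def normalized_gradient_step(theta_hat, phi, e, dt, nu=55.0, eta=5.0):
    """Normalized gradient update: dividing by (1 + eta * phi'phi) bounds the effective
    adaptation rate, so a large gain nu does not destabilize the estimator."""
    theta_hat_dot = -nu * phi * e / (1.0 + eta * float(phi @ phi))
    return theta_hat + dt * theta_hat_dot
```

The per-step parameter change remains bounded even when ∥φ∥ grows, which mirrors the role Remark 7 attributes to ηk in (13).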
Although the result in [14] designs an actor-critic architecture based adaptive optimal controller for uncertain LTI systems, it uses a memory-intensive, delayed-window integral based Bellman error (see the error expression for "e" defined below (17) in [14]) to tune the critic weight estimates Ŵc. Unlike [14], the proposed algorithm uses an online state derivative estimator to obviate the need for past data storage in the control parameter estimation, by strategically formulating the Bellman error "E" in (11) to be independent of delayed-window integrals. Further, an exponential stability result is obtained using the proposed algorithm, as compared to the asymptotic result achieved in [14].

Recent results in [28], [29] relax the PE condition by concurrently applying past stored data along with the current parameter estimates; however, unlike [28], [29], the proposed result is established for completely uncertain systems without requiring past data storage. Moreover, a stronger exponential regulation result is obtained using the proposed controller as compared to [28], [29].

The proposed result also differs from the ALQ algorithm [3] in that it avoids the computational burden of solving the ARE (with the estimates of A and B) at every iteration, thus also avoiding the restrictive condition on the stabilizability of the estimates of A and B at every iteration.

V. SIMULATION

To verify the effectiveness of the proposed result, the problem of controlling the angular position of the shaft in a DC motor is considered [12]. The plant is modeled as a third order continuous-time LTI system, and its system matrices are taken from [12]. The objective is to find the optimal control policy for the infinite horizon value function (2), where the state and input penalties are taken as Q = I3 and R = 1, respectively. Solving the ARE (4) for the given system dynamics, the optimal control gain is obtained as K* = [1.0 0.8549 0.4791]. The gains for the parameter update laws (12) and (13) are chosen as ν = 35, νk = 55 and ηk = 5. The gain matrix of the state derivative estimator is selected as L = I3. An exploration signal, comprising a sum of sinusoids with irrational frequencies, is added to the control input in (14), which subsequently leads to the convergence of the control gain estimate to its optimal values, as shown in Fig. 1.

Fig. 1. The evolution of the parameter estimate K̂(t) for the proposed method.

The proposed method is compared with the recently published work in [14]. The Q-learning algorithm proposed in [14] solves the adaptive optimal control problem for completely uncertain linear time invariant (LTI) systems. The norms of the control gain estimation error (used in the proposed work) and the actor weight estimation error (as discussed in [14], and analogous to the control gain estimation error) are depicted in Fig. 2.

Fig. 2. Comparison of the parameter estimation error norms between [14] and the proposed method.

The initial conditions are chosen as [0 0 0] and x(0) = [−0.2 0.2 −0.2]^T, and the gains for the update laws of the approach in [14] are chosen as αa = 6 and αc = 50. To ensure sufficient excitation, an exploration noise is added to the control input up to t = 4 s in both cases.
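For reference, the reported optimal gain can be checked offline once the DC-motor model is available; the sketch below uses a hypothetical placeholder model, and the matrices from [12] must be substituted to reproduce the stated value of K*:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical third-order placeholder in companion form; replace A and B with the
# DC-motor shaft-position model from [12].
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, -1.0, -2.0]])
B = np.array([[0.0], [0.0], [1.0]])
Q, R = np.eye(3), np.eye(1)

P_star = solve_continuous_are(A, B, Q, R)
K_star = np.linalg.solve(R, B.T @ P_star)
print(K_star)  # with the model from [12], this should be approximately [1.0 0.8549 0.4791]
```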
From Fig. 3, it can be observed that, for similar control inputs, the convergence rates for both methods (as shown in Fig. 2) are comparable. However, as opposed to the memory-intensive delayed-window integration for the calculation of the regressor in [14], the proposed result does not use past stored data and hence is more memory efficient. Further, an exponential stability result is obtained using the proposed controller, as compared to the asymptotic result obtained in [14]. As seen from Figs. 4 and 5, the state trajectories for both methods initially exhibit a bounded perturbation around the origin due to the presence of the exploration signal. However, once this signal is removed after t = 4 s, the trajectories converge to the origin.

Fig. 3. Comparison of the control inputs between [14] and the proposed method.

Fig. 4. System state trajectories for the proposed method.

Fig. 5. System state trajectories for [14].

VI. CONCLUSION

An adaptive LQR is developed for continuous-time LTI systems with uncertain dynamics. Unlike previous results on adaptive optimal control which use RL/ADP methods, the proposed adaptive controller is memory and computationally efficient and does not require an initial stabilizing policy. The result hinges on a u-PE condition on the regressor vector, which is shown to be critical for proving convergence to the optimal controller. A Lyapunov analysis is used to prove uniform exponential stability of the tracking error and parameter estimation error dynamics, and simulation results validate the efficacy of the proposed algorithm. Future work will focus on relaxing the restrictive u-PE condition without compromising the merits of the proposed result.

APPENDIX
EVALUATION OF BOUNDS

This section presents bounds on different terms encountered at various stages of the proof of Theorem 1. These bounds, comprising norms of the elements of the vector z(t) defined in (24), are developed using (13), (15), (18), (19), Lemma 1, and standard vec operator and Kronecker product properties.

The inequality in (40) results from the use of the projection operator in (12) [35]. The expression in (39) is upper bounded by using Assumptions 1 and 2, Lemma 1, (40) and the supporting bounds in (41), where hi ∈ R for i = 1, 2, ..., 11 are positive constants and, in (41b), a vec-operator equality is used. The known function ρ1(∥z∥) : R → R is positive, globally invertible and non-decreasing, and z ∈ R^{n(n+m)} is defined in (24).