
    Recent Progress in Reinforcement Learning and Adaptive Dynamic Programming for Advanced Control Applications

IEEE/CAA Journal of Automatica Sinica, 2024, Issue 1

Ding Wang, Ning Gao, Derong Liu, Jinna Li, and Frank L. Lewis

Abstract—Reinforcement learning (RL) has roots in dynamic programming and is called adaptive/approximate dynamic programming (ADP) within the control community. This paper reviews recent developments in ADP along with RL and its applications to various advanced control fields. First, the background of the development of ADP is described, emphasizing the significance of regulation and tracking control problems. Some effective offline and online algorithms for ADP/adaptive critic control are displayed, where the main results for discrete-time systems and continuous-time systems are surveyed, respectively. Then, the research progress on adaptive critic control based on the event-triggered framework and under uncertain environments is discussed, where event-based design, robust stabilization, and game design are reviewed. Moreover, the extensions of ADP for addressing control problems under complex environments have attracted enormous attention. The ADP architecture is revisited from the perspective of data-driven and RL frameworks, showing how they significantly promote the ADP formulation. Finally, several typical control applications of RL and ADP are summarized, particularly in the fields of wastewater treatment processes and power systems, followed by some general prospects for future research. Overall, this comprehensive survey on ADP and RL for advanced control applications demonstrates their remarkable potential within the artificial intelligence era. In addition, it also plays a vital role in promoting environmental protection and industrial intelligence.

    I.INTRODUCTION

ARTIFICIAL intelligence (AI) generally refers to the intelligence exhibited by machines that humans build. The definition of AI is very broad, ranging from the legends of robots and androids in Greek mythology to the well-known Turing test, and now to the development of various intelligent algorithms within the framework of machine learning [1]–[4]. AI technology is gradually changing our lives, from computer vision, big data processing, intelligent automation, and smart factories to many other areas.

As the hottest technology of the 21st century, AI cannot be developed without machine learning. Machine learning, as the core of AI, is the foundation of computer intelligence. Reinforcement learning (RL) [5]–[7] is one of the three main approaches of machine learning, along with supervised learning and unsupervised learning. RL emphasizes the interaction between the environment and the agent, with a focus on long-term interaction to change its policies. Through its interactions with the environment, the agent can modify future actions or control policies based on the response to its stimulating actions.

It should be emphasized that RL does not necessarily require a perfect environment model or huge computing resources. RL is inseparable from dynamic programming [8], [9]. Traditional dynamic programming has been investigated considerably in theory and provides the key foundation of RL. However, this technique requires the assumption of an exact system model, which is extravagant for large-scale complex nonlinear systems. Besides, such methods are severely limited in solving the Hamilton-Jacobi-Bellman (HJB) equations of nonlinear systems as the dimensionality of the states and controls increases [8]. Therefore, adaptive/approximate dynamic programming (ADP) [10]–[13], a method combining RL, dynamic programming, and neural networks, was skillfully proposed.

ADP has been widely used to solve a range of optimal control problems for complex nonlinear systems in unknown environments. As the main algorithmic frameworks, value iteration (VI) and policy iteration (PI) have been intensively promoted. The initialization requirements of VI and PI are different. Unlike PI, which must start with an initial admissible control law, VI places no strict requirement on the initial control law. From the iterative control point of view, however, PI presents a more stable mechanism. Both algorithms are attracting more and more attention from the control community [14], [15]. Due to their respective properties, VI has received more attention in the discrete-time domain, while PI is more commonly applied in the continuous-time domain.
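To make the distinction concrete, the following minimal sketch contrasts the two iterations on a small finite Markov decision process with cost minimization. The transition model, stage cost, and tolerances are illustrative placeholders rather than quantities taken from the surveyed works.

```python
# Minimal sketch contrasting value iteration (VI) and policy iteration (PI)
# on a small finite MDP with cost minimization. P and U are illustrative.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
U = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # stage cost

def value_iteration(tol=1e-8):
    V = np.zeros(n_states)                      # no admissibility requirement on V0
    while True:
        Q = U + gamma * P @ V                   # Q[s, a] = U(s, a) + gamma * E[V(s')]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new

def policy_iteration(pi0):
    pi = np.array(pi0)                          # PI starts from an initial (admissible) policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = U_pi exactly
        P_pi = P[np.arange(n_states), pi]
        U_pi = U[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, U_pi)
        # Policy improvement: greedy update with respect to V
        pi_new = (U + gamma * P @ V).argmin(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration(pi0=np.zeros(n_states, dtype=int))
print(pi_vi, pi_pi)  # both converge to the same optimal policy
```

Note that VI above starts from the zero value function, whereas PI evaluates a policy exactly at every step, mirroring the initialization difference discussed above.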

There have been many classical reviews and monographs [13], [16]–[19] that summarize and discuss ADP/RL, and they have had a profound influence on and provided inspiration for subsequent research. However, it is rare to find a paper that integrates the regulator problem, the tracking control problem, the multi-agent problem, the robustness of uncertain systems, and the event-triggered mechanism, especially with discussions of both the discrete-time and continuous-time cases. In this paper, we aim to discuss recent research progress on these problems, primarily focusing on discrete-time systems while supplementing some excellent work on continuous-time systems. Fig. 1 illustrates some of the key technologies in the ADP field involved in this paper. In order to promote the further development of ADP, this paper provides a comprehensive overview of theoretical research, algorithm implementation, and related applications. It covers the latest research advances and also analyzes and predicts the future trends of ADP. This paper consists of the following parts: 1) basic background, 2) recent progress of ADP in the field of optimal control, 3) development of ADP in the event-triggered framework, 4) development of ADP in complex environments and the combination of ADP with other advanced control methods, 5) the impact of the data-driven method and RL on ADP technology, 6) typical control applications of ADP and RL, and 7) discussion of possible future directions for ADP.

Fig. 1. Taxonomy diagram of related methods in this survey.

    II.OPTIMAL REGULATION AND TRACKING WITH ADP

Generally speaking, ADP-based algorithms can be performed offline or online. In this section, we focus on the problems of optimal regulation and optimal tracking control and provide a detailed overview.

    A. Offline Optimal Regulation With ADP

1) Discrete-Time Systems: Consider the following affine discrete-time nonlinear system:
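$$x(k+1) = f\bigl(x(k)\bigr) + g\bigl(x(k)\bigr)u(k), \qquad k = 0, 1, 2, \ldots,$$

where $f(\cdot)$ and $g(\cdot)$ denote the drift and input dynamics. With a quadratic utility $U(x,u) = x^{\top}Qx + u^{\top}Ru$ (an illustrative choice, with $Q$ and $R$ positive definite weighting matrices), the cost to be minimized and the associated HJB equation take the standard forms

$$J\bigl(x(k)\bigr) = \sum_{i=k}^{\infty} U\bigl(x(i), u(i)\bigr), \qquad V^{*}\bigl(x(k)\bigr) = \min_{u(k)}\Bigl\{U\bigl(x(k), u(k)\bigr) + V^{*}\bigl(x(k+1)\bigr)\Bigr\}.$$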

    TABLE I BASIC TERMS OF THE GENERAL ADP STRUCTURE

Fig. 2. The general structure of ADP.

Besides, action-dependent and goal-representation versions of these structures are also used sometimes. Taking HDP as an example, action-dependent HDP (ADHDP) [21] consists of three parts: the controlled object, the critic network, and the action network. It is capable of achieving optimal control without using system information. Compared with ADHDP, goal-representation HDP (GrHDP) [22] adds a goal network, which is associated with the critic network and the action network. The goal network can generate more accurate internal signals for control, calculation, and planning, and it also improves the learning ability of the control system. In Table II, we compare the inputs and outputs of ADHDP and GrHDP, where the internal reinforcement signal is generated by the goal network.

    TABLE II COMPARISON OF THE INPUTS AND THE OUTPUTS BETWEEN ADHDP AND GRHDP

Compared with VI [20], the monotonicity of the iterative cost function sequence varies under the more general initial condition used in general VI (GVI) [26]. The GVI algorithm can be initialized by a positive semi-definite function $V_0(x) = x^{\top}\Phi x$, where $\Phi$ is a positive semi-definite matrix. Furthermore, the iterative cost function was proved to satisfy the inequality

where $0 \le \alpha \le 1$, $1 \le \beta < \infty$, and $1 \le \theta < \infty$.

Then, by using a novel convergence analysis, Wei et al. [27] proved the convergence and optimality of GVI. In addition, it was shown that the termination criterion in [20] could not guarantee the admissibility of the near-optimal control policy. The admissibility termination criterion was proposed as

Following [27], Ha et al. [28] presented a new admissibility condition of GVI described by

    and policy improvement

In [31], the convergence and stability of PI were analyzed for the first time, where the initial admissible control law was obtained by trial and error. The iteration process ensures that all iterative control laws are stabilizing. Compared with [31], the admissible control law can be obtained more conveniently in [27] and [28]. Based on [31], Liu et al. [15] proposed generalized PI (GPI) for optimal control of system (1), where the convergence and optimality properties were guaranteed. In essence, VI [20] and PI [31] are both special cases of GPI. Since many systems can only be locally stabilized, in order to address the regional nature of discrete-time optimal control, an invariant PI method was proposed by Zhu et al. [32], where the suitable region for the new policy is updated.

In addition, there are some works on the combination of VI and PI. In [33], Luo et al. introduced an adaptive method to solve the Bellman equation, which balanced VI and PI by adding a balance factor. It is noted that the algorithm in [33] can accelerate the iterative process and does not need an initial admissible control law. To obtain a stable iterative control policy, Heydari [34] proposed stabilizing VI, where an initial admissible control law $u_0$ was evaluated to implement the VI. Based on [34], Ha et al. [28] developed an integrated VI method based on GVI, which was used to generate the admissible control law. In Table III, we summarize the initial conditions and monotonicity of GVI ($V_0 \le V_1$), GVI ($V_0 \ge V_1$), stabilizing VI, and integrated VI. Integrated VI consists of GVI ($V_0 \le V_1$) and stabilizing VI, where GVI ($V_0 \le V_1$) provides the initial admissible control policy for stabilizing VI. Therefore, in the following table, integrated VI only represents the monotonicity of its core component.

    TABLE III CLASSIFICATION OF VI ALGORITHMS

2) Continuous-Time Systems: Compared with discrete-time systems, most of the literature focuses on the PI method for continuous-time systems, even though the VI strategy can still be used as in [35]. We consider the continuous-time system
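$$\dot{x}(t) = f\bigl(x(t)\bigr) + g\bigl(x(t)\bigr)u(t), \qquad x(0) = x_0,$$

written here in the standard affine form,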

where $x(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^m$, $f(\cdot) \in \mathbb{R}^n$, and $g(\cdot) \in \mathbb{R}^{n \times m}$ represent the state vector, the control vector, the drift dynamics, and the input dynamics, respectively. Assume that $f(0) = 0$ and that the system can be stabilized on the operation region. For systems in the strict-feedback form with uncertain dynamics, Zargarzadeh et al. [36] utilized neural networks to estimate the cost function by using state measurements. In [37], a data-based continuous-time PI algorithm was proposed, where a critic-identifier was introduced to estimate the cost function and the Hamiltonian of the admissible policy. Differently from [38], the algorithm in [37] was used in continuous-time systems and did not require samples of the input and output trajectories of the system. Compared with [36], the method proposed in [37] can be extended to multicontroller systems. In addition, to relieve the computational burden, a novel distributed PI algorithm was established in [39], where the iterative control policies can be updated one by one. The above works [35]–[39] all focus on time-invariant nonlinear systems. Moreover, $V(\cdot)$ and $u(\cdot)$ both rely on the system state. In [40], for time-varying nonlinear systems, Wei et al. developed a novel PI algorithm, where the optimality and stability were discussed. It is worth noting that a mass of literature concentrates on the progress of VI algorithms, whose structures are similar to those for discrete-time systems. Bian and Jiang [41] extended VI to continuous-time nonlinear systems.
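For reference, a generic continuous-time PI iteration of the kind surveyed above alternates policy evaluation and policy improvement. Under the common assumption of a utility of the form $r(x,u) = Q(x) + u^{\top}Ru$ (a sketch, not the specific formulation of any single cited work), the two steps read

$$\bigl(\nabla V_i(x)\bigr)^{\top}\bigl[f(x) + g(x)u_i(x)\bigr] + Q(x) + u_i^{\top}(x)Ru_i(x) = 0, \qquad V_i(0) = 0,$$

$$u_{i+1}(x) = -\frac{1}{2}R^{-1}g^{\top}(x)\nabla V_i(x),$$

where the first equation is solved for $V_i$ under the current admissible policy $u_i$ and the second updates the policy, starting from an initial admissible control law $u_0$.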

    B. Online Optimal Regulation With ADP

1) Discrete-Time Systems: As mentioned in [34], unlike offline ADP, online ADP needs to be implemented by selecting an initial control policy and improving it according to some criteria until it converges to the optimal one. Note that the key difference between offline ADP and online ADP is that the control policy generated by offline ADP remains unchanged during the controlled stage of the system, whereas the control policy is updated during operation in online ADP.

First, consider online ADP for discrete-time systems. Usually, the optimal cost function and the optimal control policy are approximated by neural networks as

    and

respectively, where $\omega_c$ and $\theta_a$ are the weight vectors of the target neural networks, $\varepsilon_c$ and $\varepsilon_a$ are the bias terms, and $\sigma(\cdot)$ and $\delta(\cdot)$ are the activation function vectors. The optimal cost function is estimated by the critic network

    and the optimal control policy is estimated by the action network

where $\hat{\omega}_c$ and $\hat{\theta}_a$ are the estimated values of $\omega_c$ and $\theta_a$, respectively. Since $V^{*}(x(k))$ and $u^{*}(x(k))$ satisfy the HJB equation, we get

Substituting (13) and (14) into the HJB equation, we obtain

In the above research on online ADP, the approximate optimal control policy is updated by tuning the weights of the neural networks. Besides, the improved control law can be acquired by PI or VI. Since the iterative control policy obtained by PI is stable, PI is widely used in online control. However, there are also some works on updating the control policy by VI. For example, in [34], Heydari proposed an online algorithm based on stabilizing VI, where the system was controlled under different iterative policies. In [14], combining the stability condition of GVI and the concept of the attraction domain, a novel online algorithm was introduced by Ha et al., where the current control law is chosen according to the location of the current state.
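For illustration, the following minimal sketch shows one way such online weight tuning can be organized with linear-in-parameter critic and action networks: the critic weights are adjusted along the gradient of the temporal-difference (Bellman) residual and the action weights along the gradient of the approximate Hamiltonian. The features, plant, gains, and reset rule are illustrative assumptions, not the specific update laws of the cited works.

```python
# Sketch of online adaptive-critic weight tuning with linear-in-parameter
# approximators V_hat(x) = wc @ sigma(x) and u_hat(x) = wa @ delta(x).
# The plant, features, and gains are illustrative placeholders.
import numpy as np

def sigma(x):                      # critic features
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def delta(x):                      # action features
    return np.array([x[0], x[1]])

def plant(x, u):                   # assumed affine discrete-time plant
    return 0.95 * x + np.array([0.0, 0.1]) * u

Q, R, gamma = np.eye(2), 0.1, 1.0
alpha_c, alpha_a = 0.05, 0.02      # learning rates
wc = np.zeros(3)                   # critic weights
wa = np.zeros(2)                   # action weights
x = np.array([1.0, -0.5])

for k in range(2000):
    u = float(wa @ delta(x)) + 0.01 * np.random.randn()   # exploration noise
    x_next = plant(x, u)
    utility = float(x @ Q @ x + R * u * u)
    # Temporal-difference (Bellman) residual of the approximate cost function
    e_c = float(wc @ sigma(x)) - (utility + gamma * float(wc @ sigma(x_next)))
    wc -= alpha_c * e_c * sigma(x)                         # critic gradient step
    # Move the action weights toward reducing the approximate Hamiltonian
    jac = np.array([[2*x_next[0], x_next[1], 0.0],
                    [0.0, x_next[0], 2*x_next[1]]]).T      # d sigma / d x_next
    dV_dxnext = wc @ jac
    dH_du = 2 * R * u + float(dV_dxnext @ np.array([0.0, 0.1]))
    wa -= alpha_a * dH_du * delta(x)                       # action gradient step
    x = x_next if np.linalg.norm(x_next) > 1e-3 else np.array([1.0, -0.5])

print(wc, wa)
```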

2) Continuous-Time Systems: For continuous-time systems, the principle of online ADP is similar to that of discrete-time systems. Here, we display some main progress on online methods with PI. For weakly coupled nonlinear systems, a data-based online learning algorithm was established by Li et al. [43], where the original optimal control problem of the weakly coupled systems was transformed into three reduced-order optimal control problems. In [38], He et al. introduced a novel online PI method, where the technique of neural-network-based online linear differential inclusion was used for the first time. In addition, to solve the optimal synchronization of multi-agent systems, an off-policy RL algorithm was presented in [44], where dynamic models of the agents were not required.

    C. Optimal Tracking Design With ADP

1) Discrete-Time Systems: With the development of aviation, navigation, and other fields in recent years, research interest in optimal tracking design has gradually increased within the control community. Here, we concentrate on the optimal tracking control problem. Define the desired tracking trajectory as

Considering the original system (1), the tracking error $e(k)$ is described as

Assume that there exists a steady control $u_d(k)$ that makes the following equation hold:

The objective of the optimal tracking control problem is to find the optimal control law $u_x(k)$, which forces the system output to track the reference trajectory. This can be achieved by minimizing the performance index, i.e., the cost function. Hence, the choice of the cost function is undoubtedly important. Generally, we choose the form of the cost function according to the control objective. Wang et al. [23] applied DHP to the tracking control design for nonaffine discrete-time systems, where a discount factor was considered. After that, actuator saturation was also considered in [24]. It is noted that the form of the utility function in [45]–[47] is given by
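$$U\bigl(e(k), u_e(k)\bigr) = e^{\top}(k)Qe(k) + u_e^{\top}(k)Ru_e(k),$$

written here in its standard quadratic form with positive definite weighting matrices $Q$ and $R$,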

where $u_e(k) = u_x(k) - u_d(k)$. Since it is not convenient to calculate the reference control policy $u_d(k)$, some scholars choose other forms of the utility function. For example, Kiumarsi and Lewis [48] introduced a partially model-free ADP method, in which optimal tracking control of nonlinear systems with input constraints is achieved by using a discounted performance function based on the augmented system. In [49], Lin et al. proposed a policy gradient algorithm and used experience replay for optimal tracking design. They used Lyapunov's direct method to prove the uniform ultimate boundedness (UUB) of the closed-loop system. The utility function in [48] and [49] is described as
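$$U\bigl(e(k), u_x(k)\bigr) = e^{\top}(k)Qe(k) + u_x^{\top}(k)Ru_x(k),$$

in which the control term penalizes $u_x(k)$ directly (a standard quadratic form, given here for reference), so that the steady control $u_d(k)$ is no longer required.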

Even though the steady control is avoided in (21), this choice cannot eventually eliminate the tracking error. To deal with this problem, Li et al. [50] developed a novel utility function given by

The optimality of VI and PI was analyzed for this formulation. In addition, Ha et al. [51] also analyzed the system stability of the VI algorithm for the novel utility function with a discount factor.

2) Continuous-Time Systems: There are also a few works on continuous-time systems. In [52], Gao and Jiang solved the optimal output regulation problem by ADP and RL, where ADP was for the first time combined with the output regulation problem for adaptive optimal tracking control with disturbance attenuation. However, this approach requires partial knowledge of the system dynamics. To overcome this difficulty, in [53], the integral RL algorithm was introduced to achieve optimal online control, where off-policy integral RL was employed to obtain the optimal control feedback gain for the first time. In addition, differently from [52], the algorithm in [53] relieved the computational burden. Then, in [54], Fu et al. proposed a robust approximate optimal tracking method. In order to relax the assumption that the reference signal must be continuous in continuous-time systems, a new Lyapunov function was proposed that does not require the derivative information of the tracking error.

In particular, ADP also plays a pivotal role in the optimal control of linear systems, such as the linear quadratic regulation (LQR) problem and the tracking problem [55]–[60]. Generally speaking, when considering optimal control of nonlinear systems, the HJB equation is usually solved to acquire the optimal control policy. However, the linear system is a special case with good properties: the solution of the HJB equation can be transformed into the solution of an algebraic Riccati equation, so as to obtain the exact optimal control law. In [56], Rizvi and Lin proposed an online Q-learning method based on output feedback to tackle the LQR problem. Wang et al. [57] developed an optimal LQR design based on the discounted VI algorithm and provided a series of criteria to judge the stability of the systems. In [58], the LQR problem was solved for continuous-time systems with unknown system dynamics and without an initial stabilizing strategy. The proposed controller was updated continuously by utilizing measurable input-output data to avoid instability. For the same uncertain systems, Rizvi and Lin [59] proposed a model-free static output feedback controller based on RL, which avoided the influence of the exploration bias problem. In addition, researchers also pay much attention to optimal tracking design for linear systems. For networked control systems with uncertain dynamics, Jiang et al. [60] developed a Q-learning algorithm to obtain the online optimal control policy based on measurable data with network-induced dropouts.
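To illustrate this special structure, the sketch below solves a discrete-time LQR problem by iterating the Riccati recursion, a model-based, value-iteration-style fixed point; the system matrices are made-up placeholders and this is not the Q-learning or output-feedback scheme of any cited work.

```python
# Sketch: discrete-time LQR solved by iterating the Riccati recursion, the
# special case in which the HJB equation reduces to an algebraic Riccati equation.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 0.9]])   # illustrative plant matrices
B = np.array([[0.0], [0.1]])
Q = np.eye(2)                            # state weighting
R = np.array([[0.5]])                    # control weighting

P = np.zeros((2, 2))                     # value function V(x) = x^T P x, start from zero
for _ in range(500):                     # value-iteration-style Riccati recursion
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback gain
    P = Q + A.T @ P @ (A - B @ K)

print("Optimal gain K =", K)             # optimal control u(k) = -K x(k)
print("Riccati solution P =", P)
```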

    III.EVENT-TRIGGERED CONTROL WITH ADP

In this section, we mainly introduce the application of event-triggered technology under the ADP framework. It is discussed for discrete-time systems and continuous-time systems, respectively.

As an advanced aperiodic control method, event-triggered control plays a vital role in decreasing the computational burden and enhancing the resource utilization rate. In short, the purpose of introducing the event-triggered mechanism is to reduce the number of controller updates by decreasing the number of samples of the system state. Unlike the time-triggered control method, event-triggered control relies on a triggering condition designed to guarantee the stability of the controlled system. The control input is updated only when this triggering condition is violated. Conversely, if the triggering condition is not violated, the zero-order hold keeps the control input unchanged until the next event is triggered.
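The following minimal sketch shows this update logic: the controller output is recomputed only when a state-error-based triggering condition is violated; otherwise the zero-order hold keeps the last control input. The plant, feedback gain, and threshold rule are illustrative assumptions and not the specific conditions surveyed below.

```python
# Sketch of an event-triggered control loop with a zero-order hold.
import numpy as np

A = np.array([[1.0, 0.05], [0.0, 0.98]])
B = np.array([[0.0], [0.05]])
K = np.array([[2.0, 1.5]])              # assumed stabilizing feedback gain

def triggering_violated(x, x_last, eta=0.15):
    # Event-triggered error e_j(k) = x(k_j) - x(k); trigger when it grows
    # beyond a fraction eta of the current state norm.
    return np.linalg.norm(x_last - x) > eta * np.linalg.norm(x)

x = np.array([1.0, -0.8])
x_last = x.copy()                       # state at the latest triggering instant
u = (-K @ x_last).item()                # initial control input
events = 0

for k in range(200):
    if triggering_violated(x, x_last):
        x_last = x.copy()               # sample the state
        u = (-K @ x_last).item()        # update the control input
        events += 1
    # otherwise the zero-order hold keeps u unchanged
    x = A @ x + (B * u).ravel()

print(f"controller updates: {events} out of 200 steps, final state {x}")
```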

    A. Event-Triggered Control for Discrete-Time Systems

For discrete-time systems, event-triggered technology has been widely used in the adaptive critic framework. In [61], Dong et al. used the event-triggered method to solve the optimal control problem under the HDP framework and proved that the controlled system was asymptotically stable. An event-triggered near-optimal control algorithm was proposed for affine nonlinear dynamics with constrained inputs in [62]. In addition, a special cost function was introduced and the system stability was analyzed. In [63], a novel adaptive control approach with disturbance rejection was designed for linear discrete-time systems. In [64], Zhao et al. proposed a new event-driven method via direct HDP. Then, the UUB of the system states and of the weights in the control policy networks was proven. In [65] and [66], a novel event-triggered optimal tracking method was developed to control affine systems. It is worth noting that the triggering condition in these two works only acts on the time steps, and the weight-updating stage of the iterative process is not involved. For systems whose models are known, by using the event-triggered control approach, not only can the reference trajectory be tracked, but the computational burden can also be effectively reduced. In [67], Wang et al. proposed an event-based DHP method, where three kinds of neural networks were used to identify the nonlinear system, estimate the gradient of the cost function, and approximate the tracking control law. In addition, the stability of the event-based controlled system was proved by the theorem of input-to-state stability, and the control scheme was applied to a wastewater treatment simulation platform.

Then, the event-triggered error vector is defined as $e_j(k) = x(k_j) - x(k)$, where $k_j \le k < k_{j+1}$ and $x(k_j)$ denotes the state sampled at the latest triggering instant. Between two triggering instants, the zero-order hold keeps the control input at $u^{*}(x(k_j))$. The corresponding optimal control can be obtained by evaluating the optimal feedback law at the sampled state $x(k_j)$.

    Next, we introduce several triggering conditions commonly used in combination with adaptive critic methods.

    1) Suppose there exists a positive number I satisfying

In addition, the inequality ‖?(k+1)‖ ≤ ‖x(k+1)‖ holds. By referring to [61], [62], [67], a triggering condition was designed as

    We can obtain different levels of triggering effect by appropriately adjusting the parameter I.

2) According to the updating method of the neural networks in [64], we assume that the activation function $\sigma_a$ in the action network satisfies the Lipschitz condition (29), i.e., $\|\sigma_a(x_1) - \sigma_a(x_2)\| \le P\|x_1 - x_2\|$ for all $x_1, x_2 \in X$, where $P$ is a positive constant and $X$ is the domain of the system dynamics.

Lemma 1 [64]: Let (29) hold for the nonlinear system (1). Assume that the triggering condition is defined as follows:

where $0 \le \beta < 1$ and $\hat{w}_a$ is the weight of the action network; $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ represent the minimal and maximal eigenvalues of a matrix, respectively. In addition, we make the action network learning rate satisfy

    and the critic network learning rate satisfy

    Then, we can declare that the event-based control input can guarantee the UUB of the controlled system.

3) The triggering condition described below can only be applied to the time-based case. For the iterative process, the traditional time-triggered method is adopted. In [65], [66], a triggering condition was defined as follows:

    where

According to the results in [65], the adjustable parameter $\gamma$ plays an essential role in event-triggered optimal control. If the main emphasis is on optimizing the cost function, $\gamma$ should be chosen as small as possible. On the contrary, when resource utilization is the main concern, $\gamma$ should be chosen as large as possible. Therefore, the selection of $\gamma$ should be determined according to the actual need.

    B. Event-Triggered Control for Continuous-Time Systems

There are extensive studies of event-triggered control methods within the framework of ADP for continuous-time systems. In [68], Luo et al. designed an event-triggered optimal control method based directly on the solution of the HJB equation. In addition, the stability of the system and a lower bound on the inter-execution times were proved theoretically. In [69], for a class of nonlinear multi-agent systems, novel event-triggered and asynchronous edge-event-triggered mechanisms were designed for the leader and all edges, respectively. In [70], Huo et al. developed a decentralized event-triggered control method to aperiodically update each auxiliary subsystem. In [71], a different event-based decentralized control scheme was proposed, which used a co-design strategy to trade off control policies and triggering thresholds so as to simultaneously optimize subsystem performance and reduce the computational burden.

Considering the continuous-time nonlinear system (10), we assume $f + gu$ to be Lipschitz continuous on a set $\Omega$ that contains the origin. We assume that there exists an admissible control $u_x(t)$, and the cost function is defined as

    Next, the optimal control law under the time-triggered mechanism is defined as

The event-triggered mechanism is similar to that of discrete-time systems. Therefore, we define the state as

for all $j \in \mathbb{N}$. The optimal control law under the event-triggered mechanism can be expressed as

For conventional event-triggered control, the design of triggering conditions is inevitable. Next, we introduce two triggering conditions under continuous-time environments.

    1) This triggering condition is established based on a reasonable Lipschitz condition.

with $\alpha > 0$ being a constant. Then, the controlled system is proved to be asymptotically stable under this triggering condition.

    The main purpose of the event-triggered technology is to reduce the waste of communication resources and improve computational efficiency.In recent years, networked control systems have attracted extensive attention.There is also an increasing amount of work aimed at reducing the energy consumption of network interfaces and ensuring the sustainability of networked control systems.Some related studies can be found in [73], [74].

    IV.ROBUST CONTROL AND GAME DESIGN WITH ADP

In modern engineering systems, real control plants are always affected by changes arising from the system model, the external environment, and other factors. Hence, it is of great importance to attain a robust control strategy that counteracts the influence of uncertainties. The robust control problem can be turned into an optimal control problem, which is a useful route for obtaining the robust controller. However, for complex nonlinear systems, it is difficult to solve the optimal control problem directly, and the ADP method is utilized to deal with this dilemma. In this section, recent research progress of ADP is described, including the use of ADP to solve robust control, H∞ control, and multi-player game design problems. In addition, some other advanced control methods combined with ADP are summarized at the end of this section.

    A. Robust Control Design With ADP

By utilizing ADP, robust controllers can be designed based on the obtained optimal control strategy. Compared with traditional methods, controllers guided by ADP can not only stabilize the system but also optimize system performance. Recent work on robust control is analyzed from both the discrete-time and continuous-time perspectives in this section.

1) Discrete-Time Systems: We consider a class of discrete-time nonlinear systems with uncertain terms, for which the corresponding discrete-time HJB equation (49) is modified accordingly.

By choosing an appropriate utility function, robust stabilization can be transformed into an optimal control problem for the nominal system [75]–[77]. In [76], the idea of solving the generalized HJB equation was employed to derive a robust control policy for discrete-time nonlinear systems subject to matched uncertainties, with a neural network used as the function approximator. In addition, Li et al. [77] proposed an adaptive interleaved RL algorithm to find the robust controller of discrete-time nonlinear systems subject to matched or mismatched uncertainties. An action-critic structure was constructed to implement the design, and the convergence of the proposed algorithm and the UUB of the system were proved. An appropriate utility function was chosen as

Note that there is a new term $\beta x(k)$ in the utility function compared to the traditional expression $x^{\top}(k)Qx(k) + u_x^{\top}(k)Ru_x(k)$. Tripathy et al. [78] introduced a virtual input to compensate for the effect of uncertainties. By defining a sufficient condition, the stable control law of the mismatched system was derived, and at the same time the stability of the uncertain system was proved. The uncertainty can be decomposed into matched and mismatched components as

2) Continuous-Time Systems: For continuous-time nonlinear systems, the principle of robust control with ADP is similar to that of discrete-time systems. Considering uncertainties, the continuous-time nonlinear system is defined as

and the corresponding nominal system is defined as in (10). In order to obtain the optimal feedback control law, we need to minimize the cost function

where $\rho > 0$ and the utility function $r(x,u) \ge 0$. Compared with the normal form, it is worth noting that the cost function (55) is modified to reflect matched uncertainties. We assume the control input $u \in \Psi(\Omega)$, where $\Psi(\Omega)$ is the set of admissible control laws on $\Omega$. Then, the nonlinear Lyapunov equation can be expressed as

    According to (56), we define the Hamiltonian as

    Considering (56)–(58), the optimal cost function satisfies the HJB equation

ADP-based robust control schemes can be divided into the following categories: least-squares-based transformation methods [79], adaptive-critic-based transformation methods [80], data-based transformation methods [81], robust ADP methods [82], [83], and so on. In [84], Wang proposed an adaptive method based on a recurrent neural network to solve the robust control problem. A cost function with an additional utility term was defined to counteract the effect of perturbations on the system, and the stability of the relevant nominal system was proved, further expanding the application scope of the ADP method. In [85], the robust control problem was transformed into an optimal tracking control problem by introducing an auxiliary system consisting of a steady-state part and a transient part, and the stability of the transient tracking error was analyzed. Pang et al. [86] studied the robustness of PI for addressing the continuous-time infinite-horizon LQR problem.

B. H∞ Control Design With ADP

In H∞ control design, a control law is constructed for dynamical systems containing external disturbances and uncertainties. According to the principle of minimax optimality, the H∞ control problem is usually described as a two-player zero-sum differential game. In order to obtain the controller that minimizes the cost function in the worst case, we need to find the Nash equilibrium solution corresponding to the Hamilton-Jacobi-Isaacs (HJI) equation. However, for general nonlinear systems, it is hard to obtain the analytical solution of the HJI equation, which is similar to the difficulty encountered in solving the nonlinear optimal control problem. In recent years, ADP has been widely used for solving H∞ control problems.

1) Discrete-Time Systems: Consider the following discrete-time nonlinear system with external disturbances:

    We define the cost function as follows:
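$$J\bigl(x(0), u, w\bigr) = \sum_{k=0}^{\infty}\Bigl[x^{\top}(k)Qx(k) + u^{\top}(k)Ru(k) - \gamma^{2}w^{\top}(k)w(k)\Bigr],$$

a typical zero-sum choice in which $w(k)$ denotes the disturbance and $\gamma > 0$ is the prescribed attenuation level (the quadratic form is given here for reference).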

In [87], [88], the H∞ tracking control problem was studied by using data-based ADP algorithms. Hou et al. [87] proposed an action-disturbance-critic structure to ensure that the minimum cost function and the optimal control policy were obtained. Liu et al. [88] transformed the time-delay optimal tracking control problem with disturbances into a zero-sum game problem and proposed an ADP-based H∞ tracking control method. A dual event-triggered constrained control scheme based on DHP [89] was used to solve the zero-sum game problem and was eventually applied to the F-16 aircraft system. A disturbance-based neural network was added to the action-critic structure by Zhong et al. [90], who relaxed the requirement for system information by defining a new type of performance index. This approach extended the applicability of the ADP algorithm and was the first implementation of model-free globalized dual heuristic programming (GDHP).

2) Continuous-Time Systems: Consider a class of continuous-time nonlinear systems with external disturbances

In practical applications, the exact system dynamics are often difficult to obtain, and identification methods can also produce unpredictable errors. For continuous-time unknown nonlinear zero-sum game problems, Zhu et al. [91] proposed an iterative ADP method that efficiently uses online data to train the neural networks. In [92], a novel distributed H∞ optimal tracking control scheme was designed for a class of physically interconnected large-scale nonlinear systems in strict-feedback form with external disturbances and saturating actuators.

    C. Game Design With ADP

Modern control systems are becoming more and more complex, with many decision makers who compete and cooperate with each other. As an essential theory for multiple participants seeking optimal solutions, game theory is also increasingly studied in the field of control. According to the cooperation pattern among the players, games can be divided into zero-sum and nonzero-sum games, or non-cooperative and cooperative games. In a zero-sum game, the players do not cooperate. However, in a nonzero-sum game, there is a possibility of cooperation among the players so that each of them achieves high performance. Similarly, game theory can be combined with ADP techniques to solve optimal control problems. With the rapid development of iterative ADP, many new methods have emerged to deal with games involving N players [21], [93]–[99].

1) Discrete-Time Systems: Consider a class of discrete-time systems with N players

    The optimal cost functions are given as

which is known as the discrete-time HJB equation. Then, we can obtain the optimal control law

Zhang et al. [21] combined game theory and the PI algorithm to solve the multiplayer zero-sum game problem based on ADHDP. This method not only ensured that the system achieved stability but also minimized the performance index function for each player. Song et al. [93] divided the off-policy N-coupled Hamilton-Jacobi (HJ) equations into an unknown-parameter part and a system-operating-data part. In this way, the HJ equations can be solved without the system dynamics. Therefore, this approach was very effective for solving multiplayer nonzero-sum game problems with unknown system dynamics. For the domain shift problem, Raghavan et al. [94] compensated for the optimal desired shift by constructing a zero-sum game and proposed a direct error-driven learning scheme.

2) Continuous-Time Systems: Consider the following continuous-time systems with N players:

where $u_j$ with $j = 0, 1, \ldots, N$ represents the control input. Then, we define the cost function as
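$$J_k(x_0) = \int_{0}^{\infty}\Bigl[Q_k\bigl(x(\tau)\bigr) + \sum_{j} u_j^{\top}(\tau)R_{kj}u_j(\tau)\Bigr]d\tau,$$

a standard nonzero-sum form written here with the weights defined below,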

where $Q_k(x)$ is a positive definite function and $R_{kj}$ represents a positive definite matrix with appropriate dimensions.

Assuming that the cost function is continuously differentiable, the Hamiltonian associated with the $k$th player is defined as
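$$H_k\bigl(x, \nabla V_k, u_1, \ldots, u_N\bigr) = \bigl(\nabla V_k(x)\bigr)^{\top}\Bigl[f(x) + \sum_{j} g_j(x)u_j\Bigr] + Q_k(x) + \sum_{j} u_j^{\top}R_{kj}u_j,$$

where affine dynamics with drift $f(x)$ and input matrices $g_j(x)$ are assumed for this standard form,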

    and the optimal control law can be obtained by
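$$u_k^{*}(x) = -\frac{1}{2}R_{kk}^{-1}g_k^{\top}(x)\nabla V_k^{*}(x),$$

which follows from the stationarity condition $\partial H_k/\partial u_k = 0$ (a standard form, given here for reference).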

Inspired by zero-sum and nonzero-sum game theory, Lv and Ren [98] proposed a solution for multiplayer mixed zero-sum nonlinear games. They defined two value functions containing the performance indicators of the zero-sum and nonzero-sum games, respectively. The optimal strategy of each player was obtained without using an action network, and the stability of the system was proved. In addition, Zhang et al. [99] developed a novel near-optimal control scheme for unknown nonlinear nonzero-sum differential games via an event-based ADP algorithm.

    D. Other Advanced Control Methods With ADP

With the development of ADP technology, more and more advanced control methods have been improved. This section shows the application of ADP techniques in decentralized, distributed, and multi-agent systems. Meanwhile, research progress related to the ADP/RL technique in the field of model predictive control (MPC) is displayed.

Modern control systems usually consist of several subsystems with essential interconnections. It is difficult to analyze large-scale systems by using classical centralized control techniques. Therefore, decentralized or distributed control strategies are usually preferred for solving the optimal control problems of the subsystems. Yang et al. [100], [101] not only studied the decentralized stability problem subject to asymmetric constraints, but also transformed the decentralized control problem into a set of optimal control problems by introducing discounted cost functions for the auxiliary subsystems. Tong et al. [102] developed an adaptive fuzzy decentralized control method for optimal control problems of large-scale nonlinear systems in strict-feedback form. They proposed two controllers, i.e., a feedforward controller and a feedback controller, to ensure that the tracking error of the closed-loop system converges to a small neighborhood of the origin. Without using the dynamic matrices of all subsystems, Song et al. [103] developed a novel parallel PI algorithm to implement a decentralized sliding mode control scheme.

In [104], taking unknown discrete-time system dynamics into account, a local Q-function-based ADP method was introduced to address the optimal consensus control problem. Besides, a distributed PI technique was developed using the defined local Q-functions, which was proved to converge to the solutions of the coupled HJB equations. Fu et al. [105] developed a distributed optimal observer for a discrete-time nonlinear active leader with unknown dynamics. It is worth mentioning that the design of the ADP-based distributed optimal observer was developed via the action-critic framework. For continuous-time distributed systems, due to the limited transmission rate of communication channels and the limited bandwidth in some shared communication networks, time delay is an inescapable factor when dealing with the consensus problem. Therefore, in [106], for high-order integrator systems with matched external disturbances, the fixed-time leader-follower consensus problem was addressed by constructing a distributed observer.

Jiang et al. [107] estimated the leader's state and dynamics through an adaptive distributed observer and used a model-state-input structure to solve the regulator equations of each follower. In addition, the stability of the system was analyzed independently. In [108], Sargolzaei et al. introduced a Lyapunov-based method, which mitigated false-data-injection attacks in real time for a centralized multi-agent system with additive disturbances and input delays. Besides, the condition of persistence of excitation is hard to verify. Huang et al. [109] redesigned the updating laws of the action and critic components to ensure the stability of the system by introducing persistence of excitation and additional constraints. In addition, the study of tracking control of multi-agent systems has attracted significant attention due to its broad background of applications. For example, Gao et al. [110] first integrated ADP with the internal model principle to investigate the problem of cooperative adaptive optimal tracking control. A distributed control policy based on the data-driven technique was put forward for the leader model with external disturbances. Furthermore, the stability of the closed-loop system was also demonstrated.

MPC methods mainly solve optimal control problems with constraints [111]–[118]. There is a very similar theoretical scheme behind ADP and MPC: the core of both methods is to solve the optimal control problem and obtain the corresponding control policy, and the control policy should be able to ensure stability. Therefore, the combination of MPC and ADP is a promising and important direction. In [112], Bertsekas pointed out the relationship between MPC and ADP and showed that the core ideas and mathematical essence of both are rooted in PI. Dual-mode MPC has been combined with the action-critic structure to improve performance and guarantee stability [113]. Based on these results, Hu et al. [114] introduced a model predictive ADP method for path planning of unmanned ground vehicles at road intersections. RL has been widely used in feedback control problems [115]. In general, closed-loop stability with MPC is guaranteed and various MPC strategies have been proposed. However, the performance of MPC and its stability guarantee are limited by the need for an accurate model of the system, and accurate system models are difficult to obtain in real control systems. Generally, states and actions are continuous and it is almost impossible to represent them exactly, so function approximation tools must be used [116]. Several studies combined the advantages of RL with MPC to solve optimal control problems and generated a new field [117]. Zanon and Gros [118] proposed the combination of RL and MPC to exploit the advantages of both methods and thereby obtain an optimal and safe controller, while ensuring the robustness of RL-based MPC. Subsequently, data-driven MPC using RL has become an effective approach [119].

    V.BOOSTING ADP VIA DATA UTILIZATION AND RL

The concept of RL appeared earlier than that of ADP. The work of the psychologist Skinner and his followers studied how animals learn to change their behaviors according to reward and punishment. The latest work in the field of RL still uses the traditional reward "r" instead of the utility function "U", and RL emphasizes the immediate reward over a known utility function. Although the focus of ADP is different from that of RL and the two lines of work are relatively independent, the ideas behind many methods show that they have common roots. Werbos first combined RL with dynamic programming to build a framework that approximates the Bellman equation and proposed HDP in the 1970s. The original proposition of this approach was essentially the same as the formulation of temporal difference (TD) learning in RL [6]. Similarly, ADHDP and Q-learning both employ the state-action function to evaluate the current policy [10]. Overall, ADP/RL is a class of algorithms obtained from solving optimal control problems by approximation methods.

The Markov decision process (MDP) is a mathematical framework for obtaining optimal decisions in stochastic dynamic systems. As a key theory of RL, almost all RL problems can be modeled as MDPs. In this paper, an MDP is denoted by the tuple ⟨S, A, P, R, γ⟩, where S is the state set of the environment, A is the action set, P is the state transition probability, R is the reward set, and γ ∈ (0, 1] is the discount factor. The agent (often called the controller in control theory) chooses actions to generate a trajectory sequence $\tau = \{s_0, a_0, r_0, s_1, a_1, r_1, \ldots\}$. In RL, the goal is to find the optimal policy that maximizes rewards or minimizes penalties for the agent interacting with the environment.

In the early stages of ADP/RL, theoretical and algorithmic progress was slow due to the limitations of hardware facilities and available system information. The development of system identification techniques has made it possible to model nonlinear systems using data-driven methods, thereby opening up a new era of research [6], [120]–[128]. In [6], Lewis and Liu illustrated the contributions of the stochastic encoder-decoder predictor and principal component analysis to modeling the world through a brain-like approach, and emphasized the importance of neural networks. Some model-based approaches have shown promising results. Lee and Lee [122] referred to this type of method as J-learning (based on the value function). The Bellman optimality equation can be expressed as
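$$J^{*}\bigl(x(k)\bigr) = \min_{u(k)}\Bigl\{U\bigl(x(k), u(k)\bigr) + \gamma J^{*}\bigl(x(k+1)\bigr)\Bigr\},$$

written here for a deterministic plant with utility $U$ and discount factor $\gamma$ (a standard form, given for reference).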

Pang and Jiang [123] used the model-based method to discuss the robustness of PI for the LQR problem and proposed an off-policy optimistic least-squares PI algorithm. They exploited the dynamical information of the system in the derivation process and incorporated stochastic perturbations. Lu et al. [124] demonstrated the stability of closed-loop systems using optimal parallel controllers with augmented performance index functions for tracking control. They extended practical problems to a virtual space through parallel system theory, and used methods such as neural networks to model the systems and achieve optimal control.

However, this model-based learning approach can only be effective in the part of the state space covered by empirical information, and the calculated control actions and performance predictions are constrained by the amount of information. Different from the model-based learning method, Q-learning, proposed by Watkins and Dayan [125], uses the Q-function to represent the value of an action in the current state. This type of function already contains information about the system and the utility function. Compared with J-learning, it is easier to obtain control policies by using Q-learning, especially for unknown nonlinear systems. The Bellman optimality equation can be expressed as
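$$Q^{*}\bigl(x(k), u(k)\bigr) = U\bigl(x(k), u(k)\bigr) + \gamma\min_{u(k+1)}Q^{*}\bigl(x(k+1), u(k+1)\bigr),$$

again written for a deterministic plant with utility $U$ and discount factor $\gamma$ (a standard form, given for reference).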

Note that the above formula is described for deterministic systems. Li et al. [126] solved the optimal switching problem of autonomous subsystems and analyzed the boundedness of the approximation error in the iterative process. Jiang et al. [127] used Q-learning to improve the convergence speed of optimal policies for path planning and obstacle avoidance problems. In [95], a new off-policy model-free approach was used to study the networked multi-player game; optimal control was achieved for systems with network-induced delays, and the convergence of the algorithm was demonstrated. In addition, Peng et al. [96] proposed an internal reinforce Q-learning scheme and analyzed the convergence and system stability of the iterative algorithm. Based on local information from neighbors, they designed a special internal reward signal to enhance the agent's ability to receive long-term information. The model-free idea applied in the field of control is only the tip of the iceberg.

TD is an RL method that can learn directly from the environment without requiring complete trajectory sequences. Sarsa and Q-learning are two classic TD algorithms. Sarsa evaluates and improves the same policy that generates the data (on-policy), whereas Q-learning uses data sampled from other policies to improve the target policy (off-policy); a minimal sketch contrasting the two updates is given below. Next, we introduce two accelerated methods that can be applied to TD algorithms.
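Before turning to these acceleration techniques, the following tabular sketch contrasts the one-step Sarsa and Q-learning updates mentioned above. The random MDP, exploration rate, and learning rate are illustrative placeholders, and rewards are maximized following the RL convention used in this section.

```python
# Sketch: one-step TD updates for Sarsa (on-policy) and Q-learning (off-policy)
# on a small random MDP. The environment is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, alpha, eps = 5, 3, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # transition probabilities
Rw = rng.uniform(size=(nS, nA))                   # rewards r(s, a)

def eps_greedy(Q, s):
    return rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())

def step(s, a):
    return rng.choice(nS, p=P[s, a]), Rw[s, a]

Q_sarsa, Q_qlearn = np.zeros((nS, nA)), np.zeros((nS, nA))
s, a = 0, eps_greedy(Q_sarsa, 0)
for t in range(20000):
    # --- Sarsa: evaluate and improve the behavior policy itself (on-policy) ---
    s2, r = step(s, a)
    a2 = eps_greedy(Q_sarsa, s2)                  # action actually taken next
    Q_sarsa[s, a] += alpha * (r + gamma * Q_sarsa[s2, a2] - Q_sarsa[s, a])
    s, a = s2, a2
    # --- Q-learning: improve the greedy target policy from arbitrary data (off-policy) ---
    sb, ab = rng.integers(nS), rng.integers(nA)   # behavior data from another policy
    sb2, rb = step(sb, ab)
    Q_qlearn[sb, ab] += alpha * (rb + gamma * Q_qlearn[sb2].max() - Q_qlearn[sb, ab])

print(Q_sarsa.round(2))
print(Q_qlearn.round(2))
```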

The first method is experience replay, which is mainly used to overcome the problems of correlated data and non-stationary distributions and can improve data utilization efficiency [129], [130]. Pieters and Wiering [131] proposed an algorithm combining experience replay with Q-learning; the simulation results showed that its performance was significantly improved over the traditional Q-learning algorithm. The experience replay technique is not only used in Q-learning but can also be combined with other deep RL algorithms, which has led to good performance in improving convergence speed and data utilization efficiency [132]. Many scholars in the field of control have been inspired to combine ADP algorithms with experience replay to improve performance. For discrete-time nonlinear systems, Luo et al. [133] designed a model-free optimal tracking controller by using policy gradient ADP with experience replay. It was realized based on the action-critic structure, which was applied to approximate the iterative Q-function and the iterative control policy, and the convergence of the iterative algorithm was established through theoretical analysis.
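As an illustration, the following minimal sketch shows the basic mechanics of a replay buffer: transitions are stored as they are generated and later sampled in random mini-batches, which breaks the temporal correlation of consecutive samples. The buffer capacity, batch size, and the toy batched Q-update are illustrative assumptions.

```python
# Sketch: a minimal experience replay buffer with a batched Q-learning update.
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer()
Q = np.zeros((10, 2))
gamma, alpha = 0.95, 0.05
rng = np.random.default_rng(2)

for k in range(5000):
    s, a = rng.integers(10), rng.integers(2)       # toy exploration
    r, s2 = float(rng.uniform()), rng.integers(10)
    buffer.push(s, a, r, s2)
    if len(buffer) >= 64:                          # learn from a replayed mini-batch
        S, A, R, S2 = buffer.sample(64)
        td = R + gamma * Q[S2].max(axis=1) - Q[S, A]
        Q[S, A] += alpha * td                      # batched Q-learning update

print(Q.round(2))
```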

The second method is called eligibility traces. Traditional Q-learning uses only a one-step estimate; if more information along the trace is considered, updating the policy becomes more efficacious [134]. The eligibility traces method can combine multi-step information to update unknown parameters. Eligibility traces were first introduced into the TD learning process to form an efficient learning algorithm named TD(λ) in [135]. Considering the direction of the trace, there are forward-view and backward-view implementations. Although the expressions of the two algorithms are different, their intrinsic essence is the same; in engineering, the backward view is generally adopted for convenience of calculation. Inspired by the field of RL, many scholars combine ADP with both the forward view and the backward view of eligibility traces. Compared with traditional ADP algorithms, the performance of these algorithms is significantly improved [136]. Al-Dabooni and Wunsch [137] proposed a forward-view ADHDP(λ) algorithm by combining ADHDP with eligibility traces and proved its UUB under certain conditions. Ye et al. [138] proposed a more accurate and faster algorithm by introducing backward-view eligibility traces into GDHP, and the superiority in computational efficiency was verified by simulation analysis.
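The following sketch illustrates the backward view with accumulating eligibility traces for value estimation under a fixed behavior policy (TD(λ)-style updates on a made-up MDP; the parameters are illustrative assumptions).

```python
# Sketch: backward-view TD(lambda) with accumulating eligibility traces for
# value estimation under a fixed uniform behavior policy.
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, alpha, lam = 6, 2, 0.9, 0.05, 0.8
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities
Rw = rng.uniform(size=(nS, nA))                 # rewards r(s, a)

V = np.zeros(nS)
for episode in range(500):
    e = np.zeros(nS)                            # eligibility trace vector
    s = rng.integers(nS)
    for t in range(50):
        a = rng.integers(nA)                    # fixed uniform behavior policy
        s2 = rng.choice(nS, p=P[s, a])
        delta = Rw[s, a] + gamma * V[s2] - V[s] # one-step TD error
        e *= gamma * lam                        # decay all traces
        e[s] += 1.0                             # accumulate trace of visited state
        V += alpha * delta * e                  # credit assignment along the trace
        s = s2

print(V.round(3))
```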

In addition, inverse RL [139], [140] has received extensive attention in academia in recent years. This theory is able to solve inverse problems in control systems, machine learning, and optimization. Unlike methods that directly map from states to control inputs or use system identification to learn control policies, inverse RL methods attempt to reconstruct a more adaptive reward function. This reward function prevents small changes in the environment from making the policy unusable. Lian et al. [141] used the inverse RL method to solve the two-player zero-sum game problem and established two algorithms according to whether the model is used or not. Overall, RL has achieved remarkable success on some complex problems [2]. RL has also attracted a lot of attention from a control point of view due to its model-free property and its interaction with real-world scenarios. With the application of RL algorithms in the control field, advanced methods based on learning and environmental interaction will demonstrate even more powerful capabilities in future work.

    VI.TYPICAL APPLICATIONS OF ADP AND RL

Compared with other optimal control methods, ADP has significant advantages in dealing with complex nonlinear systems. Owing to this strong capability, ADP is widely used in many fields such as wastewater treatment, smart power grids, intelligent transportation, aerospace, aircraft, robotics, and logistics.

    A. Wastewater Treatment Applications

The control of the wastewater treatment process is a typical complex nonlinear control problem, and it is also one of the difficult problems in the field of process control. Accompanied by a large number of disturbances, the biochemical reaction mechanisms are very complex. There are many factors that can influence the effect of wastewater treatment, such as the dissolved oxygen concentration and the nitrate concentration. A large part of the research is verified on the Benchmark Simulation Model No. 1 (BSM1) platform. The goal of the controller design is to reduce energy consumption and cost as much as possible while ensuring that the effluent quality meets the national discharge standard and that the plant operates stably. The design framework for control of wastewater treatment plants is shown in Fig. 3.

Fig. 3. The design framework for control of wastewater treatment plants.

Control of a single variable has been considered. For example, an online ADP scheme was proposed in [142] using the echo state network as the function approximation tool, realizing high-performance control of the dissolved oxygen variable in the wastewater treatment plant.

To improve the efficiency of wastewater treatment, many scholars consider both the dissolved oxygen and the nitrogen concentration. For example, Wang et al. [67] combined the DHP algorithm with the event-triggered mechanism to improve resource utilization and applied it to multi-variable tracking control of wastewater treatment. By using PI and the experience replay mechanism, Yang et al. [129] proposed a dynamic-priority policy gradient ADP method and applied it to multi-variable control of wastewater treatment without the system model.

In the process of wastewater treatment, the setpoints of the operating variables are generally set by manual experience. Considering the uncertain environment and disturbance factors, manual experience is often difficult to adapt to different industrial conditions, and it is difficult to balance energy consumption and water quality during operation. Many scholars have therefore studied the optimization of the wastewater treatment process. Qiao et al. [143] developed an online optimization control method, which not only met the requirements on effluent water quality but also reduced the operating cost of the system. For the setpoint of dissolved oxygen, a model-free RL algorithm [144] was proposed that can learn autonomously and actively adjust the setpoint of dissolved oxygen.

    B. Power System Applications

Power systems are a class of complex nonlinear plants with multiple variables. The emergence of the smart grid has opened up a new direction for power systems. Smart grid design includes renewable energy generation, transmission, storage, distribution, the optimization of household appliances, and so on.

Recently, ADP/RL algorithms have been widely used in the field of the smart grid due to their advantages. An ADHDP method was applied to solve the residential energy scheduling problem [145], which effectively improved power consumption efficiency. For multi-battery energy storage systems with time-varying characteristics, a new ADP-based algorithm was proposed in [146]. The robust stabilization of mismatched nonlinear systems was achieved by combining auxiliary systems and policy learning techniques under dynamic uncertainties [83], and experimental verification was carried out on a power system. An adaptive optimal data-driven control method based on ADP/RL was presented for the three-phase grid-connected inverter of a virtual synchronous generator [147]. To ensure the stable operation of smart grids with load variations and multiple renewable generations, a robust intelligent algorithm was proposed in [148]; it utilized a neural identifier to reconstruct the unknown dynamical system and derived approximate optimal control and worst-case disturbance laws. Wang et al. [22] proposed an ADP method with augmented terms based on the GrHDP framework. They constructed new weight-updating rules by adding adjustable parameters and successfully applied them to a large power system.

    C. Other Applications

The ADP method has also been applied to other fields such as intelligent transportation [149], [150], robotics [7], [51], [127], [151], aerospace [152], [153], smart homes [154], [155], and cyber security [156]–[160], among others. Liu et al. [149] proposed a distributed computing method to implement switch-based ADP and verified the effectiveness of the method in two cases of urban traffic and architecture. The method divides the system into multiple agents; to avoid switching policy conflicts, a heuristic algorithm was proposed based on consensus dynamics and Nash equilibrium. Wen et al. [151] combined ADP with RL to propose a direct online HDP approach for knee robot control and its clinical application in human subjects. For the optimal attitude-tracking problem of hypersonic vehicles, Han et al. [152] and Zhao et al. [153] developed a novel PI algorithm and an observation-based RL framework, respectively, which ensured system stability in the presence of random disturbances. Wei et al. [154] proposed a deep RL method to control the air conditioning system by recognizing facial expression information to improve the work efficiency of employees. Hosseinloo et al. [155] established an event-based microclimate control algorithm to achieve an optimal balance between energy consumption and occupant comfort. With the widespread application of cyber-physical systems, their security issues have received wide attention. Nguyen and Reddi [156] provided a very comprehensive survey of RL technology routes for cyber security and discussed future research directions. For nonlinear discrete-time systems with event-triggered [157] and stochastic communication protocols [158], Wang et al. constructed different action-critic frameworks and discussed the boundedness of the error and the stability of the system based on Lyapunov theory, respectively. More and more ADP-based methods [159], [160] are focusing on improving cyber security. With the rapid development of ADP/RL, its applications will become even more extensive.

    VII. SUMMARY AND PROSPECT

    ADP and RL have made significant progress in theoretical research and practical applications, showing great potential for future tasks.This paper has surveyed the theoretical work and application scenarios by analyzing discrete-time and continuous-time systems, focusing on the development of advanced intelligent learning and control.Given the complexity of current system environments and tasks, many theoretical and algorithmic problems remain unsolved.Based on the present analysis of ADP, this paper concludes with several essential directions.

    1) Most current ADP schemes assume that the function approximation process is exact.However, as the number of network layers and iterations grows, the approximation error introduced by the function approximator cannot be ignored.In the actual iterative process, each application of the function approximator produces an approximation error that propagates to the next iteration.These errors may interact and accumulate over future iterations, leading to a “resonance”-type phenomenon that undermines the reliability of the solution (a scalar sketch of this effect is given after this list).Therefore, both the theory and the practical application of ADP need to account for the convergence of ADP algorithms in the presence of approximation errors in policy evaluation and policy improvement.

    2) The ADP approach currently addresses mostly systems with low-dimensional states and controls; there is still no effective solution for the high-dimensional, continuous state and control spaces that arise in real complex systems.With the development of RL and even deep RL, optimal regulation and trajectory tracking for high-dimensional systems become feasible with the support of big data technology.It is important to develop ADP methods with fast convergence and low computational complexity by introducing different forms of relaxation factors.

    3) It is of great importance to utilize advanced network technologies to reduce communication traffic and prolong device lifespan.The round-robin protocol, the try-once-discard protocol, the stochastic communication protocol, and the event-triggered protocol are essential for improving performance and saving resources (a minimal event-triggered sketch is given after this list).Built upon these protocols, combining the ADP technology with decentralized control, robust control, and MPC is crucial for achieving optimal control while minimizing resource consumption.

    4) In recent years, the study of brain science and brain-like intelligence has attracted significant interest from researchers worldwide.Optimality theory is closely related to the effort to understand brain intelligence, since most organisms in nature strive to conserve limited resources while achieving their goals in a parallel and near-optimal manner.It is therefore important to draw on brain-like intelligence to extend ADP and attain optimal decision-making and intelligent control of complex systems in an online manner.Guaranteeing the stability, convergence, optimality, and robustness of brain-like intelligent ADP algorithms still requires the efforts of a large number of scholars.

    5) The field of ADP offers a wealth of results that can, in a theoretical sense, guide many systems toward optimal objectives.In practice, however, for a large number of nonlinear systems, coping with abrupt changes in control inputs and constructing the underlying dynamical systems are extremely challenging.Parallel control can be viewed as a virtual-reality interactive control method: it reconstructs an artificial counterpart of the actual system from real input and output data (see the sketch following this list).By combining ADP with parallel control, control strategies for real physical systems are expected to improve greatly in the future.
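    Regarding direction 1), the effect of per-step approximation errors can already be observed in a scalar setting.The sketch below is a minimal illustration under assumed dynamics and costs (not taken from any surveyed work): exact value iteration for a scalar linear-quadratic problem reduces to a Riccati-type recursion, and adding a bounded perturbation at every step mimics the critic's approximation error, so the gap between the exact and perturbed value sequences can be monitored.

```python
import numpy as np

# Minimal scalar illustration (assumed dynamics/cost, not from the surveyed papers):
# value iteration for x_{k+1} = a*x_k + b*u_k with stage cost q*x^2 + r*u^2.
# The value function is quadratic, V_i(x) = p_i * x^2, so exact value iteration
# reduces to the Riccati-type recursion below.  A bounded perturbation added at
# every step mimics the per-iteration approximation error of a critic network.

a, b, q, r = 1.2, 1.0, 1.0, 1.0
eps_bar = 0.05                       # assumed bound on the per-step critic error
rng = np.random.default_rng(1)

def vi_step(p):
    """One step of exact value iteration on the quadratic kernel p."""
    return q + a**2 * p - (a * b * p) ** 2 / (r + b**2 * p)

p_exact, p_noisy = 0.0, 0.0
for i in range(60):
    p_exact = vi_step(p_exact)
    p_noisy = vi_step(p_noisy) + rng.uniform(-eps_bar, eps_bar)
    if i % 10 == 0:
        print(f"iter {i:2d}: exact {p_exact:.4f}  perturbed {p_noisy:.4f}  "
              f"gap {abs(p_exact - p_noisy):.4f}")
```

    The perturbed sequence stays in a neighborhood of the exact one whose size depends on the error bound, which is exactly the kind of guarantee that error-aware convergence analyses aim to establish.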
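    Regarding direction 3), the resource savings offered by an event-triggered protocol can be sketched with a simple loop in which the controller holds the last transmitted state and a new transmission occurs only when the measurement gap exceeds a state-dependent threshold.The plant, feedback gain, and triggering parameter below are assumed values chosen purely for illustration.

```python
import numpy as np

# Hypothetical sketch of an event-triggered feedback loop: the controller holds
# the last transmitted state, and a new transmission occurs only when the
# measurement gap exceeds a state-dependent threshold.  The plant, gain, and
# threshold are assumed values, not taken from the surveyed papers.

a, b = 1.2, 1.0          # unstable scalar plant x_{k+1} = a*x_k + b*u_k
K = 0.3                  # stabilizing gain: a - b*K = 0.9
sigma = 0.3              # relative triggering threshold

x = 1.0
x_sent = x               # state held by the controller since the last event
events = 0

for k in range(60):
    if abs(x_sent - x) > sigma * abs(x):    # event-triggering condition
        x_sent = x                          # event: transmit the fresh state
        events += 1
    u = -K * x_sent                         # control built from the held state
    x = a * x + b * u

print(f"{events} transmissions over 60 steps, final state {x:.2e}")
```

    With the values above, transmissions occur roughly once every three steps while the state still decays, which is the kind of trade-off between resource consumption and performance that the protocols in direction 3) formalize.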
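    Regarding direction 5), the reconstruction step behind parallel control can be indicated roughly as follows: a model is identified from recorded input-state data, and the resulting virtual system is then run alongside the real one, providing the artificial counterpart on which policies could be evaluated.The linear plant, noise level, and test policy below are hypothetical, and actual parallel-control formulations handle far richer dynamics.

```python
import numpy as np

# Hypothetical illustration of the reconstruction step behind parallel control:
# identify a model from recorded input/state data by least squares and run the
# resulting "virtual" system next to the real one.  The true plant below is an
# assumed linear system; real parallel-control schemes handle far richer dynamics.

rng = np.random.default_rng(2)
A_true = np.array([[0.95, 0.10], [-0.05, 0.90]])
B_true = np.array([[0.0], [0.10]])

# Collect input-state data from the real plant (with small measurement noise).
N = 200
X = np.zeros((2, N + 1))
U = rng.uniform(-1.0, 1.0, size=(1, N))
X[:, 0] = [1.0, -1.0]
for k in range(N):
    X[:, k + 1] = A_true @ X[:, k] + (B_true @ U[:, k]) + 0.01 * rng.standard_normal(2)

# Least-squares fit:  X_next ~ [A_hat  B_hat] @ [X; U]
Z = np.vstack([X[:, :N], U])                 # regressor, shape (3, N)
Theta, *_ = np.linalg.lstsq(Z.T, X[:, 1:].T, rcond=None)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]

# Run the reconstructed (virtual) system in parallel with the real one.
x_real = np.array([1.0, 0.5])
x_virt = x_real.copy()
for k in range(50):
    u = -0.5 * x_real[1]                     # any test policy
    x_real = A_true @ x_real + B_true.ravel() * u
    x_virt = A_hat @ x_virt + B_hat.ravel() * u

print("identified A:\n", np.round(A_hat, 3))
print("state gap after 50 steps:", np.linalg.norm(x_real - x_virt))
```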
