
Cooperative and Competitive Multi-Agent Systems: From Optimization to Games

IEEE/CAA Journal of Automatica Sinica, 2022, Issue 5

Jianrui Wang, Yitian Hong, Jiali Wang, Jiapeng Xu, Yang Tang, Qing-Long Han, and Jürgen Kurths

Abstract—Multi-agent systems can solve scientific issues related to complex systems that are difficult or impossible for a single agent to solve through mutual collaboration and cooperation optimization. In a multi-agent system, agents with a certain degree of autonomy generate complex interactions due to correlation and coordination, which are manifested as cooperative/competitive behavior. This survey focuses on multi-agent cooperative optimization and cooperative/non-cooperative games. Starting from cooperative optimization, the studies on distributed optimization and federated optimization are summarized. The survey mainly focuses on distributed online optimization and its application in privacy protection, and overviews federated optimization from the perspective of privacy protection mechanisms. Then, cooperative games and non-cooperative games are introduced to expand the cooperative optimization problems from the two aspects of minimizing global costs and minimizing individual costs, respectively. Multi-agent cooperative and non-cooperative behaviors are modeled by games from both static and dynamic aspects, according to whether each player can make decisions based on the information of other players. Finally, future directions for cooperative optimization, cooperative/non-cooperative games, and their applications are discussed.

Index Terms—Cooperative games, counterfactual regret minimization, distributed optimization, federated optimization, fictitious self-play, mean field games, multi-agent reinforcement learning, non-cooperative games.

      I. INTRODUCTION

In the past few decades, multi-agent systems have received considerable attention [1]–[3] due to the development of automatic control, computer technology and artificial intelligence. Multi-agent systems are able to solve complex problems through interactions between agents and to improve efficiency and robustness [4]–[6], and they are a current research hotspot in various fields, such as intelligent transportation [7] and smart grids [8]. In multi-agent systems, the interactions between agents may be reflected in cooperation and/or competition. Several factors influence the cooperative/competitive behaviors of the agents, such as cost functions, neighboring information and environmental factors, which affect the performance of multi-agent systems in the optimization and decision-making process [9]. This survey analyzes cooperative/non-cooperative behaviors of multi-agent systems from optimization to games.

From the perspective of cooperative optimization, agents cooperate and diffuse information so as to obtain a globally optimal solution [10]. Cooperative optimization can be divided into centralized optimization and distributed optimization from the perspective of network structures [11]. Due to its advantages in processing large-scale data and improving system robustness [12], this survey focuses on distributed optimization. Since there is no prior knowledge of the objective functions when the information is highly uncertain and unpredictable in many applications, distributed online optimization has been put forward to address this problem. The survey summarizes recent work on distributed online optimization and its applications concerned with privacy protection. Afterwards, specifically for privacy protection issues, federated optimization [13] is introduced to provide additional privacy protection by leveraging the computing power of users to save significant network bandwidth.

In multi-agent systems, due to the heterogeneity of agents, the complexity of working environments, and the diversity of system objectives, game theory is introduced to model the cooperative/competitive behaviors for individual/global optimization goals. Games can be divided into cooperative games and non-cooperative games, judging from the cooperative/non-cooperative behaviors of agents. Similarly, games can also be classified into static games and dynamic games according to the action sequence of agents [14]. Cooperative games are used to ensure cooperative behavior by minimizing the global cost function [15]. In static cooperative games, agents seek Pareto optimality through multi-objective optimization [16], while in dynamic cooperative games, Markov decision processes are generalized to multiple cooperating decision-makers, and agents obtain a series of optimal strategies [17].

      Fig. 1. The overall structure of the survey.

However, agents may not cooperate in all situations, which means that there may exist competition in multi-agent systems, such as pursuit-evasion games [18] and capture-the-flag games [19], [20]. Unlike cooperative games, non-cooperative games maximize individual payoffs or minimize individual costs to achieve the Nash equilibrium (NE) [14]. In static non-cooperative games, complete information is usually considered to analytically solve the NE solution. In practice, however, agents do not necessarily know their exact costs or the types of their opponents. Therefore, considering the situation of incomplete information, Bayesian games are useful to obtain the NE solution [21]. Similarly, in dynamic non-cooperative games, under the condition of complete information, researchers often minimize the cost function and maximize the reward function to find the optimal strategy for normal-form games [14], and construct subgames and solve the NE through Nash backward induction [22] for extensive-form games. Under the condition of incomplete information, game-theoretic solutions, such as regret matching [23] and fictitious play [24], are widely used and combined with artificial intelligence algorithms, such as reinforcement learning (RL) [25] and deep learning [26]. Moreover, mean field games (MFG) are considered to analytically solve the NE solution for large-scale dynamic non-cooperative games [27]. In recent years, research on dynamic non-cooperative games has made great breakthroughs, such as AlphaZero [28], Pluribus [29] and AlphaStar [30].

In this survey, we analyze the cooperative and competitive behaviors of multi-agent systems from three aspects: cooperative optimization, cooperative games, and non-cooperative games, and summarize future work and applications. At present, many researchers have published surveys on cooperative or/and non-cooperative multi-agent systems. Unlike the survey on cooperative optimization [31], we focus on the recent work of distributed online optimization and the research of distributed online optimization and federated optimization in privacy protection; unlike the survey on games [32], we focus on both cooperative games and non-cooperative games from the perspectives of static games and dynamic games, respectively; unlike the survey on both cooperative optimization and games [33], we bridge the transition from cooperative optimization to games, that is, cooperative games, which aims at analyzing the emergence of cooperative behavior.

      This survey is organized as follows. Section II introduces cooperative optimization composed of distributed optimization and federated optimization. Section III focuses on cooperative games and non-cooperative games, from static and dynamic aspects respectively. Finally, Section IV summarizes potential future work and applications. The overall structure of this survey is reflected in Fig. 1, and the abbreviations in this survey are summarized in Table I.

II. COOPERATIVE OPTIMIZATION

Cooperative optimization is inspired by the principle of teamwork: given an objective function, the interactions among agents are exploited to optimize the decision variables. Cooperative optimization has a solid theoretical foundation and outstanding practical performance, which has aroused widespread attention. In this section, we mainly focus on two kinds of cooperative optimization with different structures but similar fields of application, namely distributed optimization [31] and federated optimization [13]. The differences between the two are shown in Table II. In distributed optimization, balanced and independent and identically distributed (IID) data are processed on fewer devices with unlimited communication, while in federated optimization, unbalanced and non-IID data are processed on massive devices with limited communication, coordinated by a central server.

      TABLE I SUMMARY OF ABBREVIATIONS IN THIS SURVEY

      A. Distributed Optimization

In distributed optimization of multi-agent systems, agents cooperate to minimize the global cost function, which is a sum of local cost functions, based on their own information and neighboring information. Distributed optimization is widely studied in various fields, such as wireless sensor networks [34] and integrated energy networks [35]. Unlike existing surveys, such as those on distributed optimization algorithms for undirected graphs [31] and distributed optimization for electric power systems [36], this survey mainly focuses on the latest research hotspots in recent years, i.e., distributed online optimization and its applications in privacy protection.

Distributed online optimization: In traditional distributed optimization, it is usually assumed that every agent knows its local private cost function in advance [31]. However, in certain practical scenarios, there is no prior knowledge and the cost function is time-varying because of uncertain information. Therefore, distributed online optimization (DOO) has been proposed to minimize the accumulation of time-varying global cost functions [37]. In order to make the problem mathematically tractable, existing works such as [38], [39] are often set in the convex case. Distributed online convex optimization (DOCO) aims at minimizing the objective function and reducing the regret, which quantifies the performance of DOO algorithms [40]. Regret can be divided into two types: static regret compares the computed decisions with the best fixed decision in hindsight, while dynamic regret compares them with the time-varying minimizers of the global cost functions [41].
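
For concreteness, one standard way to formalize these two notions of regret (the notation below is ours, not taken from the cited works) is the following, where $f_t$ denotes the global cost revealed at round $t$, $\mathcal{X}$ the feasible set, and $x_t$ the decision made before $f_t$ is revealed:

```latex
% Static regret: comparison with the best fixed decision in hindsight.
\mathrm{Reg}_s(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x),
\qquad
% Dynamic regret: comparison with the per-round minimizers x_t^* \in \arg\min_{x \in \mathcal{X}} f_t(x).
\mathrm{Reg}_d(T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*).
```

An online algorithm is typically considered satisfactory when its regret grows sublinearly in the horizon $T$.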

We now focus on solutions to DOCO problems. Consider first an ideal situation without constraints; in that case, the online subgradient method is one way to solve DOCO problems [42]. In recent years, with exact gradient information, gradient-tracking-based methods have proven suitable for handling time-varying cost functions. The dependency assumption is removed in [43], where the local objective functions are strongly convex in the gradient tracking algorithm. Then, taking both exact gradients and stochastic/noisy gradients into consideration, a distributed online gradient tracking algorithm is proposed in [39] to handle DOCO problems with an aggregative variable. In addition, it is not necessary to assume any prior information on system evolution when considering sparse time-varying optimization, and the model proposed in [44] is suitable for adversarial frameworks as well. Considering the heterogeneity of network nodes, the distributed any-batch mirror descent algorithm is proposed in [38], which is based on distributed mirror descent with a fixed computing time every round, ensuring the speed of the overall convergence process of heterogeneous nodes. Node heterogeneity is also reflected in objective functions that consist of two parts, a time-varying loss function and a regularization function. In this case, DOO algorithms based on approximate mirror descent are proposed in [49].
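
To make the common template behind such methods concrete, the following is a minimal sketch of distributed online (sub)gradient descent with a consensus step. The ring network, quadratic time-varying losses, and diminishing step size are illustrative assumptions, not taken from any specific cited algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, dim, T = 4, 3, 200

# Doubly stochastic mixing matrix of a ring network (each agent averages with its two neighbors).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i + 1) % n_agents] = 0.25
    W[i, (i - 1) % n_agents] = 0.25

x = np.zeros((n_agents, dim))                      # local decisions x_{i,t}
targets = rng.normal(size=(T, n_agents, dim))      # unknown, time-varying data

def grad(i, t, xi):
    # Gradient of the local time-varying loss f_{i,t}(x) = 0.5 * ||x - a_{i,t}||^2,
    # which is revealed only after the decision at round t is made.
    return xi - targets[t, i]

losses = []
for t in range(T):
    # Incur the loss of the current decisions; then f_{i,t} is revealed.
    losses.append(0.5 * np.sum((x - targets[t]) ** 2))
    eta = 1.0 / np.sqrt(t + 1)                     # diminishing step size
    # 1) consensus step: average with neighbors; 2) local gradient step on f_{i,t}.
    x_mix = W @ x
    x = np.array([x_mix[i] - eta * grad(i, t, x_mix[i]) for i in range(n_agents)])

print("average per-round global loss:", np.mean(losses))
```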

      TABLE III SUMMARY OF DOO

In addition, considering that constraints are common in practical distributed optimization problems [45], [48], auxiliary constraints are added to the distributed optimization algorithm to improve the practicality of the optimization results. When DOCO problems are studied under global/local set constraints, a dual subgradient averaging algorithm is proposed in [45]. Besides, the celebrated mirror descent algorithm is improved in [37] and a regret bound is established while satisfying the constraints. It is worth noting that when the local cost functions are strongly pseudoconvex, auxiliary optimization strategies are utilized to handle DOO in [47]. After that, the objective function is assumed to be nonconvex, and a DOO algorithm based on consensus and the mirror descent algorithm is proposed in [51]. When DOCO is constrained by coupled inequality constraints, the objective functions and constraint functions are revealed over time. In [48], a distributed online primal-dual dynamic mirror descent algorithm is proposed to handle time-varying coupled inequality constraints. After that, the primal-dual algorithm is modified in [39], extending the result to unbalanced graphs without making any assumption about bounded parameters.
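
As a rough sketch of this problem class (the precise assumptions and constraint models differ from paper to paper), a DOCO problem with coupled inequality constraints can be written as

```latex
\min_{x_{i,t} \in X_i} \; \sum_{t=1}^{T} \sum_{i=1}^{N} f_{i,t}(x_{i,t})
\quad \text{s.t.} \quad \sum_{i=1}^{N} g_{i,t}(x_{i,t}) \le 0, \qquad t = 1, \dots, T,
```

where the local loss $f_{i,t}$ and constraint function $g_{i,t}$ are revealed to agent $i$ only after its decision $x_{i,t}$ is made, and performance is measured by the regret together with the accumulated constraint violation.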

Privacy protection: In some applications, local information is likely to be private and needs to be protected, such as in smart grids, financial systems, and medical treatments. Although there exist research works on privacy protection in distributed optimization [35], most of them focus on static optimization. Note that when agents share information, their privacy may be leaked at any time, so it is necessary to introduce DOO algorithms to solve privacy protection issues. Security is the top priority and is often treated as a constraint in DOCO problems [46]. Take the smart grid as an example: contingencies, such as line overloads and voltage violations, should be immediately and automatically mitigated by the network itself. A distributed corrective optimization approach is put forward in [52] in order to mitigate line overloads in an online closed-loop way. The system measurements and security constraints in the proposed DOCO algorithm ensure security.

Under the premise of ensuring security, there are two options for implementing privacy protection in distributed optimization, namely differential privacy and information encryption [53]. From the aspect of privacy masking, differential privacy is one of the most insightful privacy strategies. In the DOCO problem of [50], differential privacy is introduced, and a distributed stochastic subgradient-push algorithm considering differential privacy is proposed. Although the local cost functions are masked, sublinear regrets can still be achieved for diverse cost functions. However, the noise added to strengthen differential privacy inevitably affects data availability, leading to a trade-off between privacy level and calculation accuracy [54]. Therefore, information encryption is utilized to protect privacy indirectly, because encryption techniques cannot be directly applied to distributed optimization without third-party assistance. As seen from Table III, privacy protection in DOO requires in-depth research, and there are several expansion directions, such as complex constraints and non-convexity. It is worth noting that there already exist distributed offline optimization algorithms that solve constrained/unconstrained non-convex optimization problems over directed/undirected communication graphs [55]–[59]. Therefore, non-convex DOO will be a trend in future research to address physical applications where optimization problems are non-convex.

      Fig. 2. The federated learning framework.

      B. Federated Optimization

In distributed optimization, data centers need to process a large amount of data on fewer devices. In federated optimization, the data volume of a single device is small, while the number of devices is huge [60]. This data distribution is very common on modern mobile devices, but the individual data are basically privacy sensitive. Federated optimization is a structural concept first proposed by Konečný et al. [13], which makes structural improvements to distributed optimization especially for privacy protection issues. It is worth noting that most machine learning algorithms require a large amount of data for training.

How to integrate multi-party data for machine learning training under the premise of privacy is a problem worth considering. Without considering individual data privacy, data sharing is a simple way to realize the training of machine learning algorithms. As individuals pay more attention to their privacy, general data sharing is no longer allowed. Federated learning is a solution for federated optimization, which focuses on adapting traditional machine learning to the realistic background of privacy protection and data islands. According to the different data distributions in practical problems, federated learning can be divided into three categories [61]: horizontal federated learning, vertical federated learning and federated transfer learning.

      Fig. 2 briefly shows a very common structure of horizontal federated learning, specifically as follows:

      1) Participants download the latest global model from the server;

      2) Participants update local model parameters according to their existing data;

      3) Participants pass the trained local model back to the server in an encrypted form;

      4) The server consolidates the various models received and updates the global model parameters.

Repeat these four steps until convergence. In this way, distributed training is realized without disclosing privacy. Unlike privacy protection in distributed optimization, thanks to the structure of federated optimization, model aggregation is a privacy protection mechanism unique to federated learning. Similarly to distributed optimization, federated learning also uses differential privacy and information encryption to achieve privacy protection. Therefore, we focus on three main categories of privacy mechanisms in federated learning: model aggregation [60], homomorphic encryption [62] and differential privacy [63].

Model aggregation: Model aggregation is one of the most famous privacy mechanisms in federated learning, which trains a global model with distributed data but cannot directly use the data of each client. A natural way is to aggregate the model parameters of the distributed clients. Under the privacy mechanism of model aggregation, federated stochastic gradient descent (FEDSGD) and federated averaging (FEDAVG) are proposed in [60], which lay the foundation for model aggregation algorithms. FEDSGD directly migrates the traditional optimization algorithm, stochastic gradient descent (SGD), into the federated framework. This method has high computational efficiency, but it needs a large number of expensive communication rounds with the clients to update the global model's parameters. Unlike FEDSGD, FEDAVG expects each client to conduct multiple rounds of training on its local data and then transmit the model parameters to the server; the server then computes a weighted average of the parameters of the local models and integrates them into the global model. By distributing the computation to each client, FEDAVG greatly reduces the communication cost of the training process.
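
The following is a minimal numerical sketch of the FEDAVG idea described above: each client runs several local SGD steps on its own data, and the server forms a data-size-weighted average of the returned parameters. The linear regression task, client data, and hyper-parameters are illustrative assumptions, not taken from the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 5, 8
w_true = rng.normal(size=dim)

# Unbalanced, heterogeneous client datasets (never shared with the server).
client_data = []
for _ in range(n_clients):
    n = rng.integers(20, 200)
    X = rng.normal(size=(n, dim))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    client_data.append((X, y))

def local_update(w, X, y, epochs=5, lr=0.05, batch=16):
    """Several epochs of mini-batch SGD on one client's local data."""
    w = w.copy()
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w_global = np.zeros(dim)
sizes = np.array([len(y) for _, y in client_data], dtype=float)
for _ in range(20):                                   # communication rounds
    local_models = [local_update(w_global, X, y) for X, y in client_data]
    # Server aggregation step: data-size-weighted average of the local models.
    w_global = np.average(np.stack(local_models), axis=0, weights=sizes)

print("distance to ground truth:", np.linalg.norm(w_global - w_true))
```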

Most research results focus on developing algorithms to further reduce communication based on FEDAVG [64]–[68]. Neural networks are permutation invariant, that is, permuting the hidden units (and their parameters) leaves the network's function unchanged, so directly averaging the network parameters of each client coordinate by coordinate can be inefficient. Federated neural matching proposes the idea of matching the clients' neurons before averaging [64]. Federated matched averaging further extends federated neural matching from fully connected network modules to recurrent neural network (RNN) modules and convolution modules [65]. The convergence rate reduction caused by the client-drift issue in FEDAVG is considered in [66], where the authors use control variates to correct the clients' local updates. To achieve fewer updates of the global server network, the contribution of client capabilities is estimated in [68]. In addition, an attention mechanism is introduced in [67] to improve the model aggregation capability.

The model aggregation method uses plaintext parameters instead of the raw data to achieve privacy protection, and it has been proven that FEDAVG converges on some non-IID data problems. However, the plaintext parameters still carry the risk of privacy disclosure [69].

Homomorphic encryption: Data protection has always been a problem considered by traditional cryptography, which realizes data protection by encrypting data [70]. Only with the correct key can the raw data be recovered from the ciphertext. In this case, data can be stored anywhere without privacy disclosure. In cryptography, if the cost of cracking the key exceeds the gain, we say the scheme is computationally, or conditionally, secure. Homomorphic encryption is a form of encryption that allows users to perform calculations on encrypted data without decryption. The calculation results remain in encrypted form, and decrypting them produces the same output as performing the operation on the unencrypted data [71]. Assume that E is an encryption algorithm, M represents the information that needs to be encrypted, and ⊙ represents one of the calculation operations. If the following equation is satisfied:

E(M1) ⊙ E(M2) = E(M1 ⊙ M2),

the encryption algorithm E has a homomorphic property for the operation ⊙. Current mainstream homomorphic encryption algorithms include partially homomorphic encryption and fully homomorphic encryption [72]. Fully homomorphic encryption is powerful, but partially homomorphic encryption requires less performance overhead. Most homomorphic encryption works focus on how to combine it with federated learning in different scenarios [62], [73], [74].
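
As a toy illustration of this property, the snippet below uses textbook RSA with tiny (insecure) parameters, which is multiplicatively homomorphic; practical federated learning systems typically rely on vetted libraries and on additively homomorphic schemes such as Paillier:

```python
# Toy, insecure parameters for illustration only.
p, q = 61, 53                 # small primes
n = p * q                     # modulus 3233
e, d = 17, 2753               # public / private exponents (e*d = 1 mod (p-1)*(q-1))

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 5, 7
c1, c2 = encrypt(m1), encrypt(m2)

# Multiply the ciphertexts without ever decrypting the individual messages.
c_prod = (c1 * c2) % n
assert decrypt(c_prod) == (m1 * m2) % n   # homomorphic property for multiplication
print("E(5) * E(7) decrypts to", decrypt(c_prod))
```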

Under a vertical data distribution, combining additively homomorphic encryption with federated logistic regression to achieve encrypted training is proposed in [62]. Moreover, for cross-silo federated learning, the minimum homomorphic encryption requirement is discussed in [73]. In a more practical application background, such as different industrial devices, traditional k-means and AdaBoost combined with homomorphic encryption are discussed in [74].

Differential privacy: Recall that model aggregation carries the risk of privacy disclosure, and the use of homomorphic encryption can ensure data security, but even partially homomorphic encryption algorithms have the problem of large computational costs. To strike a compromise between computational cost and privacy protection, some researchers turn to differential privacy. Differential privacy is mainly used to prevent privacy disclosure caused by statistical information. Adding perturbation is the most commonly used method in differential privacy [75]. Each participant perturbs his/her own data and then sends the perturbed data to the server. In this case, no attacker can obtain individual private data by statistical calculation of the information on the communication channels. The primary perturbation mechanisms can be roughly divided into three categories: the Laplace mechanism, the exponential mechanism and the randomized response mechanism. Similarly to homomorphic encryption, federated learning combined with differential privacy also mainly considers applications in various scenarios: AirComp federated learning [76], blockchain-enabled federated learning [74] and mobile edge computing [63].
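
The following is a minimal sketch of the Laplace mechanism mentioned above, in which each participant perturbs a local statistic before reporting it; the query, sensitivity, and privacy budget are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    """Release value + Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# Example: each client reports a privatized count of how many of its records
# satisfy some predicate. Changing one record changes the count by at most 1,
# so the sensitivity of this query is 1.
true_counts = [42, 17, 63, 9]
epsilon = 0.5
noisy_counts = [laplace_mechanism(c, sensitivity=1.0, epsilon=epsilon) for c in true_counts]

print("noisy counts:", noisy_counts)
print("noisy total:", sum(noisy_counts), "vs true total:", sum(true_counts))
```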

Among the three privacy mechanisms mentioned above, the model aggregation algorithm is the simplest to implement but has the highest risk of privacy disclosure. Homomorphic encryption has the best confidentiality but is computationally complex. Differential privacy ranks second among the three methods in terms of both complexity and confidentiality, but the accuracy may decline due to perturbation. It is necessary to select appropriate privacy mechanisms according to specific application scenarios.

      III. GAME THEORY

In game theory, a game concerns the optimal strategic interactions between agents. Games provide complex systems science with a systematic approach for deciding a series of Pareto-efficient strategies in cooperative situations or the best strategy in non-cooperative situations, and they have attracted a great deal of attention in recent years. From the perspective of the timing of behavior, games can be divided into static games and dynamic games [14]. In static games, the participants choose simultaneously or, if not simultaneously, a later participant does not know what specific actions the earlier participants took; in dynamic games, the participants' actions have a sequential order, and a later participant can observe the actions chosen by the earlier participants.

      A. Cooperative Games

Cooperative games are also called positive-sum games and are concerned with participants forming alliances and working together to achieve their common goals [77]. Similarly to cooperative optimization, cooperative games also consider coordinating multiple individuals to achieve unified goals. The difference is that cooperative optimization focuses on the analysis of optimization goals, while cooperative games focus on the emergence of cooperative behaviors.

Static cooperative games: If two or more players agree to cooperate while playing games, they will help each other reduce their costs, as long as doing so does not decrease their own payoffs. This leads to the concept of absolutely cooperative solutions or Pareto optimal solutions in games. Such a solution ensures that no player can be made better off without making another player worse off. From this perspective, we explore the Pareto optimal solutions under the background of static, continuous cooperative games. With cost functions $J_1, \dots, J_N$, a strategy profile $x^*$ is Pareto optimal if there exists no other profile $x$ such that $J_i(x) \le J_i(x^*)$ for all players $i$, with strict inequality for at least one player.

To find Pareto optima, a classical method is to convert the problem into a single-objective optimization problem, that is, to obtain one point on the Pareto front in each run. There are some representative methods at present, such as Tchebycheff, weighted sum, perturbation, geometric mean, min–max, goal programming, and physical programming [79]. In contrast, evolutionary computation, which can obtain a large number of Pareto solutions in a single run, is the current mainstream method [80].
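
The following is a minimal sketch of the weighted-sum scalarization idea: each choice of weight turns the multi-objective problem into a single-objective one and, for convex objectives, yields one point on the Pareto front. The two toy cost functions are illustrative assumptions:

```python
import numpy as np

def J1(x):
    return (x - 1.0) ** 2        # objective 1 prefers x = 1

def J2(x):
    return (x + 1.0) ** 2        # objective 2 prefers x = -1

grid = np.linspace(-2.0, 2.0, 4001)
for w in np.linspace(0.0, 1.0, 11):
    scalarized = w * J1(grid) + (1.0 - w) * J2(grid)   # single-objective surrogate
    x_star = grid[np.argmin(scalarized)]                # one Pareto-optimal point per weight
    print(f"w={w:.1f}  x*={x_star:+.2f}  J1={J1(x_star):.3f}  J2={J2(x_star):.3f}")
```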

Dynamic cooperative games: Dynamic cooperative games refer to games in which the participants take actions over time to achieve a common goal. In general, we consider dynamic cooperative games from the perspective of Markov/stochastic games [17], which generalize Markov decision processes to multiple interacting decision-makers.

Multi-agent reinforcement learning (MARL) has achieved great success in cooperative Markov games, especially when modeling and analysis of the games are difficult. A lot of works exist in the fields of smart grids, traffic light control, home energy management and smart factories [81]–[83]. In cooperative MARL, each agent receives a shared reward, so the cooperative Markov game can be modeled as maximizing the accumulation of rewards, and the Pareto optimal solutions reduce to a unique optimal solution. The generation of the optimal strategy relies mainly on trial and error during the MARL process. This section provides a brief survey of deep cooperative MARL, where the function relations are obtained by training neural networks.

One naive approach to solving an MARL problem is to train all agents using a central controller. In this structure, each agent needs real-time data transmission with the central controller, and each action of the agents is strictly implemented under the requirements of the central controller. However, in most cases, communication is expensive, and it is difficult to achieve real-time communication. In addition, as the number of agents increases, the neural network used by deep RL will also grow greatly, which places high requirements on the capacity of the central controller. Another naive approach to solving an MARL problem is to train each agent with individual RL [84]. This method has the following good characteristics compared with the centralized method:

1) No real-time data transmission is needed, and decisions are generated by each agent itself;

2) The large-scale problem is avoided, since the neural networks are scattered across agents.

However, the strategy learning of each agent affects the strategy learning of the other agents. For a single agent, if the other agents are regarded as part of the environment, this environment is a dynamic one: it changes due to the actions of other agents, not just over time. This decentralized decision making can be modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [85], where action selection is under uncertainty and incomplete knowledge. In Dec-POMDPs, the training stability of single-agent RL is poor, which directly hinders the extension to MARL. Fig. 3 shows the structure transformation of three kinds of multi-agent interactions in the centralized training and decentralized execution (CTDE) framework, where Fig. 3(a) represents the use of a centralized controller to control all agents; Fig. 3(b) represents the realization of multi-agent control with limited communication; and Fig. 3(c) represents controlling each agent through an individual controller [81].

Fig. 3. Transformation of three representative information structures in the centralized training and decentralized execution framework. (a) Training process: Centralized setting with significant communication costs; (b) Execution process: Decentralized setting with networked agents based on limited communication bandwidth; (c) Execution process: Fully decentralized setting without communication.

      We will discuss cooperative MARL briefly through three main research areas:

      1) Learning a communication protocol [86];

      2) Learning a reward decomposition protocol [87];

3) Learning a critic network with global observations that guarantees stable training [88].

They all provide a reasonable breakthrough point for solving Dec-POMDP problems.

Communication protocol learning: Cooperation under limited communication bandwidth depends on a communication protocol to coordinate the agents. One appropriate method is to regard communication as an action of each agent, that is, a communication-action. In this case, each agent needs to decide what message to transmit, to which agent to transmit it, and what a received message means. As shown in Fig. 3(b), each robot has the ability to communicate with other robots. Using deep RL to realize the learning of communication protocols is first considered in [86]. At each time instant, agents need to simultaneously determine an environmental-action and a communication-action. The environmental-action affects the environmental transition, and the communication-action affects the environmental-action selection of other agents. As discussed before, Dec-POMDPs significantly affect the training stability of MARL algorithms; agents can use the information of other agents to stabilize individual RL training. Moreover, the authors focus on the CTDE setting: there is no limit to communication during training, but communication is only carried out via the limited-bandwidth channel during execution (testing).

After that, more efficient communication strategies have become an important part of communication protocol learning. It is found in [89] that agents communicating via continuous symbols outperform those communicating with discrete symbols, because the information corresponding to continuous communication is differentiable and can be trained efficiently with back-propagation. The harmful interaction effect between agents with continuous communication is considered in [9], where the received message may be detrimental to training efficiency in some cases, and a gating mechanism allowing agents to block their communication is proposed. Besides, the authors argue that it is difficult for each agent to know its individual contribution under a global reward, especially on large-scale tasks, and propose training with individualized rewards. Similarly to [9], the harmful interaction effect is also considered in [90], where the authors propose a targeted communication architecture for MARL, realized through an attention mechanism. The sender broadcasts a key that encodes the information to be transmitted, and the receiver then measures the correlation of information through this key to determine whether a communication connection is established. They also suggest a multi-round communication structure, in which agents perform multiple rounds of communication rather than just one before taking action. Somewhat differently, it is argued in [91] that learning communication-actions increases the exploration space, thereby reducing learning efficiency. Accordingly, a centralized RL neural network without communication is trained, and then the decentralized executable neural network with communication is derived from the trained centralized RL neural network via policy distillation.

In cooperative MARL, communication can make up for the lack of individual information caused by limited observation and stabilize the training. Unlike using a centralized controller, communication protocol learning is mainly considered in limited-communication-capacity scenarios, and agents learn communication protocols and cooperation strategies at the same time. However, in some scenarios, agents have no communication ability; the models trained by the following two methods can be used in such non-communication scenarios.

Reward decomposition protocol learning: Value decomposition methods such as VDN and QMIX learn a joint action value $Q_{tot}$ from the individual agent values $Q_i$: VDN simply sums the $Q_i$, while QMIX mixes them through a monotonic mixing network so that the individual-global-max (IGM) condition holds, i.e., the greedy joint action coincides with each agent's individually greedy action. However, the monotonic constraint is also too strict (the monotonic constraint is only a sufficient condition for the IGM condition). Much work has since focused on how to relax the constraints between $Q_{tot}$ and $Q_i$ (i.e., to make the constraints between $Q_{tot}$ and $Q_i$ a necessary and sufficient condition for the IGM condition). The QTRAN structure [93] guarantees a more general decomposition than QMIX and VDN with a complex loss design. The Weighted-QMIX (WQMIX) structure [94] places greater weight on better action combinations so as to prefer the optimal joint action. Unlike QTRAN and WQMIX, which realize the IGM condition through approximation, the QPLEX structure [95] realizes a strict construction of the IGM condition through a duplex dueling architecture. The Qatten structure [96] introduces a multi-head attention mechanism based on QMIX to improve the approximation ability of $Q_{tot}$. The previous work on value decomposition is mainly based on the deep Q-learning (QL) structure; in [97], value-decomposition methods are extended to the actor-critic framework.
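
The following is a minimal sketch of the monotonic mixing idea behind QMIX (with VDN as the special case of a plain sum), not a reproduction of the published implementations; dimensions and network sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Combine per-agent values Q_i into Q_tot with non-negative mixing weights,
    so that dQ_tot/dQ_i >= 0 (the monotonic sufficient condition for IGM).
    The weights are generated from the global state by hypernetworks, which is
    what lets centralized training use extra information while execution stays
    decentralized (each agent only needs its own Q_i)."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        # abs() enforces non-negative mixing weights -> monotonicity in each Q_i.
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # Q_tot

mixer = MonotonicMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 10))     # -> shape (8, 1)

# VDN corresponds to the special case Q_tot = sum_i Q_i:
# q_tot_vdn = agent_qs.sum(dim=1, keepdim=True)
```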

How to decompose the reward generated by the grand coalition to each participant has always been a hot topic in cooperative games. Traditional methods are inspired by marginal contributions, like the Shapley value [98]. Here, the reward decomposition protocol is obtained by neural networks based on the IGM condition. However, most of these works are constructed on Q-learning (QL) and cannot solve cooperation problems with continuous action spaces.

Global observation critic network learning: Unlike the reward decomposition protocol learning method, the global observation critic network learning method, based on actor-critic (AC) algorithms, is suitable for solving cooperation problems with continuous action spaces. The MARL instability problem is formulated in a probabilistic form: the environment is stationary once the joint action is known, i.e., $P(s' \mid s, a_1, \dots, a_N, \pi_1, \dots, \pi_N) = P(s' \mid s, a_1, \dots, a_N) = P(s' \mid s, a_1, \dots, a_N, \pi'_1, \dots, \pi'_N)$ for any $\pi_i \neq \pi'_i$.
Specifically, as long as we know the action of each agent, the state transition probability is determined. Similarly to most MARL methods, the authors take advantage of the CTDE framework, which allows additional information to stabilize training as long as it is not used at test time. Considering that traditional Q-learning cannot use different information in testing and training, they propose the multi-agent deep deterministic policy gradient (MADDPG) algorithm based on actor-critic policy gradient methods instead. MADDPG achieves training stability simply by adding additional information to the critic network. Most of the research on global observation critic network learning focuses on MADDPG and its extensions. In [99], it is shown that MADDPG fails in some simple observable cooperative tasks; a recurrent multi-agent actor-critic architecture for message passing and movement is then proposed to integrate RNNs into MADDPG. Moreover, considering the similarity between MARL and natural language processing problems, where the meaning of each agent's action (a word in a paragraph) depends on the entire system (the paragraph context), the self-attention mechanism is integrated into MADDPG in [100]. Other studies consider improving algorithm efficiency from the training process, such as modeling multi-agent policy transfer learning with option learning [101] and using distributed prioritized experience replay [102].
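
The following is a minimal sketch of the centralized-critic idea behind MADDPG: each agent keeps a decentralized actor that only sees its own observation, while its critic is conditioned on all agents' observations and actions during training. Dimensions and network sizes are illustrative assumptions, not those of the original implementation:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, obs):                  # decentralized execution
        return self.net(obs)

class CentralizedCritic(nn.Module):
    def __init__(self, n_agents: int, obs_dim: int, act_dim: int):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, all_obs, all_acts):    # used only during centralized training
        # all_obs: (batch, n_agents, obs_dim), all_acts: (batch, n_agents, act_dim)
        joint = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.net(joint)               # Q_i(s, a_1, ..., a_N)

n_agents, obs_dim, act_dim = 3, 8, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critics = [CentralizedCritic(n_agents, obs_dim, act_dim) for _ in range(n_agents)]

obs = torch.randn(32, n_agents, obs_dim)
acts = torch.stack([actors[i](obs[:, i]) for i in range(n_agents)], dim=1)
print(critics[0](obs, acts).shape)           # torch.Size([32, 1])
```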

The methods above extend single-agent RL algorithms to multi-agent RL in order to solve dynamic cooperative games. As seen from Table IV, current work on cooperative MARL tends to use a CTDE structure and an RNN module to stabilize the learning process. The information structure transformations of the three methods are shown in Fig. 3. Most of the recent works on cooperative MARL use additional global information to stabilize the training process, which is based on the structure shown in Fig. 3(a), but the trained models are deployed in different forms during execution. The first research area achieves cooperation by learning communication protocols, so the trained models need a limited communication bandwidth, which corresponds to the structure shown in Fig. 3(b). The other two research areas are not built in the context of communication, so the trained models can be applied in non-communication scenarios, which corresponds to the structure shown in Fig. 3(c).

      B. Non-Cooperative Games

Non-cooperative games are suitable for analyzing the behaviors of players who believe that their own payoffs conflict with the payoffs of others in games. In non-cooperative games, individual payoffs are maximized and individual costs are minimized to achieve the Nash equilibrium (NE) [14]. For example, the optimal strategies of capture-the-flag differential games are obtained by a geometric method in [19]. Unlike cooperative games, competitive behaviors between agents exist in non-cooperative games and agents cannot reach a binding agreement.

In multi-agent systems, we first define the NE: a strategy profile $x^* = (x_1^*, \dots, x_N^*)$ is an NE if, for every player $i$ with cost function $J_i$, $J_i(x_i^*, x_{-i}^*) \le J_i(x_i, x_{-i}^*)$ for all admissible $x_i$, where $x_{-i}^*$ denotes the strategies of all players other than $i$; that is, no player can reduce its cost by unilaterally changing its strategy.

Static non-cooperative games: In static games, players choose their strategies at the same time [103]. In this situation, each player does not know the choices of the others and needs information to make decisions. Static non-cooperative games can be divided into complete information games and incomplete information games according to the players' knowledge of other players. In complete information games, each player has accurate information about the other players' characteristics, strategy spaces and cost functions; otherwise, the game is one of incomplete information. The equilibrium concepts corresponding to these two kinds of games are the NE and the Bayesian Nash equilibrium (BNE), respectively.

      TABLE IV SUMMARY OF COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING METHOD

Complete information static games are the most basic type of non-cooperative games, such as the classic Prisoner's dilemma, Cournot games, and Chicken games. Static games with complete information can be seen in many different situations, such as auction bidding, network security, and transport efficiency. With the rapid development of networks, network security [105] has attracted more and more attention. Different from the general research on information security based on the relationship between attackers and defenders, and aiming at the strategy selection of an open testing service for medium-sized software vendors, a research method based on a static game with complete information among the software vendor, white hats and black hats is proposed. In addition, static games with complete information also play an important role in transportation. For example, collaborative control of connected autonomous vehicles has the potential to significantly reduce negative impacts on the environment while improving driving safety and traffic efficiency [106], [107].

When a strategy combination of the players in a static Bayesian game is a BNE, no player is willing to change his or her strategy, even if the change involves only one action of one type.
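
One standard way to state this formally (the notation is ours): with types $\theta_i$, beliefs $p(\theta_{-i} \mid \theta_i)$, and cost functions $J_i$, a strategy profile $\sigma^* = (\sigma_1^*, \dots, \sigma_N^*)$ is a BNE if, for every player $i$ and every type $\theta_i$,

```latex
\sigma_i^*(\theta_i) \in \arg\min_{a_i} \;
\mathbb{E}_{\theta_{-i} \sim p(\cdot \mid \theta_i)}
\big[\, J_i\big(a_i, \sigma_{-i}^*(\theta_{-i}); \theta_i, \theta_{-i}\big) \,\big],
```

i.e., every type of every player best responds in expectation over the other players' types.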

In recent years, researchers have carried out many studies on static games with incomplete information. With the increasing popularity of electric vehicles and the problem of limited battery life, an electricity transaction scheme based on Bayesian game pricing in the blockchain-supported internet of vehicles is proposed, and users' approximate satisfaction is verified in [109]. When Markov games with incomplete information, generated by moving-target defense strategies, are restricted to a given network state, they can be regarded as static games with incomplete information. In view of the problem that network defense based on incomplete information games ignores the types of defenders, which leads to an improper selection of defense strategies, an active defense strategy selection approach based on static Bayesian game theory is proposed in [110], where the authors construct a static Bayesian game model. Communication in 5G networks with an incomplete information game mechanism based on a repeated static game model is studied in [111], and the proposed mechanism provides a more realistic and universal model for resource allocation of communication between multi-unit devices. A memory-based game-theoretic defense method under incomplete information is presented in [112].

Dynamic non-cooperative games: From the perspective of cooperative games, the MARL methods mentioned above are also applicable to dynamic non-cooperative games [9], [88]. Specifically, the cooperative agents in a non-cooperative game form alliances. The agents within each alliance cooperate in the game and use the same cost function and MARL method, while the agents in different alliances behave non-cooperatively and use different cost functions or even different MARL methods. Finally, the best strategy for the non-cooperative game can be obtained.

From the perspective of non-cooperative games, it is appropriate to analyze the behaviors of agents whose payoffs completely conflict with the payoffs of others in the system. Therefore, the dynamics of multi-agent systems are modeled and the cost functions are characterized, and then the optimal strategy is obtained by minimizing the cost function and maximizing the reward function. Similarly to static non-cooperative games, this survey also introduces two aspects of dynamic non-cooperative games: complete information and incomplete information. The equilibrium concepts corresponding to these two are the subgame perfect Nash equilibrium and the perfect Bayesian Nash equilibrium (PBNE), respectively.

For dynamic non-cooperative games with complete information, take pursuit-evasion games [18] and reach-avoid games [113] as examples. In reach-avoid games, the attackers need to reach the target without being intercepted by the defenders, whereas the defenders need to capture the attackers before they reach the target. Moreover, capture-the-flag games are more complex than reach-avoid games in their game settings. In capture-the-flag games [19], [20], two opposing teams control their respective areas and protect the flags in those areas. Capture-the-flag games consist of two phases, pre-capture and post-capture, and each phase contains a number of potential competitive objectives. The victory condition is that agents enter the opponent's area to capture the flag and then return to their own area without being captured by the opponent.

      Fig. 4. The basic setting of capture-the-flag games [114].

In capture-the-flag games, players must consider not only the capture region but also the return region when choosing their actions. The basic setting of capture-the-flag games can be seen in Fig. 4. Numerical solutions to the Hamilton-Jacobi-Isaacs (HJI) equations are used to describe the initial conditions and strategies for each agent to win [114]. More generally, the winning strategies of the opponents in capture-the-flag games can be viewed as a solution to a zero-sum differential game. The goal of each team can be expressed as a value function that can be maximized or minimized by changing their strategies. With some processing, the value function can be calculated from the solution of the HJI equation. The payoff function of each team can be expressed as

where λ is the costate vector and u is the optimal control.

Moreover, there has been a considerable number of studies on capture-the-flag games up to now. In addition to solving the optimal strategies via the HJI equation, many researchers use geometric methods to consider the optimization problems of capture-the-flag games. For example, a two-stage synergistic optimal strategy and the winning regions for capture-the-flag games are proposed in [19], where the optimal capture point can be obtained by the Apollonius circle. In [116], for two speed ratios, an attack-zone method based on the Voronoi diagram and the Apollonius circle is proposed to construct barriers analytically.

In addition, dynamic games are often formulated in an extensive form [14], described as a game tree, which defines the order, information, alternatives and outcomes of dynamic games. In other words, extensive-form games generalize normal-form games by modeling sequential and simultaneous moves, as well as private information. The stable state of dynamic non-cooperative games can be solved by the NE and backward induction, named the subgame perfect NE [117]. At each stage of the game, backward induction determines the player's optimal strategy for the last step, i.e., the NE. Then, taking the last player's move as given, the best move for the penultimate player is determined. This process continues backward until the best action at each point of the game is determined. In other words, Nash backward induction [22] is used to determine the NE of each subgame of the original game. However, in practical games, such as Texas Hold'em and StarCraft II, the information of the game is often imperfect/incomplete. Therefore, Nash backward induction often fails to yield the NE [118].

For dynamic non-cooperative games with incomplete information, as mentioned above, with the help of the Harsanyi transformation, incomplete information games can be converted into equivalent games of complete but imperfect information. In that way, we only need to consider the case of imperfect-information dynamic non-cooperative games.

In a word, regret matching mainly focuses on every information set in a game tree, modeling each relevant player, and iteratively solving the NE. That is to say, each player only observes the received payoffs and chooses a better reply with respect to the average realized payoff; there is no need to know the number of players or the cost functions in the game. On the contrary, from the perspective of modeling the game globally, fictitious play (FP) [127] is a good choice. Similarly to regret matching, FP also finds or approaches the NE in an iterative manner. Specifically, in FP, each player observes the other players' actions and chooses the best response to his belief. It is necessary for players to acquire knowledge of the cost functions.
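
To make the regret-matching update concrete, the following is a minimal sketch on a two-player zero-sum normal-form game (matching pennies): each player accumulates regrets for its actions, plays in proportion to positive regret, and the average strategies converge to an NE. The game matrix and the number of iterations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Payoff of player 0 (row) in matching pennies; player 1 receives the negative.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
n_actions = 2
regrets = [np.zeros(n_actions), np.zeros(n_actions)]
strategy_sums = [np.zeros(n_actions), np.zeros(n_actions)]

def current_strategy(regret):
    """Play actions with probability proportional to positive cumulative regret."""
    positive = np.maximum(regret, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(n_actions, 1.0 / n_actions)

for _ in range(20000):
    strategies = [current_strategy(r) for r in regrets]
    actions = [rng.choice(n_actions, p=s) for s in strategies]
    payoffs = [A[actions[0], actions[1]], -A[actions[0], actions[1]]]
    # Counterfactual payoff of each alternative action, holding the opponent fixed.
    alternatives = [A[:, actions[1]], -A[actions[0], :]]
    for i in range(2):
        regrets[i] += alternatives[i] - payoffs[i]
        strategy_sums[i] += strategies[i]

avg = [s / s.sum() for s in strategy_sums]
print("average strategies:", avg)   # both approach the uniform (0.5, 0.5) equilibrium
```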

FP is a method of learning the NE in normal-form games by playing the game repeatedly and choosing the best response to the other players' average behavior. Fictitious self-play (FSP) [24] extends FP to extensive-form games with a sampling-based machine learning approach, thereby mitigating the curse of dimensionality. Combining neural network function approximation with FSP, neural fictitious self-play (NFSP) [128] is proposed. NFSP consists of two neural networks: one is the best response policy network, learning an approximate best response to the historical behaviors of other players by RL; the other is the average policy network, trained from memorized behaviors of the best response policy by supervised learning. It is worth noting that NFSP is the first end-to-end RL method to approach the NE in imperfect information games without any prior knowledge. The strategies are iteratively updated to approximate the PBNE.

In order to improve the stability and speed of RL training, asynchronous NFSP [129] is proposed, which shares the deep Q-network (DQN) and the supervised learning network among players in parallel so as to accumulate and compute gradients; in that way, the data storage memory is smaller than in NFSP. With the help of the policy gradient method, self-play pretrains the model to improve the scalability of NFSP. To improve the RL process, a situation evaluation reward in each round and a future reward are proposed. In order to improve efficiency in large-scale extensive-form games, network security games are combined with NFSP in [130], enabling NFSP with high-level actions. The best response policy network is based on a deep recurrent neural network designed for the natural language domain to define the noticeable legal actions of network security games. For the average policy network, the proposed metric-based few-shot learning [131] utilizes the information contained in graphs to transfer actions and states between metrics.

Mean field games (MFGs) provide an analytical route to the NE for large-scale dynamic non-cooperative games [27], in which each agent interacts with the aggregate (mean-field) effect of the population rather than with every other agent individually. In MFGs, an important concept is the Nash certainty equivalence (NCE) principle introduced in [132], which provides a way to obtain a specific estimate of the mean-field term, thus providing an approximated PBNE.
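
In the standard continuous-time formulation (which differs in technical details from the stochastic setting of [132], [133]), the mean-field equilibrium is characterized by a coupled forward-backward system: a Hamilton-Jacobi-Bellman equation solved backward by a representative agent against the population distribution $m$, and a Fokker-Planck equation propagating $m$ forward under the resulting optimal control:

```latex
\begin{aligned}
-\partial_t u - \nu \Delta u + H(x, \nabla u) &= f(x, m),\\
\partial_t m - \nu \Delta m - \operatorname{div}\!\big(m\, \nabla_p H(x, \nabla u)\big) &= 0,\\
u(T, x) &= g\big(x, m(T)\big), \qquad m(0, x) = m_0(x).
\end{aligned}
```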

Since MFG theory provides a powerful methodology for obtaining decentralized control strategies for large-scale dynamic non-cooperative multi-agent systems, it has received considerable attention since the pioneering works [132], [133]. By introducing a major agent, MFGs for major-minor agent systems are studied in [134]. Risk-sensitive and robust performance indexes are considered in [135]. There are also a few works focusing on discrete-time settings, such as [136].

Applications of dynamic non-cooperative games: The difficulty level of dynamic non-cooperative games is ranked from easy to difficult as follows:

      I) Games with complete information, such as AlphaGo and AlphaZero;

II) Games with incomplete information and a single goal, such as Texas Hold'em;

III) Games with incomplete information and complex goals, such as StarCraft II.

Examples of these three categories are shown in Fig. 5.

In Category I, we take Go as an example. Due to the challenges of a huge search space and sparse rewards, AlphaGo [137] utilizes supervised learning from human experts and a self-play RL framework to train deep neural networks, and it was the first time that an AI defeated a professional player in Go. The self-play RL framework is designed to generate new datasets and directly approximate the optimal value function. Then, AlphaGo Zero [138] improves performance in Go by using self-play RL from scratch, without human data or any domain knowledge other than the basic rules. After that, AlphaZero [28] was put forward, which succeeded in generalizing the structure from Go to other zero-sum games of complete and perfect information, such as chess and shogi, by means of a general self-play RL algorithm. MuZero [139] extends AlphaZero to a wider setting, including a single agent and non-zero rewards for intermediate time steps.

In Category II, we take Texas Hold'em as an example. Libratus, put forward by Brown and Sandholm in [140], focuses on three application-independent algorithms and defeated four top human specialists in no-limit two-player Texas Hold'em. The first algorithm computes the blueprint strategy, with the help of Monte Carlo CFR [123], which simplifies the action space by skipping suboptimal actions over repeated iterations of the game. The second algorithm performs real-time calculation of the subgame strategies based on the current game state without conflicting with the blueprint strategy. The third algorithm aims at improving the blueprint strategy by absorbing the actual actions of the opponents. Then, Pluribus [141] conducted a landmark extension from two-player Texas Hold'em to the six-player mode. In order to improve the efficiency of Monte Carlo CFR in Pluribus, during the training stage of [29], linear weighted discounts are introduced in the early stages, and negative-regret behaviors are strategically pruned in the later stages.

In Category III, we take StarCraft II as an example. AlphaStar [30] is designed for StarCraft II and is composed of four leagues, named main agents, main exploiters, league exploiters and past players, respectively. Main agents are trained to defeat all players, main exploiters are designed to find the weaknesses of the main agents on the basis of human information, and league exploiters aim at finding the weaknesses of the system, training with the historical data stored by past players. Prioritized fictitious self-play (PFSP) is proposed to calculate the best response to the mixture of previous strategies, so that main agents give priority to opponents against which their winning rates are low. The performance of AlphaStar reaches the Grandmaster level in all three StarCraft II races. After that, StarCraft Commander [142] optimizes AlphaStar in terms of computational resources and robustness, and is able to achieve most of AlphaStar's performance with fewer replays and smaller datasets. In addition, during the RL training stage, branching agents are proposed to improve training efficiency and balance the strength and diversity of strategies.

As seen from Table V, the current work on dynamic non-cooperative games mainly focuses on discrete action spaces with complete information in situations of perfect competition; more difficult game types and more complex situations remain to be considered.

Fig. 5. Difficulty category examples of non-cooperative games. (a) AlphaZero belongs to Category I [28]; (b) Pluribus belongs to Category II [29]; (c) AlphaStar belongs to Category III [30].

      TABLE V SUMMARY OF DYNAMIC NON-COOPERATIVE GAMES

      IV. FUTURE WORK

      A. Cooperative Optimization

Non-convex: Distributed optimization for solving non-convex optimization problems has attracted considerable attention in many fields. In current research, distributed algorithms have been developed to solve unconstrained/constrained non-convex DOO issues over undirected/directed graphs [51], [144]. It is worth applying the existing distributed optimization algorithms to non-convex optimization problems with theoretical guarantees, switching the goal from finding a globally optimal solution to finding a stationary point or a local extremum. In that way, the polynomial dependence of the complexity of first-order methods on the problem dimension and the desired accuracy can be obtained by polynomial approximation, thereby pursuing convergence to the global optimum at the algorithmic level.

Heterogeneity: Heterogeneity greatly hinders integration in federated optimization, which contains only one central server integrating data from different clients. However, inspired by existing research on data heterogeneity [145], a multi-center architecture is expected to be a solution to client heterogeneity. Specifically, different local central servers are used to integrate different types of clients. Then, according to the actual application scenarios, it is determined whether multiple local central servers need to be integrated into a global central server.

Security: From the perspective of privacy protection, the security of the cooperative optimization system is improved by differential privacy, information encryption and model aggregation. With the development of quantum computers, traditional encryption technologies will no longer be secure. Therefore, quantum encryption is expected to become the mainstream encryption technology for privacy protection in the process of cooperative optimization. Recently, there has been some applied research on quantum homomorphic encryption [146]. In the future, extending quantum homomorphic encryption to cooperative optimization is a promising direction.

Lightweight network: A lightweight network scale is one way to ensure security and robustness. In addition to end-to-end training using learning methods, such as neural architecture search and network pruning, combining learning methods with analytical methods has recently been considered to reduce network complexity while maintaining model robustness and security. For example, the physics-informed neural network (PINN) [147] is able to approximate and solve complex partial differential equations using deep neural networks, RL, etc., while adding interpretability to the learning methods. In the future, it is an interesting idea to utilize PINNs for optimization problems that are easy to model but difficult to solve.
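As a toy example of the PINN idea, the sketch below (written with PyTorch; the differential equation, loss weighting, and names are our own choices, not the formulation in [147]) penalizes the residual of a simple ODE at collocation points together with its initial condition, so the physical law enters the training loss directly.

import torch

def pinn_loss(model, x_interior, x0, u0=1.0):
    """Physics-informed loss for the toy ODE u'(x) + u(x) = 0 with u(0) = u0.

    model      : a differentiable network mapping x -> u(x)
    x_interior : collocation points where the ODE residual is penalized
    The total loss is the mean squared ODE residual plus the squared error
    at the initial condition, i.e., physics acts as a soft constraint.
    """
    x = x_interior.clone().requires_grad_(True)
    u = model(x)
    du_dx = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    residual = du_dx + u                    # the differential equation u' + u = 0
    loss_pde = (residual ** 2).mean()
    loss_ic = ((model(x0) - u0) ** 2).mean()
    return loss_pde + loss_ic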

      B. Cooperative/Non-Cooperative Games

Since non-cooperative games may include some cooperative behaviors, cooperative and non-cooperative games share overlapping challenges. Therefore, for future work, the ideas below are not limited to either cooperative games or non-cooperative games alone.

Heterogeneity: For many-to-many heterogeneous cooperative games, such as matching games [148], the current literature [149] mainly focuses on multi-connectivity, multi-channel and multi-radio conditions. An interesting future direction is to further take into account heterogeneous information from system dynamics, such as acceleration and jerk constraints, in order to achieve a more adequate allocation of limited resources among heterogeneous agents.

Interpretability: Ante-hoc interpretability makes the model itself interpretable. In Bayesian games, due to the assumption of conditional independence, existing work [21] transforms the decision-making process of the game models into probability calculations. In addition, the attention mechanism makes a model well interpretable by focusing on the objects of interest in games. In the future, introducing transformers to allocate limited resources and assist the decision-making process of games is a promising direction. Post-hoc interpretability uses interpretability techniques to explain a trained model. In cooperative games, few researchers have explored why the CTDE structure can achieve improved results in various domains. A meaningful task is to prove that the iterative process of CTDE is a contraction mapping, and then prove that the CTDE algorithm converges to the optimal strategy through Banach’s fixed point theorem. In addition, PINNs endow deep neural networks with interpretability that is otherwise unavailable, on the basis of mathematical knowledge and physical laws. Encouragingly, games can provide interpretable support for research in other fields. Inspired by Eigengame [150], it is an attractive idea to abstract the goals of non-game problems, construct game models for them, and choose the corresponding optimal strategy for some of the agents involved. In this way, the analysis method is changed and the interpretability is improved.
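To make the suggested proof route explicit, the conditions to be verified are the standard contraction and fixed-point statements below, where $\mathcal{T}$ would denote the CTDE update operator on a complete metric space $(X,d)$ of policies or value functions; whether a given CTDE algorithm actually satisfies the contraction condition is precisely the open question.

\begin{align}
d\!\left(\mathcal{T}\pi,\ \mathcal{T}\pi'\right) &\le \gamma\, d(\pi, \pi'), \qquad 0 \le \gamma < 1,\ \ \forall\, \pi, \pi' \in X, \\
\pi_{k+1} = \mathcal{T}\pi_{k} \ &\Longrightarrow\ d\!\left(\pi_{k}, \pi^{\ast}\right) \le \gamma^{k}\, d\!\left(\pi_{0}, \pi^{\ast}\right) \xrightarrow[k \to \infty]{} 0,
\end{align}

where $\pi^{\ast}$ is the unique fixed point of $\mathcal{T}$ guaranteed by Banach’s fixed point theorem.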

Large scale: As the number of agents in multi-agent cooperative/competitive systems grows, the action combination space increases exponentially, which makes it more difficult to find the NE or globally optimal solutions. By means of hierarchical networks, such as hierarchical RL [151] and hierarchical graph attention networks [152], the action combination space is dispersed across layers, and the cooperative/competitive relationships between agents can then be described. Therefore, the multi-dimensional game equilibrium can be accurately obtained through end-to-end training. In addition, since each agent needs global information during the training process of CTDE algorithms, the communication cost becomes a challenge as the number of agents increases. Inspired by federated learning, which is widely used to optimize a large number of client models, MARL combined with federated learning is expected to reduce the communication costs in the future. Solutions to non-cooperative games will be upgraded to MFGs as the number of agents scales from massive to infinite. The existing literature focuses on, for example, mean-field MARL [153], major-minor agent systems [134], and risk-sensitive and robust performance [135]. A feasible direction for MFGs in the future is to explore MFGs with a finite number of agents.
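For intuition on how the mean-field approximation tames the exponential joint-action space, a representative mean-field Q-learning update (shown here in the spirit of mean-field MARL; the exact form in [153] may differ) replaces the joint action of an agent’s neighbors by their mean action:

\begin{align}
\bar{a}^{j} &= \frac{1}{|\mathcal{N}(j)|} \sum_{k \in \mathcal{N}(j)} a^{k}, \\
Q^{j}_{t+1}\!\left(s, a^{j}, \bar{a}^{j}\right) &= (1-\alpha)\, Q^{j}_{t}\!\left(s, a^{j}, \bar{a}^{j}\right) + \alpha \left[ r^{j} + \gamma\, v^{j}_{t}(s') \right],
\end{align}

where $\mathcal{N}(j)$ is the neighbor set of agent $j$, $\alpha$ is the learning rate, and $v^{j}_{t}(s')$ is the mean-field value of the next state. Each agent thus only learns a value over its own action and one mean action, instead of over the full joint action of all other agents.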

Few-shot: Games in many fields, such as the economy, public opinion and energy, lack consideration of information factors in an open environment. Due to the low-texture and highly dynamic characteristics of the open environment, environmental data samples are scarce and difficult to collect. Therefore, games often involve incomplete/imperfect information, such as deceptive information, uncertain interaction information, asymmetric networks, etc. Incomplete/imperfect information makes agents unable to identify cooperative/competitive intentions and makes it difficult to predict the actions of other agents, thereby affecting the formulation of game strategies. The existing literature [154] utilizes few-shot learning to improve the accuracy of perception and decision-making of game models in real scenes. On this basis, inverse-RL-based MFGs [155] can be constructed for handling intent identification and the NE solution, and evolutionary game theory [156] can be used to predict changes in agent actions in real time. In the future, it will be a popular idea to solve few-shot problems in games from the perspective of transferability, i.e., game-setting transfer, framework transfer and strategy transfer. From an algorithmic perspective, meta learning and its variations are good choices for transferability, based on the latest meta-learning ideas, i.e., graph neural network meta learning [157] and domain-augmented meta learning [158], covering migration from communication topology transfer to game environment transfer. From an environmental perspective, a universal decision intelligence platform, such as OpenDILab [159], can integrate existing decision-making AI algorithms to achieve transferability and scalability across various computing scales.

      C. Applications

In the field of integrated energy networks, it is worth using distributed RL to design multi-dimensional cost functions for different types of energy sources and energy-consuming devices with different characteristics. In addition, for optimization problems lacking prior knowledge and involving multiple uncertainties, federated optimization is a feasible way to achieve real-time scheduling and enhance the privacy protection of information from various data sources. Furthermore, redesigning the control inputs is one idea for ensuring linear convergence rates when system dynamics are incorporated into the optimization objective.

In the field of information security, from the perspective of data integrity and availability, the problems of network attacks in non-cooperative games are considered. It is suggested to break through the limitations of incomplete/imperfect information conditions by means of intention recognition and aggressive behavior prediction, and to utilize evolutionary game theory to solve non-cooperative games with cyberattacks. From the perspective of data security, privacy protection problems are addressed by constructing a federated optimization framework that trains machine learning algorithms without directly using client data.

In the field of biomedicine, in terms of molecular synthesis, considering the optimization of the predictive molecular model under multi-objective and multi-constraint conditions, a cooperative game model can be constructed to obtain the target macromolecular structure with maximum utility. In order to overcome the bottleneck of large-scale data transmission in CTDE, it is a feasible idea to consider federated MARL, which can ensure the security and transmission efficiency of the macromolecule database.

      V. CONCLUSIONS

In this survey, we analyze cooperative and competitive multi-agent systems, focusing on the progress in optimization and decision-making. Starting with multi-agent cooperative optimization, we survey distributed optimization and federated optimization with an emphasis on online optimization and privacy protection. Then we consider the possible non-cooperative behaviors between agents during the optimization process. Therefore, the focus of our survey ranges from cooperative optimization to cooperative/non-cooperative games. We survey the relevant articles on multi-agent cooperative/non-cooperative games from static and dynamic perspectives. It is worth noting that for static/dynamic non-cooperative games, we also provide separate summaries from the two perspectives of complete/incomplete information. Finally, we put forward future work on cooperative optimization, cooperative/non-cooperative games, and their applications.
