
    多智能體博弈學(xué)習(xí)研究進(jìn)展

    2024-11-22  羅俊仁, 張萬(wàn)鵬, 蘇炯銘, 袁唯淋, 陳璟

    摘 要:

    隨著深度學(xué)習(xí)和強(qiáng)化學(xué)習(xí)而來(lái)的人工智能新浪潮,為智能體從感知輸入到行動(dòng)決策輸出提供了“端到端”解決方案。多智能體學(xué)習(xí)是研究智能博弈對(duì)抗的前沿課題,面臨著對(duì)抗性環(huán)境、非平穩(wěn)對(duì)手、不完全信息和不確定行動(dòng)等諸多難題與挑戰(zhàn)。本文從博弈論視角入手,首先給出了多智能體學(xué)習(xí)系統(tǒng)組成,進(jìn)行了多智能體學(xué)習(xí)概述,簡(jiǎn)要介紹了各類多智能體學(xué)習(xí)研究方法。其次,圍繞多智能體博弈學(xué)習(xí)框架,介紹了多智能體博弈基礎(chǔ)模型及元博弈模型、均衡解概念和博弈動(dòng)力學(xué),以及學(xué)習(xí)目標(biāo)多樣、環(huán)境(對(duì)手)非平穩(wěn)、均衡難解且易變等挑戰(zhàn)。再次,全面梳理了多智能體博弈策略學(xué)習(xí)方法,包括離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法。最后,從智能體認(rèn)知行為建模與協(xié)同、通用博弈策略學(xué)習(xí)方法和分布式博弈策略學(xué)習(xí)框架共3個(gè)方面探討了多智能體學(xué)習(xí)的前沿研究方向。

    關(guān)鍵詞:

    博弈學(xué)習(xí); 多智能體學(xué)習(xí); 元博弈; 在線無(wú)悔學(xué)習(xí)

    中圖分類號(hào):

    TP 391

    文獻(xiàn)標(biāo)志碼: A    DOI: 10.12305/j.issn.1001-506X.2024.05.17

    Research progress of multi-agent learning in games

    LUO Junren, ZHANG Wanpeng, SU Jiongming, YUAN Weilin, CHEN Jing*

    (College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China)

    Abstract:

    The new wave of artificial intelligence brought about by deep learning and reinforcement learning provides an “end-to-end” solution for agents from perception input to action decision-making output. Multi-agent learning is a frontier subject in the field of intelligent game confrontation, and it faces many problems and challenges such as adversarial environments, non-stationary opponents, incomplete information and uncertain actions. This paper starts from the perspective of game theory. It first presents the composition of a multi-agent learning system, gives an overview of multi-agent learning, and briefly introduces the classification of various multi-agent learning research methods. Secondly, based on the multi-agent learning framework in games, it introduces the basic multi-agent game models and meta-game models, game solution concepts and game dynamics, as well as challenges such as diverse learning objectives, non-stationary environments (opponents), and equilibria that are hard to compute and prone to change. Then, multi-agent game strategy learning methods, including offline game strategy learning methods and online game strategy learning methods, are comprehensively sorted out. Finally, some frontiers of multi-agent learning are discussed from the three aspects of agent cognitive behavior modelling and collaboration, general game strategy learning methods, and distributed game strategy learning frameworks.

    Keywords:

    learning in games; multi-agent learning; meta-game; online no-regret learning

    0 引 言

    人類社會(huì)生活中存在著各種不同形式的對(duì)抗、競(jìng)爭(zhēng)和合作,其中對(duì)抗一直是人類文明發(fā)展史上最強(qiáng)勁的推動(dòng)力。正是由于個(gè)體與個(gè)體、個(gè)體與群體、群體與群體之間復(fù)雜的動(dòng)態(tài)博弈對(duì)抗演化,才不斷促進(jìn)人類智能升級(jí)換代[1]。人工智能技術(shù)的發(fā)展呈現(xiàn)出計(jì)算、感知和認(rèn)知3個(gè)階段[2],大數(shù)據(jù)、大算力和智能算法為研究認(rèn)知智能提供了先決條件。從人工智能技術(shù)發(fā)展的角度來(lái)看,計(jì)算智能主要以科學(xué)運(yùn)算、邏輯處理、統(tǒng)計(jì)查詢等形式化規(guī)則化運(yùn)算為核心,能存會(huì)算會(huì)查找。感知智能主要以圖像理解、語(yǔ)音識(shí)別、機(jī)器翻譯為代表,基于深度學(xué)習(xí)模型,能聽會(huì)說(shuō)能看會(huì)認(rèn)。認(rèn)知智能主要以理解、推理、思考和決策為代表,強(qiáng)調(diào)認(rèn)知推理與自主學(xué)習(xí)能力,能理解會(huì)思考決策。博弈智能作為決策智能的前沿范式,是認(rèn)知智能的高階表現(xiàn)形式,其主要以博弈論為理論支撐,以反事實(shí)因果推理、可解釋性決策為表現(xiàn)形式,強(qiáng)調(diào)將其他智能體(隊(duì)友及對(duì)手)納入己方的決策環(huán)進(jìn)行規(guī)則自學(xué)習(xí)、博弈對(duì)抗演化、可解釋性策略推薦等。當(dāng)前,博弈智能已然成為人工智能領(lǐng)域的前沿方向、通用人工智能的重要問(wèn)題。

    多智能體系統(tǒng)一般是指由多個(gè)獨(dú)立的智能體組成的分布式系統(tǒng),每個(gè)智能體均受到獨(dú)立控制,但需在同一個(gè)環(huán)境中與其他智能體交互[3]。Shoham等人[4]將多智能體系統(tǒng)定義為包含多個(gè)自治實(shí)體的系統(tǒng),這些實(shí)體要么擁有不同的信息,要么擁有不同的利益,或兩者兼有。Muller等人[5]對(duì)由多智能體系統(tǒng)技術(shù)驅(qū)動(dòng)的各個(gè)領(lǐng)域的152個(gè)真實(shí)應(yīng)用進(jìn)行了分類總結(jié)和分析。多智能體系統(tǒng)是分布式人工智能的一個(gè)重要分支,主要研究智能體之間的交互通信、協(xié)調(diào)合作、沖突消解等方面的內(nèi)容,強(qiáng)調(diào)多個(gè)智能體之間的緊密群體合作,而非個(gè)體能力的自治和發(fā)揮。智能體之間可能存在對(duì)抗、競(jìng)爭(zhēng)或合作關(guān)系,單個(gè)智能體可通過(guò)信息交互與友方進(jìn)行協(xié)調(diào)配合,一同對(duì)抗敵對(duì)智能體。由于每個(gè)智能體均能夠自主學(xué)習(xí),多智能體系統(tǒng)通常表現(xiàn)出涌現(xiàn)性能力。

    當(dāng)前,多智能體系統(tǒng)模型常用于描述共享環(huán)境下多個(gè)具有感知、計(jì)算、推理和行動(dòng)能力的自主個(gè)體組成的集合,典型應(yīng)用包括各類機(jī)器博弈、拍賣、在線平臺(tái)交易、資源分配(路由包、服務(wù)器分配)、機(jī)器人足球、無(wú)線網(wǎng)絡(luò)、多方協(xié)商、多機(jī)器人災(zāi)難救援、自動(dòng)駕駛和無(wú)人集群對(duì)抗等。其中,基于機(jī)器博弈(計(jì)算機(jī)博弈)的人機(jī)對(duì)抗,作為圖靈測(cè)試的典型范式[6],是研究人工智能的果蠅[7]。多智能體系統(tǒng)被廣泛用于解決分布式?jīng)Q策優(yōu)化問(wèn)題,其成功的關(guān)鍵是高效的多智能體學(xué)習(xí)方法。多智能體學(xué)習(xí)主要研究由多個(gè)自主個(gè)體組成的多智能體系統(tǒng)如何通過(guò)學(xué)習(xí)探索、利用經(jīng)驗(yàn)提升自身性能的過(guò)程[8]。如何通過(guò)博弈策略學(xué)習(xí)提高多智能體系統(tǒng)的自主推理與決策能力是人工智能和博弈論領(lǐng)域面臨的前沿挑戰(zhàn)。

    1 多智能體學(xué)習(xí)簡(jiǎn)介

    多智能體學(xué)習(xí)是人工智能研究的前沿?zé)狳c(diǎn)。從第三次人工智能浪潮至今,社會(huì)各界對(duì)多智能體學(xué)習(xí)的相關(guān)研究產(chǎn)生了極大的興趣。多智能體學(xué)習(xí)在人工智能、博弈論、機(jī)器人和心理學(xué)領(lǐng)域得到了廣泛研究。面對(duì)參與實(shí)體數(shù)量多、狀態(tài)空間規(guī)模大、實(shí)時(shí)決策高度復(fù)雜等現(xiàn)實(shí)問(wèn)題,多智能體如何建模變得困難,手工設(shè)計(jì)的智能體交互行為遷移性比較弱。相反,基于認(rèn)知行為建模的智能體能夠從與環(huán)境及其他智能體的交互經(jīng)驗(yàn)中學(xué)會(huì)有效地提升自身行為。在學(xué)習(xí)過(guò)程中,智能體可以學(xué)會(huì)與其他智能體進(jìn)行協(xié)調(diào),學(xué)習(xí)如何選擇自身行為,并推斷其他智能體如何選擇行為以及其目標(biāo)、計(jì)劃和信念是什么等。

    本文從博弈論視角分析多智能體學(xué)習(xí),第1節(jié)簡(jiǎn)要介紹了多智能體學(xué)習(xí),主要包括多智能體系統(tǒng)組成、多智能體學(xué)習(xí)概述、多智能體學(xué)習(xí)研究分類;第2節(jié)重點(diǎn)介紹了多智能體博弈學(xué)習(xí)框架,包括博弈基礎(chǔ)模型及元博弈模型、博弈解概念及博弈動(dòng)力學(xué)、多智能體博弈學(xué)習(xí)的挑戰(zhàn);第3節(jié),全面梳理了多智能體博弈策略學(xué)習(xí)方法,重點(diǎn)剖析了策略學(xué)習(xí)框架、離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法;第4節(jié)著重從智能體認(rèn)知行為建模與協(xié)同、通用博弈策略學(xué)習(xí)方法和分布式博弈策略學(xué)習(xí)框架共3個(gè)方面展望了多智能體學(xué)習(xí)研究前沿;最后對(duì)全文進(jìn)行了總結(jié)。整體架構(gòu)如圖1所示。

    近年來(lái),伴隨著深度學(xué)習(xí)(感知領(lǐng)域)和強(qiáng)化學(xué)習(xí)(決策領(lǐng)域)的深度融合發(fā)展,多智能體學(xué)習(xí)方法在機(jī)器博弈領(lǐng)域取得了長(zhǎng)足進(jìn)步,如圖2所示,AlphaGo[9]和Muzero[10],DeepStack[11]及德州撲克[12],DeltaDou[13]及斗地主[14],麻將[15],AlphaStar[16]及星際爭(zhēng)霸[17],OpenAI Five[18]及絕悟[19],AlphaWar[20]及戰(zhàn)顱[21],ALPHA[22]及AlphaDogFight空戰(zhàn)[23]等人工智能在各類比賽中獲得較好名次或在人機(jī)對(duì)抗比賽中戰(zhàn)勝了人類頂級(jí)選手。

    1.1 多智能體學(xué)習(xí)系統(tǒng)組成

    多智能體學(xué)習(xí)系統(tǒng)共包含四大模塊:環(huán)境、智能體、交互機(jī)制和學(xué)習(xí)方法。當(dāng)前針對(duì)多智能體學(xué)習(xí)的相關(guān)研究主要是圍繞這四部分展開的,如圖3所示。

    環(huán)境模塊由狀態(tài)空間、動(dòng)作空間、轉(zhuǎn)換函數(shù)和獎(jiǎng)勵(lì)函數(shù)構(gòu)成。狀態(tài)空間指定單個(gè)智能體在任何給定時(shí)間可以處于的一組狀態(tài);動(dòng)作空間是單個(gè)智能體在任何給定時(shí)間可用的一組動(dòng)作;轉(zhuǎn)換函數(shù)(或環(huán)境動(dòng)力學(xué))指定了環(huán)境狀態(tài)如何隨各智能體(或智能體子集)在給定狀態(tài)下執(zhí)行的動(dòng)作而(可能隨機(jī)地)改變;獎(jiǎng)勵(lì)函數(shù)根據(jù)狀態(tài)-行動(dòng)轉(zhuǎn)換結(jié)果給出獎(jiǎng)勵(lì)反饋信號(hào)。智能體模塊需要定義其與環(huán)境的通信關(guān)系(用于獲取觀測(cè)狀態(tài)和輸出指定動(dòng)作)、智能體之間的行為通信方式、表征環(huán)境狀態(tài)偏好的效用函數(shù)以及選擇行動(dòng)的策略。學(xué)習(xí)模塊由學(xué)習(xí)實(shí)體、學(xué)習(xí)目標(biāo)、學(xué)習(xí)經(jīng)驗(yàn)數(shù)據(jù)和學(xué)習(xí)更新規(guī)則定義。學(xué)習(xí)實(shí)體需要指定學(xué)習(xí)發(fā)生在單智能體還是多智能體級(jí)別;學(xué)習(xí)目標(biāo)描述了正在學(xué)習(xí)的任務(wù)目標(biāo),通常表現(xiàn)為目標(biāo)或評(píng)價(jià)函數(shù);學(xué)習(xí)經(jīng)驗(yàn)數(shù)據(jù)描述了學(xué)習(xí)實(shí)體可以獲得哪些信息作為學(xué)習(xí)的基礎(chǔ);學(xué)習(xí)更新定義了在學(xué)習(xí)過(guò)程中學(xué)習(xí)實(shí)體的更新規(guī)則。交互機(jī)制模塊定義了智能體相互交互多長(zhǎng)時(shí)間、與哪些其他智能體交互,及其對(duì)其他智能體的觀察;交互機(jī)制還規(guī)定了任何給定智能體之間交互的頻率(或數(shù)量),及其動(dòng)作是同時(shí)選擇還是順序選擇(動(dòng)作選擇的時(shí)序)。
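    為便于理解上述四大模塊的相互關(guān)系,下面給出一個(gè)極簡(jiǎn)的多智能體環(huán)境接口示意(假設(shè)性示例,類名MatrixGameEnv及其接口均為說(shuō)明而虛構(gòu),并非正文所述任何具體平臺(tái)):以重復(fù)矩陣博弈為環(huán)境,程序化地對(duì)應(yīng)狀態(tài)空間、動(dòng)作空間、轉(zhuǎn)換函數(shù)與獎(jiǎng)勵(lì)函數(shù)。

```python
import random
from typing import Dict, Tuple

class MatrixGameEnv:
    """極簡(jiǎn)多智能體環(huán)境示意:狀態(tài)空間、動(dòng)作空間、轉(zhuǎn)換函數(shù)、獎(jiǎng)勵(lì)函數(shù)四要素。
    這里以重復(fù)矩陣博弈為例,狀態(tài)僅記錄上一輪聯(lián)合動(dòng)作(假設(shè)性設(shè)定)。"""

    def __init__(self, payoff: Dict[Tuple[int, int], Tuple[float, float]]):
        self.payoff = payoff            # 獎(jiǎng)勵(lì)函數(shù):聯(lián)合動(dòng)作 -> (智能體1獎(jiǎng)勵(lì), 智能體2獎(jiǎng)勵(lì))
        self.n_actions = 2              # 每個(gè)智能體的動(dòng)作空間 {0, 1}
        self.state = None               # 環(huán)境狀態(tài):上一輪聯(lián)合動(dòng)作

    def reset(self):
        self.state = (None, None)
        return self.state

    def step(self, a1: int, a2: int):
        r1, r2 = self.payoff[(a1, a2)]  # 獎(jiǎng)勵(lì)由聯(lián)合動(dòng)作共同決定
        self.state = (a1, a2)           # 轉(zhuǎn)換函數(shù):此處為確定性轉(zhuǎn)移
        return self.state, (r1, r2)

# 用法示意:以“囚徒困境”收益為例,雙方隨機(jī)策略交互若干回合
env = MatrixGameEnv({(0, 0): (3, 3), (0, 1): (0, 5),
                     (1, 0): (5, 0), (1, 1): (1, 1)})
env.reset()
for _ in range(3):
    s, r = env.step(random.randint(0, 1), random.randint(0, 1))
    print(s, r)
```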

    1.2 多智能體學(xué)習(xí)概述

    “多智能體學(xué)習(xí)”需要研究的問(wèn)題是指導(dǎo)和開展研究的指南。Stone等人[24]在2000年就從機(jī)器學(xué)習(xí)的角度綜述分析了多智能體系統(tǒng),主要考慮智能體是同質(zhì)還是異質(zhì),是否可以通信等4種情形。早期相關(guān)綜述文章[25-29]采用公開辯論的方法分別從不同的角度對(duì)多智能體學(xué)習(xí)問(wèn)題進(jìn)行剖析,總結(jié)出多智能體學(xué)習(xí)的4個(gè)明確定義問(wèn)題:?jiǎn)栴}描述、分布式人工智能、博弈均衡和智能體建模[26]。Shoham等人[27]從強(qiáng)化學(xué)習(xí)和博弈論視角自省式提出了“如果多智能體學(xué)習(xí)是答案,那么問(wèn)題是什么?”由于沒有找到一個(gè)單一的答案,他們提出了未來(lái)人工智能研究主要圍繞4個(gè)“主題”:計(jì)算性、描述性、規(guī)范性、規(guī)定性。其中,規(guī)定性又分為分布式、均衡和智能體,此3項(xiàng)如今正指引著多智能體學(xué)習(xí)的研究。Stone[28]試圖回答Shoham的問(wèn)題,但看法剛好相反,強(qiáng)調(diào)多智能體學(xué)習(xí)應(yīng)包含博弈論,如何應(yīng)用多智能體學(xué)習(xí)技術(shù)仍然是一個(gè)開放問(wèn)題,而沒有一個(gè)標(biāo)準(zhǔn)的答案。Tosic等人[29]在2010年就提出了面向多智能體的強(qiáng)化學(xué)習(xí)、協(xié)同學(xué)習(xí)和元學(xué)習(xí)統(tǒng)一框架。Tuyls等人[8]在2018年分析了多智能體學(xué)習(xí)需要研究的5種方法:面向個(gè)人收益的在線強(qiáng)化學(xué)習(xí)、面向社會(huì)福利的在線強(qiáng)化學(xué)習(xí)、協(xié)同演化方法、群體智能和自適應(yīng)機(jī)制設(shè)計(jì)。Tuyls等人[30]在后續(xù)的研究中指出應(yīng)將群體智能[31]、協(xié)同演化[32]、遷移學(xué)習(xí)[33]、非平穩(wěn)性[34]、智能體建模[35]等納入多智能體學(xué)習(xí)方法框架中研究。多智能體學(xué)習(xí)的主流方法主要包括強(qiáng)化學(xué)習(xí)、演化學(xué)習(xí)和元學(xué)習(xí)等內(nèi)容,如圖4所示。

    1.3 多智能體學(xué)習(xí)研究方法分類

    根據(jù)對(duì)多智能體學(xué)習(xí)問(wèn)題的分類描述,可以區(qū)分不同的研究視角與方法。Jant等人[36]很早就從合作與競(jìng)爭(zhēng)兩個(gè)角度對(duì)多智能體學(xué)習(xí)問(wèn)題進(jìn)行了區(qū)分。Panait等人[37]對(duì)合作型多智能體學(xué)習(xí)方法進(jìn)行了概述:團(tuán)隊(duì)學(xué)習(xí),指多智能體以公共的、唯一的學(xué)習(xí)機(jī)制集中學(xué)習(xí)最優(yōu)聯(lián)合策略;并發(fā)學(xué)習(xí),指單個(gè)智能體以相同或不同的個(gè)體學(xué)習(xí)機(jī)制,并發(fā)學(xué)習(xí)最優(yōu)個(gè)體策略。最新研究直接利用多智能體強(qiáng)化學(xué)習(xí)[38-41]方法開展研究。Busoniu等人[38]首次從完全合作、完全競(jìng)爭(zhēng)和混合3類任務(wù)的角度對(duì)多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Hernandez-Leal等人[39]總結(jié)了傳統(tǒng)多智能體系統(tǒng)研究中的經(jīng)典思想(如涌現(xiàn)性行為、學(xué)會(huì)通信交流和對(duì)手建模)是如何融入深度多智能體強(qiáng)化學(xué)習(xí)領(lǐng)域的,并在此基礎(chǔ)上對(duì)深度強(qiáng)化學(xué)習(xí)進(jìn)行了分類。Oroojlooy等人[40]從獨(dú)立學(xué)習(xí)器、全可觀評(píng)價(jià)、值函數(shù)分解、一致性和學(xué)會(huì)通信協(xié)調(diào)5個(gè)方面對(duì)合作多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行了全面回顧分析。Zhang等人[41]對(duì)具有理論收斂性保證和復(fù)雜性分析的多智能體強(qiáng)化學(xué)習(xí)算法進(jìn)行了選擇性分析,并首次對(duì)聯(lián)網(wǎng)智能體分散式、平均場(chǎng)博弈和隨機(jī)勢(shì)博弈多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行綜述分析。Gronauer等人[42]從訓(xùn)練范式與執(zhí)行方案、智能體涌現(xiàn)性行為模式和智能體面臨的六大挑戰(zhàn),即環(huán)境非平穩(wěn)、部分可觀、智能體之間的通信、協(xié)調(diào)、可擴(kuò)展性、信度分配,分析了多智能體深度強(qiáng)化學(xué)習(xí)。Du等人[43]從通信學(xué)習(xí)、智能體建模、面向可擴(kuò)展性的分散式訓(xùn)練分散式執(zhí)行及面向部分可觀性的集中式訓(xùn)練分散式執(zhí)行兩種范式等角度對(duì)多智能體深度強(qiáng)化學(xué)習(xí)進(jìn)行了綜述分析。

    在國(guó)內(nèi),吳軍等人[44]從模型的角度出發(fā),對(duì)面向馬爾可夫決策過(guò)程(Markov decision process, MDP)的集中式和分散式模型,面向馬爾可夫博弈(Markov game, MG)的共同回報(bào)隨機(jī)博弈,零和隨機(jī)博弈和一般和隨機(jī)博弈,共5類模型進(jìn)行了分類分析。杜威等人[45]從完全合作、完全競(jìng)爭(zhēng)和混合型3類任務(wù)分析了多智能體強(qiáng)化學(xué)習(xí)方法。殷昌盛等人[46]對(duì)多智能體分層強(qiáng)化學(xué)習(xí)方法做了綜述分析。梁星星等人[47]從全通信集中決策、全通信自主決策和欠通信自主決策3種范式對(duì)多智能體深度強(qiáng)化學(xué)習(xí)方法進(jìn)行了綜述分析。孫長(zhǎng)銀等人[48]從學(xué)習(xí)算法結(jié)構(gòu)、環(huán)境非靜態(tài)性、部分可觀性、基于學(xué)習(xí)的通信和算法穩(wěn)定性與收斂性共5個(gè)方面分析了多智能體強(qiáng)化學(xué)習(xí)需要研究的重點(diǎn)問(wèn)題。

    2 多智能體博弈學(xué)習(xí)框架

    博弈論可用于多智能體之間的策略交互建模,近年來(lái),基于博弈論的學(xué)習(xí)方法被廣泛嵌入到多智能體的相關(guān)研究問(wèn)題中,多智能體博弈學(xué)習(xí)已然成為當(dāng)前一種新的研究范式。Matignon等人[49]僅對(duì)合作MG的獨(dú)立強(qiáng)化學(xué)習(xí)方法做了綜述分析。Nowe等人[50]從無(wú)狀態(tài)博弈、團(tuán)隊(duì)MG和一般MG三類場(chǎng)景對(duì)多智能體獨(dú)立學(xué)習(xí)和聯(lián)合學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Lu等人[51]從強(qiáng)化學(xué)習(xí)和博弈論的整體視角出發(fā)對(duì)多智能體博弈的解概念、虛擬自對(duì)弈(fictitious self-play, FSP)類方法和反事實(shí)后悔值最小化(counterfactual regret minimization, CFR)類方法進(jìn)行了全面綜述分析。Yang等人[52]對(duì)同等利益博弈、零和博弈、一般和博弈和平均場(chǎng)博弈中的學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Bloembergen等人[53]利用演化博弈學(xué)習(xí)方法分析了各類多智能體強(qiáng)化學(xué)習(xí)方法的博弈動(dòng)態(tài),并揭示了演化博弈論和多智能體強(qiáng)化學(xué)習(xí)方法之間的深刻聯(lián)系。另外,Wong等人[54]從多智能體深度強(qiáng)化學(xué)習(xí)面臨的四大挑戰(zhàn)出發(fā),指出未來(lái)需要研究類人學(xué)習(xí)的方法。

    2.1 多智能體博弈基礎(chǔ)模型及元博弈

    2.1.1 多智能體博弈基礎(chǔ)模型

    MDP常用于人工智能領(lǐng)域單智能體決策問(wèn)題的過(guò)程建?!;跊Q策論的多智能體模型主要有分散式MDP(decentralized MDP, Dec-MDP)及多智能體MDP(multi-agent MDP, MMDP)[55]。其中,Dec-MDP模型每個(gè)智能體獨(dú)立擁有關(guān)于環(huán)境狀態(tài)的觀測(cè),并根據(jù)觀測(cè)到的局部信息選擇自身動(dòng)作;MMDP模型不區(qū)分單個(gè)智能體可利用的私有和全局狀態(tài)信息,采用集中式選擇行動(dòng)策略,然后分配給單個(gè)智能體去執(zhí)行;分散式部分可觀MDP(decentralized partial observable MDP, Dec-POMDP)關(guān)注不確定性條件(動(dòng)作和觀測(cè))下多智能體的動(dòng)作選擇與協(xié)調(diào)。Dec-POMDP模型中智能體的決策是分散式的,每個(gè)智能體根據(jù)自身所獲得的局部觀測(cè)信息獨(dú)立地做出決策。Doshi等[56]提出的交互式POMDP(interactive POMDP, I-POMDP)模型利用遞歸建模方法對(duì)其他智能體的行為進(jìn)行顯式建模,綜合利用博弈論與決策論來(lái)建模問(wèn)題。早在20世紀(jì)50年代,由Shapley提出的隨機(jī)博弈[57],通常也稱作MG[58],常被用來(lái)描述多智能體學(xué)習(xí)。當(dāng)前的一些研究將決策論與博弈論統(tǒng)合起來(lái),認(rèn)為兩類模型都屬于部分可觀隨機(jī)博弈模型[59]。從博弈論視角來(lái)分析,兩大典型博弈模型:隨機(jī)博弈和擴(kuò)展式博弈模型如圖5所示。最新的一些研究利用因子可觀隨機(jī)博弈模型來(lái)建模擴(kuò)展式博弈[58],探索利用強(qiáng)化學(xué)習(xí)等方法求解擴(kuò)展式博弈。

    隨機(jī)博弈模型可分為面向合作的團(tuán)隊(duì)博弈模型、面向競(jìng)爭(zhēng)對(duì)抗的零和博弈模型和面向競(jìng)合(混合)的一般和模型,如圖6所示。其中,團(tuán)隊(duì)博弈可廣泛用于對(duì)抗環(huán)境下的多智能體的合作交互建模,如即時(shí)策略游戲、無(wú)人集群對(duì)抗、聯(lián)網(wǎng)車輛調(diào)度等;零和博弈和一般和博弈常用于雙方或多方交互建模。擴(kuò)展式博弈包括兩種子類型,正則式表示[4]常用于同時(shí)行動(dòng)交互場(chǎng)景描述,序貫式表示[60]常用于行為策略多階段交互場(chǎng)景描述,回合制博弈[61]常用于雙方交替決策場(chǎng)景。

    2.1.2 元博弈模型

    元博弈,即博弈的博弈,常用于博弈策略空間分析[62],是研究經(jīng)驗(yàn)博弈理論分析(empirical game theoretic analysis,EGTA)的基礎(chǔ)模型[63]。目前,已廣泛應(yīng)用于各種可采用模擬器仿真的現(xiàn)實(shí)場(chǎng)景:供應(yīng)鏈管理分析、廣告拍賣和能源市場(chǎng);設(shè)計(jì)網(wǎng)絡(luò)路由協(xié)議,公共資源管理;對(duì)抗策略選擇、博弈策略動(dòng)態(tài)分析等。博弈論與元博弈的相關(guān)要素對(duì)比如表1所示。

    近年來(lái),一些研究對(duì)博弈的策略空間幾何形態(tài)進(jìn)行了探索。Jiang等人[64]首次利用組合霍奇(Hodge)理論研究圖上的亥姆霍茲(Helmholtz)分解。Candogan等人[65]探索了策略博弈的流表示,提出策略博弈主要由勢(shì)部分、調(diào)和部分和非策略部分組成。Hwang等人[66]從策略等價(jià)的角度研究了正則式博弈的分解方法。Balduzzi等人[67]研究提出任何一個(gè)泛函式博弈(functional-form game, FFG)可以做直和分解成傳遞壓制博弈和循環(huán)壓制博弈。對(duì)于對(duì)稱(單種群)零和博弈,可以采用舒爾分解、主成分分析、奇異值分解、t分布隨機(jī)鄰域嵌入等方法分析博弈的策略空間形態(tài)結(jié)構(gòu),如圖7所示。其中,40個(gè)智能體的策略評(píng)估矩陣及二維嵌入,顏色從紅至綠對(duì)應(yīng)歸一化至[-1,1]范圍的平均收益值,完全傳遞壓制博弈的二維嵌入近似一條線、完全循環(huán)壓制博弈的二維嵌入近似一個(gè)環(huán)。

    Omidshafiei等人[7]利用智能體的對(duì)抗數(shù)據(jù),根據(jù)博弈收益,依次繪制響應(yīng)圖、直方圖,得到譜響應(yīng)圖、聚合響應(yīng)圖和收縮響應(yīng)圖,采用圖論對(duì)傳遞博弈與循環(huán)博弈進(jìn)行拓?fù)浞治觯L制智能體的博弈策略特征圖,得出傳遞博弈與循環(huán)博弈特征距離較遠(yuǎn)。Czarnecki等人[68]根據(jù)現(xiàn)實(shí)世界中的各類博弈策略的空間分析提出博弈策略空間的陀螺幾何體模型猜想,如圖8所示,縱向表示傳遞壓制維,幾何體頂端為博弈的納什均衡,表征了策略之間的壓制關(guān)系,橫向表示循環(huán)壓制維,表征了策略之間可能存在的首尾嵌套非傳遞性壓制關(guān)系。

    關(guān)于如何度量博弈策略的循環(huán)性壓制,即非傳遞性壓制,Czarnecki等人[68]指出可以采用策略集鄰接矩陣A(每個(gè)節(jié)點(diǎn)代表一個(gè)策略,如果策略i壓制策略j,則A_ij=1),通過(guò)計(jì)算diag(A^3)可以得到每個(gè)策略所處的長(zhǎng)度為3的循環(huán)壓制環(huán)計(jì)數(shù),但由于節(jié)點(diǎn)可能重復(fù)訪問(wèn),diag(A^p)無(wú)法適用于更長(zhǎng)循環(huán)策略的計(jì)數(shù)。此外,納什聚類方法也可用于分析循環(huán)壓制環(huán)的長(zhǎng)度,其中傳遞性壓制對(duì)應(yīng)所在聚類的索引,循環(huán)壓制對(duì)應(yīng)聚類類別的大小。Sanjaya等人[69]利用真實(shí)的國(guó)際象棋比賽數(shù)據(jù)實(shí)證分析了人類玩家策略的循環(huán)性壓制。此類結(jié)論表明,只有當(dāng)智能體的策略種群足夠大后,才能克服循環(huán)性壓制并產(chǎn)生相變,學(xué)習(xí)收斂至更強(qiáng)的近似納什均衡策略。
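    作為上述度量方式的一個(gè)數(shù)值示意(假設(shè)性示例,以三策略的“石頭剪刀布”式壓制關(guān)系為例),下面的代碼利用NumPy計(jì)算diag(A^3),其每個(gè)元素即經(jīng)過(guò)對(duì)應(yīng)策略、長(zhǎng)度為3的循環(huán)壓制環(huán)計(jì)數(shù)。

```python
import numpy as np

# 鄰接矩陣 A:A[i, j] = 1 表示策略 i 壓制策略 j(此處為一個(gè)三元循環(huán)壓制關(guān)系)
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

A3 = np.linalg.matrix_power(A, 3)
# diag(A^3) 給出每個(gè)策略參與的長(zhǎng)度為 3 的循環(huán)壓制環(huán)計(jì)數(shù)
print(np.diag(A3))             # 輸出 [1 1 1]:每個(gè)策略各處于一個(gè)三元循環(huán)中
print(int(np.trace(A3)) // 3)  # 除以 3 得到不同三元循環(huán)環(huán)的個(gè)數(shù)(此處為 1)
```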

    Tuyls等人[70]證明了元博弈的納什均衡是原始博弈的2ε-納什均衡,并利用Hoeffding不等式給出了批處理單獨(dú)采樣和均勻采樣兩種情況下的均衡概率收斂的有效樣本需求界。Viqueira等人[71]利用Hoeffding界和Rademacher復(fù)雜性分析了元博弈,得出基于仿真學(xué)習(xí)到的博弈均衡以很高概率保證是元博弈的近似均衡,同時(shí)元博弈的近似均衡是仿真博弈的近似均衡。

    2.2 均衡解概念與博弈動(dòng)力學(xué)

    2.2.1 均衡解概念

    從博弈論視角分析多智能體學(xué)習(xí)需要對(duì)其中的博弈均衡解概念做細(xì)致分析。許多博弈沒有純納什均衡,但一定存在混合納什均衡,如圖9所示。比較而言,相關(guān)均衡容易計(jì)算,粗相關(guān)均衡非常容易計(jì)算[72]。

    由于學(xué)習(xí)場(chǎng)景和目標(biāo)的差別,一些新的均衡解概念也被采納:面向安全攻防博弈的斯坦克爾伯格均衡[73],面向有限理性的量子響應(yīng)均衡[74],面向演化博弈的演化穩(wěn)定策略[53],面向策略空間博弈的元博弈均衡[75],穩(wěn)定對(duì)抗干擾的魯棒均衡[76]也稱顫抖手均衡[77],處理非完備信息的貝葉斯均衡[78],處理在線決策的無(wú)悔或最小后悔值[79],描述智能體在沒有使其他智能體情況變壞的前提下使得自身策略變好的帕累托最優(yōu)[80],以及面向常和隨機(jī)博弈的馬爾可夫完美均衡[81]等。近年來(lái),一些研究采用團(tuán)隊(duì)最大最小均衡[82]來(lái)描述零和博弈場(chǎng)景下組隊(duì)智能體對(duì)抗單個(gè)智能體,其本質(zhì)是一類對(duì)抗團(tuán)隊(duì)博弈[83]模型,可用于解決網(wǎng)絡(luò)阻斷[84]類問(wèn)題、多人撲克[85]問(wèn)題和橋牌問(wèn)題[86]。同樣,一些基于“相關(guān)均衡”[87]解概念的新模型相繼被提出,應(yīng)用于元博弈[88]、擴(kuò)展式博弈[89]、一般和博弈[90]、零和同時(shí)行動(dòng)隨機(jī)博弈[91]等。正是由于均衡解的計(jì)算復(fù)雜度比較高,當(dāng)前一些近似均衡的解概念得到了廣泛運(yùn)用,如最佳響應(yīng)[92]和預(yù)言機(jī)[93]等。

    2.2.2 博弈動(dòng)力學(xué)

    博弈原本就是描述個(gè)體之間的動(dòng)態(tài)交互過(guò)程。對(duì)于一般的勢(shì)博弈,從任意一個(gè)局勢(shì)開始,最佳響應(yīng)動(dòng)力學(xué)可確保收斂到一個(gè)純納什均衡[94]。最佳響應(yīng)動(dòng)力學(xué)過(guò)程十分直接,每個(gè)智能體可以通過(guò)連續(xù)性的單方策略改變來(lái)搜索博弈的純策略納什均衡。

    最佳響應(yīng)動(dòng)力學(xué):只要當(dāng)前的局勢(shì)s不是一個(gè)純納什均衡,任意選擇一個(gè)智能體i以及一個(gè)對(duì)i有利的策略改變s′_i,然后更新局勢(shì)為(s′_i, s_{-i})。
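    以一個(gè)兩人協(xié)調(diào)博弈(勢(shì)博弈)為例,下面給出最佳響應(yīng)動(dòng)力學(xué)的極簡(jiǎn)示意(假設(shè)性實(shí)現(xiàn),收益矩陣與函數(shù)名均為說(shuō)明而設(shè)):反復(fù)選取可單方改進(jìn)的智能體并更新其策略,直至到達(dá)純策略納什均衡。

```python
import numpy as np

# 兩人對(duì)稱協(xié)調(diào)博弈(勢(shì)博弈):雙方動(dòng)作一致時(shí)共同收益更高
payoff = np.array([[2.0, 0.0],
                   [0.0, 1.0]])   # payoff[a1, a2] 為雙方共同收益

def best_response_dynamics(s, max_iter=100):
    """最佳響應(yīng)動(dòng)力學(xué):只要局勢(shì)不是純納什均衡,就讓某個(gè)智能體做有利的單方偏移。"""
    for _ in range(max_iter):
        improved = False
        for i in (0, 1):
            other = s[1 - i]
            # 智能體 i 在對(duì)方動(dòng)作固定時(shí)的最佳響應(yīng)動(dòng)作
            br = int(np.argmax(payoff[:, other] if i == 0 else payoff[other, :]))
            new_s = (br, other) if i == 0 else (other, br)
            if payoff[new_s] > payoff[s]:   # 有利偏移:更新局勢(shì)
                s, improved = new_s, True
        if not improved:                    # 無(wú)智能體可單方改進(jìn):到達(dá)純納什均衡
            return s
    return s

print(best_response_dynamics((0, 1)))       # 輸出 (1, 1):到達(dá)純策略納什均衡
```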

    最佳響應(yīng)動(dòng)力學(xué)只能收斂到純策略納什均衡,且與勢(shì)博弈緊密相關(guān);而在任意有限博弈中,無(wú)悔學(xué)習(xí)動(dòng)力學(xué)可確保收斂到粗相關(guān)均衡[95]。對(duì)任意時(shí)間點(diǎn)t=1,2,…,T,給定其他智能體的混合策略σ^t_{-i}=∏_{j≠i}p^t_j,每個(gè)智能體i使用無(wú)悔算法獨(dú)立地選擇一個(gè)混合策略p^t_i并獲得收益向量c^t_i,則智能體i選擇純策略s_i的期望收益為π^t_i(s_i)=E_{s^t_{-i}~σ^t_{-i}}[π_i(s_i, s^t_{-i})]。

    無(wú)悔學(xué)習(xí)方法:如果對(duì)于任意ε>0,都存在一個(gè)充分大的時(shí)間域T=T(ε)使得對(duì)于在線決策算法M的任意對(duì)手,決策者的后悔值最多為ε,將稱方法M為無(wú)悔的。
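    下面給出Hedge(乘性權(quán)重更新)這一經(jīng)典無(wú)悔學(xué)習(xí)方法的極簡(jiǎn)示意(假設(shè)性實(shí)現(xiàn),假定每輪收益取值于[0,1],步長(zhǎng)eta為示例取值):隨著T增大,其平均后悔值趨近于0。

```python
import numpy as np

def hedge(payoff_stream, eta=0.1):
    """Hedge/乘性權(quán)重更新:每輪按指數(shù)權(quán)重選擇混合策略,累積收益越高權(quán)重越大。
    payoff_stream: 形狀為 (T, K) 的收益序列,元素取值假定落在 [0, 1]。"""
    T, K = payoff_stream.shape
    w = np.ones(K)                      # 各純策略的權(quán)重
    cum_payoff, cum_best = 0.0, np.zeros(K)
    for t in range(T):
        p = w / w.sum()                 # 當(dāng)前混合策略
        u = payoff_stream[t]
        cum_payoff += p @ u             # 算法獲得的期望收益
        cum_best += u                   # 各純策略的累積收益
        w *= np.exp(eta * u)            # 乘性權(quán)重更新
        w /= w.max()                    # 數(shù)值穩(wěn)定:按比例縮放不影響混合策略
    regret = cum_best.max() - cum_payoff
    return regret / T                   # 平均后悔值

rng = np.random.default_rng(0)
stream = rng.random((5000, 3))          # 隨機(jī)收益序列,3個(gè)純策略
print(hedge(stream))                    # 平均后悔值應(yīng)接近 0
```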

    無(wú)交換后悔動(dòng)力學(xué)可確保學(xué)習(xí)收斂至相關(guān)均衡[63]。相關(guān)均衡與無(wú)交換后悔動(dòng)力學(xué)之間的聯(lián)系,正如粗相關(guān)均衡與無(wú)悔動(dòng)力學(xué)之間的聯(lián)系一樣。

    無(wú)交換后悔學(xué)習(xí)方法:如果對(duì)于任意ε>0,都存在一個(gè)充分大的時(shí)間域T=T(ε)使得對(duì)于在線決策方法M的任意對(duì)手,決策者的期望交換后悔值最多為ε,將稱方法M為無(wú)交換后悔的。

    對(duì)于多智能體之間的動(dòng)態(tài)交互一般可以采用種群演化博弈理論里的復(fù)制者動(dòng)態(tài)方程[53]或偏微分方程[96]進(jìn)行描述。Leonardos等人[97]利用突變理論證明了軟Q學(xué)習(xí)在異質(zhì)學(xué)習(xí)智能體的加權(quán)勢(shì)博弈中總能收斂到量子響應(yīng)均衡。
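    作為博弈動(dòng)力學(xué)的數(shù)值示意,下面用歐拉法對(duì)復(fù)制者動(dòng)態(tài)方程 dx_i/dt = x_i[(Ax)_i − x^T A x] 做簡(jiǎn)單積分(假設(shè)性示例,以石頭剪刀布收益矩陣為例,步長(zhǎng)dt為示例取值)。

```python
import numpy as np

# 石頭剪刀布收益矩陣(零和、循環(huán)壓制)
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def replicator_step(x, dt=0.01):
    """復(fù)制者動(dòng)態(tài):x_i 的增長(zhǎng)率正比于其適應(yīng)度與種群平均適應(yīng)度之差。"""
    fitness = A @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

x = np.array([0.6, 0.3, 0.1])            # 初始策略分布
for _ in range(5000):
    x = replicator_step(x)
print(x)                                  # 策略分布圍繞均勻混合策略 (1/3, 1/3, 1/3) 演化
```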

    2.3 多智能體博弈學(xué)習(xí)的挑戰(zhàn)

    2.3.1 學(xué)習(xí)目標(biāo)多樣

    學(xué)習(xí)目標(biāo)支配著多智能體學(xué)習(xí)的整個(gè)過(guò)程,為學(xué)習(xí)方法的評(píng)估提供了依據(jù)。Powers等人[98]在2004年將多智能體學(xué)習(xí)的學(xué)習(xí)目標(biāo)歸類為:理性、收斂性、安全性、一致性、相容性、目標(biāo)最優(yōu)性等。Busoniu等人[38]將學(xué)習(xí)的目標(biāo)歸納為兩大類:穩(wěn)定性(收斂性、均衡學(xué)習(xí)、可預(yù)測(cè)、對(duì)手無(wú)關(guān)性)和適應(yīng)性(理性、無(wú)悔、目標(biāo)最優(yōu)性、安全性、對(duì)手察覺)。Digiovanni等人[99]將帕累托有效性也看作是多智能體學(xué)習(xí)目標(biāo)。多智能體學(xué)習(xí)目標(biāo)如表2所示,穩(wěn)定性表征了學(xué)習(xí)到一個(gè)平穩(wěn)策略的能力,收斂到某個(gè)均衡解,可學(xué)習(xí)近似模型用于預(yù)測(cè)推理,學(xué)習(xí)到的平穩(wěn)策略與對(duì)手無(wú)關(guān);適應(yīng)性表征了智能體能夠根據(jù)所處環(huán)境,感知對(duì)手狀態(tài),理性分析對(duì)手模型,做出最佳響應(yīng),在線博弈時(shí)可以學(xué)習(xí)一個(gè)回報(bào)不差于平穩(wěn)策略的無(wú)悔響應(yīng);目標(biāo)最優(yōu)、相容性與帕累托有效性、安全性表征了其他智能體可能采用固定策略、自對(duì)弈學(xué)習(xí)方法時(shí),當(dāng)前智能體仍能適應(yīng)變化的對(duì)手,達(dá)到目標(biāo)最優(yōu)的適應(yīng)性要求。

    2.3.2 環(huán)境(對(duì)手)非平穩(wěn)

    多智能體學(xué)習(xí)過(guò)程中,環(huán)境狀態(tài)和獎(jiǎng)勵(lì)都是由所有智能體的動(dòng)作共同決定的;各智能體的策略都根據(jù)獎(jiǎng)勵(lì)同時(shí)優(yōu)化;每個(gè)智能體只能控制自身策略。基于這3個(gè)特點(diǎn),非平穩(wěn)性成為影響多智能體學(xué)習(xí)求解最優(yōu)聯(lián)合策略的阻礙,并發(fā)學(xué)習(xí)的非平穩(wěn)性包括策略非平穩(wěn)性和個(gè)體策略學(xué)習(xí)環(huán)境非平穩(wěn)性。當(dāng)某個(gè)智能體根據(jù)其他智能體的策略調(diào)整自身策略以求達(dá)到更好的協(xié)作效果時(shí),其他智能體也相應(yīng)地為了適應(yīng)該智能體的策略調(diào)整了自己的策略,這就導(dǎo)致該智能體調(diào)整策略的依據(jù)已經(jīng)“過(guò)時(shí)”,從而無(wú)法達(dá)到良好的協(xié)調(diào)效果。從優(yōu)化的角度看,其他智能體策略的非平穩(wěn)性導(dǎo)致智能體自身策略的優(yōu)化目標(biāo)是動(dòng)態(tài)的,從而造成各智能體策略相互適應(yīng)的滯后性。非平穩(wěn)性作為多智能體問(wèn)題面臨的最大挑戰(zhàn),如圖10所示,當(dāng)前的處理方法主要有五大類:無(wú)視[109],假設(shè)環(huán)境(對(duì)手)是平穩(wěn)的;遺忘[110],采用無(wú)模型方法,忘記過(guò)去的信息同時(shí)更新最新的觀測(cè);標(biāo)定對(duì)手模型[111],針對(duì)預(yù)定義對(duì)手進(jìn)行己方策略優(yōu)化;學(xué)習(xí)對(duì)手模型的方法[112],采用基于模型的學(xué)習(xí)方法學(xué)習(xí)對(duì)手行動(dòng)策略;基于心智理論的遞歸推理方法[113],智能體采用認(rèn)知層次理論遞歸推理雙方策略。

    面對(duì)有限理性或欺騙型對(duì)手,對(duì)手建模(也稱智能體建模)已然成為智能體博弈對(duì)抗時(shí)必須擁有的能力[114],它與集中式訓(xùn)練分散式執(zhí)行、元學(xué)習(xí)、多智能體通信建模等一同為非平穩(wěn)問(wèn)題的處理提供了技術(shù)支撐[115]。

    2.3.3 均衡難解且易變

    由于狀態(tài)空間和智能體數(shù)量的增加,多智能體學(xué)習(xí)問(wèn)題的計(jì)算復(fù)雜度比較大。計(jì)算兩人(局中人常用于博弈模型描述,智能體常用于學(xué)習(xí)類模型描述,本文部分語(yǔ)境中兩者等價(jià))零和博弈的納什均衡解是多項(xiàng)式時(shí)間內(nèi)可解問(wèn)題[94],兩人一般和博弈的納什均衡解是有向圖的多項(xiàng)式奇偶性論據(jù)(polynomial parity argument on directed graphs, PPAD)難問(wèn)題[116],納什均衡的存在性判定問(wèn)題是非確定性多項(xiàng)式(non-deterministic polynomial, NP)時(shí)間計(jì)算難問(wèn)題[117],隨機(jī)博弈的純策略納什均衡存在性判定問(wèn)題是多項(xiàng)式空間(polynomial space, PSPACE)難問(wèn)題[118]。多人博弈更是面臨“納什均衡存在性”“計(jì)算復(fù)雜度高”“均衡選擇難”等挑戰(zhàn)。

    對(duì)于多智能體場(chǎng)景,如果采用每個(gè)智能體獨(dú)立計(jì)算納什均衡策略,那么策略組合可能并不是真實(shí)的全體納什均衡,且個(gè)別智能體可能具有多個(gè)均衡策略、偏離動(dòng)機(jī)等。檸檬水站位博弈[119]如圖11所示,每個(gè)智能體需要在圓環(huán)中找到一個(gè)站位,使自己與其他所有智能體的距離總和最大(見圖11(a)),所有智能體沿環(huán)均勻分布就是納什均衡,由于這種分布有無(wú)限多種方式實(shí)現(xiàn),因此納什均衡的個(gè)數(shù)無(wú)限多,原問(wèn)題變成了“均衡選擇”問(wèn)題,但如果每個(gè)智能體都獨(dú)立計(jì)算各自的納什均衡策略,那么組合策略可能并非整體的納什均衡策略(見圖11(b))。

    正是由于多維目標(biāo)、非平穩(wěn)環(huán)境、大規(guī)模狀態(tài)行為空間、不完全信息與不確定性因素等影響,高度復(fù)雜的多智能體學(xué)習(xí)問(wèn)題面臨諸多挑戰(zhàn),已然十分難以求解。

    3 多智能體博弈學(xué)習(xí)方法

    根據(jù)多智能體博弈對(duì)抗的場(chǎng)景(離線和在線)的不同,可以將多智能體博弈策略學(xué)習(xí)方法分為離線學(xué)習(xí)預(yù)訓(xùn)練/藍(lán)圖策略的方法與在線學(xué)習(xí)適變/反制策略的方法等。

    3.1 離線場(chǎng)景博弈策略學(xué)習(xí)方法

    3.1.1 隨機(jī)博弈策略學(xué)習(xí)方法

    當(dāng)前,直接面向博弈均衡的學(xué)習(xí)方法主要為一類基于值函數(shù)的策略學(xué)習(xí)。根據(jù)博弈類型(合作博弈、零和博弈及一般和博弈)的不同,均衡學(xué)習(xí)方法主要分為三大類,如表3所示。其中,Team Q[106]是一種直接學(xué)習(xí)聯(lián)合策略的方法;Distributed Q[120]采用樂(lè)觀單調(diào)更新本地策略,可收斂到最優(yōu)聯(lián)合策略;JAL(joint action learner)[121]方法通過(guò)將強(qiáng)化學(xué)習(xí)與均衡學(xué)習(xí)方法相結(jié)合來(lái)學(xué)習(xí)自己的行動(dòng)與其他智能體的行動(dòng)值函數(shù);OAL(optimal adaptive learning)方法[122]是一種最優(yōu)自適應(yīng)學(xué)習(xí)方法,通過(guò)構(gòu)建弱非循環(huán)博弈來(lái)學(xué)習(xí)博弈結(jié)構(gòu),消除所有次優(yōu)聯(lián)合動(dòng)作,被證明可以收斂至最優(yōu)聯(lián)合策略;Decentralized Q[123]是一類基于OAL的方法,被證明可漸近收斂至最優(yōu)聯(lián)合策略。Minimax Q方法[106]應(yīng)用于兩人零和隨機(jī)博弈。Nash Q方法[124]將Minimax Q方法從零和博弈擴(kuò)展到多人一般和博弈;相關(guān)均衡(correlated equilibrium, CE)Q方法[125]是一類圍繞相關(guān)均衡的多智能體Q學(xué)習(xí)方法;Asymmetric Q[126]是一類圍繞斯坦克爾伯格均衡的多智能體Q學(xué)習(xí)方法;敵或友Q(friend-or-foe Q, FFQ)學(xué)習(xí)方法[127]將其他所有智能體分為兩組,一組為朋友,可幫助一起最大化獎(jiǎng)勵(lì)回報(bào),另一組為敵人,試圖降低獎(jiǎng)勵(lì)回報(bào);贏或快學(xué)(win or learn fast, WoLF)方法[100]通過(guò)設(shè)置有利和不利兩種情況下的策略更新步長(zhǎng)學(xué)習(xí)最優(yōu)策略。此外,這類方法還有無(wú)窮小梯度上升(infinitesimal gradient ascent, IGA)[128]、廣義IGA(generalized IGA, GIGA)[129]、適應(yīng)平衡或均衡(adapt when everybody is stationary otherwise move to equilibrium, AWESOME)[130]等。
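    以兩人零和隨機(jī)博弈為例,下面給出Minimax Q方法核心步驟的簡(jiǎn)化示意(假設(shè)性實(shí)現(xiàn),函數(shù)名與參數(shù)均為說(shuō)明而設(shè)):每個(gè)狀態(tài)的值由該狀態(tài)下階段矩陣博弈的極大極小值給出,可借助線性規(guī)劃求解。

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(Q):
    """求解矩陣博弈 max_p min_j p^T Q[:, j]:返回行玩家的極大極小混合策略與博弈值。"""
    m, n = Q.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # 變量為 [p_1..p_m, v],最大化 v 即最小化 -v
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])      # 對(duì)每列 j: v - p^T Q[:, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                         # 混合策略歸一化約束
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

def minimax_q_update(q_sao, r, v_next, alpha=0.1, gamma=0.9):
    """Minimax Q 的單步更新:Q(s,a,o) <- (1-alpha)Q + alpha(r + gamma * V(s'))。"""
    return (1 - alpha) * q_sao + alpha * (r + gamma * v_next)

# 用法示意:某狀態(tài)下的階段博弈收益(行玩家視角,零和)
Q_stage = np.array([[0., -1., 1.],
                    [1., 0., -1.],
                    [-1., 1., 0.]])
policy, value = maximin_value(Q_stage)
print(policy, value)                       # 近似均勻混合策略,博弈值約為 0
print(minimax_q_update(0.0, 1.0, value))   # 用狀態(tài)值回傳更新某個(gè) Q(s,a,o)
```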

    當(dāng)前,多智能體強(qiáng)化學(xué)習(xí)方法得到了廣泛研究,但此類方法的學(xué)習(xí)目標(biāo)是博弈最佳響應(yīng)。研究人員陸續(xù)采用獨(dú)立學(xué)習(xí)、聯(lián)合學(xué)習(xí)、集中式訓(xùn)練分散式執(zhí)行、利用協(xié)作圖等多種方法設(shè)計(jì)多智能體強(qiáng)化學(xué)習(xí)方法。本文根據(jù)訓(xùn)練和執(zhí)行方式,將多智能體強(qiáng)化學(xué)習(xí)方法分為四類:完全分散式、完全集中式、集中式訓(xùn)練分散式執(zhí)行和聯(lián)網(wǎng)分散式訓(xùn)練。如表4所示。

    對(duì)于完全分散式學(xué)習(xí)方法,研究者們?cè)讵?dú)立Q學(xué)習(xí)方法的基礎(chǔ)上進(jìn)行了價(jià)值函數(shù)更新方式的改進(jìn),Distributed Q學(xué)習(xí)方法[131],將智能體的個(gè)體動(dòng)作價(jià)值函數(shù)視為聯(lián)合動(dòng)作價(jià)值函數(shù)的樂(lè)觀映射,設(shè)置價(jià)值函數(shù)只有在智能體與環(huán)境和其他智能體的交互使對(duì)應(yīng)動(dòng)作的價(jià)值函數(shù)增大時(shí)才更新。而Hysteretic Q學(xué)習(xí)方法[132]通過(guò)啟發(fā)式信息區(qū)分“獎(jiǎng)勵(lì)”和“懲罰”兩種情況,分別設(shè)置兩個(gè)差別較大的學(xué)習(xí)率克服隨機(jī)變化的環(huán)境狀態(tài)和多最優(yōu)聯(lián)合策略情況。頻率最大Q值(frequency maximum Q, FMQ)方法[133]引入最大獎(jiǎng)勵(lì)頻率這一啟發(fā)信息,使智能體在進(jìn)行動(dòng)作選擇時(shí)傾向曾經(jīng)導(dǎo)致最大獎(jiǎng)勵(lì)的動(dòng)作,鼓勵(lì)智能體的個(gè)體策略函數(shù)通過(guò)在探索時(shí)傾向曾經(jīng)頻繁獲得最大獎(jiǎng)勵(lì)的策略,提高與其他智能體策略協(xié)調(diào)的可能性。Lenient式多智能體強(qiáng)化學(xué)習(xí)方法[134]采用忽略低回報(bào)行為的寬容式學(xué)習(xí)方法。Distributed Lenient Q[135]采用分布式的方法組織Lenient值函數(shù)的學(xué)習(xí)。

    對(duì)于完全集中式學(xué)習(xí)方法,通信網(wǎng)絡(luò)(communication network, CommNet)方法[136]是一種基于中心化的多智能體協(xié)同決策方法,所有的智能體模塊網(wǎng)絡(luò)會(huì)進(jìn)行參數(shù)共享,獎(jiǎng)勵(lì)通過(guò)平均的方式分配給每個(gè)智能體。該方法接收所有智能體的局部觀察作為輸入,然后輸出所有智能體的決策,因此輸入數(shù)據(jù)維度過(guò)大會(huì)給方法訓(xùn)練造成困難。雙向協(xié)調(diào)網(wǎng)絡(luò)(bidirectionally coordinated network, BiCNet)方法[137]通過(guò)一個(gè)基于雙向循環(huán)神經(jīng)網(wǎng)絡(luò)的確定性行動(dòng)者-評(píng)論家(actor-critic, AC)結(jié)構(gòu)來(lái)學(xué)習(xí)多智能體之間的通信協(xié)議,在無(wú)監(jiān)督情況下,可以學(xué)習(xí)各種類型的高級(jí)協(xié)調(diào)策略。集中式訓(xùn)練分散式執(zhí)行為解決多智能體問(wèn)題提供了一種比較通用的框架。反事實(shí)多智能體(counterfactual multi-agent, COMA)方法[138]旨在解決Dec-POMDP問(wèn)題中的多智能體信度分配問(wèn)題:在合作環(huán)境中,聯(lián)合動(dòng)作通常只會(huì)產(chǎn)生全局性的收益,這使得每個(gè)智能體很難推斷出自己對(duì)團(tuán)隊(duì)成功的貢獻(xiàn)。該方法采用反事實(shí)思維,使用一個(gè)反事實(shí)基線,將單個(gè)智能體的行為邊際化,同時(shí)保持其他智能體的行為固定。COMA基于AC實(shí)現(xiàn)了集中訓(xùn)練分散執(zhí)行,適用于合作型任務(wù)。多智能體確定性策略梯度(multi-agent deterministic policy gradient, MADDPG)方法[139]是DDPG方法為適應(yīng)多智能體環(huán)境所做的改進(jìn),最核心的部分就是每個(gè)智能體擁有自己獨(dú)立的AC網(wǎng)絡(luò)和獨(dú)立的回報(bào)函數(shù),critic部分能夠獲取其他所有智能體的動(dòng)作信息,進(jìn)行中心化訓(xùn)練和非中心化執(zhí)行,即在訓(xùn)練的時(shí)候,引入可以觀察全局的critic來(lái)指導(dǎo)訓(xùn)練,而測(cè)試階段便不再有任何通信交流,只使用有局部觀測(cè)的actor采取行動(dòng)。因此,MADDPG方法可以同時(shí)解決協(xié)作環(huán)境、競(jìng)爭(zhēng)環(huán)境以及混合環(huán)境下的多智能體問(wèn)題。多智能體軟Q學(xué)習(xí)(multi-agent soft Q learning, MASQL)[140]方法利用最大熵構(gòu)造軟值函數(shù),來(lái)解決多智能體環(huán)境中廣泛出現(xiàn)的“相對(duì)過(guò)泛化”所引起的最優(yōu)動(dòng)作遮蔽問(wèn)題。

    此外,值分解網(wǎng)絡(luò)(value-decomposition networks, VDN)[141]、Q混合(Q mix, QMIX)[142]、多智能體變分探索(multi-agent variational exploration, MAVEN)[143]、Q變換(Q transformation, QTRAN)[144]等方法采用值函數(shù)分解的思想,按照智能體對(duì)環(huán)境的聯(lián)合回報(bào)的貢獻(xiàn)大小分解全局Q函數(shù),很好地解決了信度分配問(wèn)題,但是現(xiàn)有分解機(jī)制缺乏普適性。VDN方法基于深度并發(fā)Q網(wǎng)絡(luò)(deep recurrent Q-network, DRQN)提出了值分解網(wǎng)絡(luò)架構(gòu),中心化地訓(xùn)練一個(gè)由所有智能體局部的Q網(wǎng)絡(luò)加和得到的聯(lián)合Q網(wǎng)絡(luò),訓(xùn)練完畢后每個(gè)智能體擁有只基于自身局部觀察的Q網(wǎng)絡(luò),可以實(shí)現(xiàn)去中心化執(zhí)行。該方法解耦了智能體之間復(fù)雜的關(guān)系,還解決了由于部分可觀察導(dǎo)致的偽收益和懶惰智能體問(wèn)題。由于VDN求解聯(lián)合價(jià)值函數(shù)時(shí)只是通過(guò)對(duì)單智能體的價(jià)值函數(shù)簡(jiǎn)單求和得到,使得學(xué)到的局部Q值函數(shù)表達(dá)能力有限,無(wú)法表征智能體之間更復(fù)雜的相互關(guān)系,QMIX對(duì)從單智能體價(jià)值函數(shù)到團(tuán)隊(duì)價(jià)值函數(shù)之間的映射關(guān)系進(jìn)行了改進(jìn),在映射的過(guò)程中將原來(lái)的線性映射換為非線性映射,并通過(guò)超網(wǎng)絡(luò)的引入將額外狀態(tài)信息加入到映射過(guò)程,提高了模型性能。MAVEN采用了增加互信息變分探索的方法,通過(guò)引入一個(gè)面向?qū)哟慰刂频碾[層空間來(lái)混合基于值和基于策略的學(xué)習(xí)方法。QTRAN提出了一種更加泛化的值分解方法,從而成功分解任何可分解的任務(wù),但是對(duì)于無(wú)法分解的協(xié)作任務(wù)的問(wèn)題并未涉及。Q行列式點(diǎn)過(guò)程(Q determinantal point process, Q-DPP)[145]方法采用行列式點(diǎn)過(guò)程方法度量多樣性,加速策略探索。多智能體近端策略優(yōu)化(multi-agent proximal policy optimization, MAPPO)[146]方法直接采用多個(gè)PPO算法,結(jié)合廣義優(yōu)勢(shì)估計(jì)、觀測(cè)和層歸一化、梯度和值函數(shù)裁剪等實(shí)踐技巧,在多類合作場(chǎng)景中表現(xiàn)較好。Shapley Q學(xué)習(xí)方法[147]采用合作博弈理論建模,利用Shapley值來(lái)引導(dǎo)值函數(shù)分解,為信度分配提供了可解釋方案。
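    以VDN的加和式值分解為例,下面用PyTorch給出“聯(lián)合Q值=各智能體基于局部觀測(cè)的Q值之和”這一混合方式的極簡(jiǎn)示意(假設(shè)性實(shí)現(xiàn),網(wǎng)絡(luò)結(jié)構(gòu)與維度均為示例設(shè)定,不含完整的集中式訓(xùn)練流程)。

```python
import torch
import torch.nn as nn

class LocalQNet(nn.Module):
    """單個(gè)智能體基于局部觀測(cè)的Q網(wǎng)絡(luò)(示例結(jié)構(gòu))。"""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

def vdn_joint_q(local_qs, actions):
    """VDN混合:聯(lián)合Q值為各智能體所選動(dòng)作Q值的簡(jiǎn)單求和,便于集中訓(xùn)練、分散執(zhí)行。"""
    chosen = [q.gather(1, a.unsqueeze(1)).squeeze(1) for q, a in zip(local_qs, actions)]
    return torch.stack(chosen, dim=0).sum(dim=0)

# 用法示意:2個(gè)智能體、批大小為4
agents = [LocalQNet(obs_dim=8, n_actions=5) for _ in range(2)]
obs = [torch.randn(4, 8) for _ in range(2)]
acts = [torch.randint(0, 5, (4,)) for _ in range(2)]
q_tot = vdn_joint_q([net(o) for net, o in zip(agents, obs)], acts)
print(q_tot.shape)      # torch.Size([4]):每個(gè)樣本一個(gè)聯(lián)合Q值,可用TD誤差集中訓(xùn)練
```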

    聯(lián)網(wǎng)分散式訓(xùn)練方法是一類利用時(shí)變通信網(wǎng)絡(luò)的學(xué)習(xí)方法,其決策過(guò)程可建模成時(shí)空MDP,智能體位于時(shí)變通信網(wǎng)絡(luò)的節(jié)點(diǎn)上。每個(gè)智能體基于其本地觀測(cè)和相連的臨近智能體提供的信息來(lái)學(xué)習(xí)分散的控制策略,并獲得局部獎(jiǎng)勵(lì)。擬合Q迭代(fitted Q iteration, FQI)[148]方法采用神經(jīng)網(wǎng)絡(luò)擬合Q值函數(shù);分布式非精確梯度(distributed inexact gradient, DIGing)[149]方法是基于時(shí)變圖拓?fù)涞姆植际絻?yōu)化方法;多智能體AC(multiagent AC, MAAC)[150]方法是基于AC算法提出來(lái)的,每個(gè)智能體都有自己獨(dú)立的actor網(wǎng)絡(luò)和critic網(wǎng)絡(luò),每個(gè)智能體都可以獨(dú)立決策并接收局部獎(jiǎng)勵(lì),同時(shí)在網(wǎng)絡(luò)上與臨近智能體交換信息以得到最佳的全網(wǎng)絡(luò)平均回報(bào),該方法提供了收斂性的保證。由于多智能體帶來(lái)的維數(shù)詛咒和解概念難計(jì)算等問(wèn)題,使得該類問(wèn)題很具有挑戰(zhàn)性;擴(kuò)展AC(scalable AC, SAC)[151]方法是一種可擴(kuò)展的AC方法,可以學(xué)習(xí)一種近似最優(yōu)的局部策略來(lái)優(yōu)化平均獎(jiǎng)勵(lì),其復(fù)雜性隨局部智能體(而不是整個(gè)網(wǎng)絡(luò))的狀態(tài)-行動(dòng)空間大小而變化。神經(jīng)通信(neural communication, NeurComm)[152]是一種可分解通信協(xié)議,可以自適應(yīng)地共享系統(tǒng)狀態(tài)和智能體行為的信息,該算法的提出是為了減少學(xué)習(xí)中的信息損失和解決非平穩(wěn)性問(wèn)題,為設(shè)計(jì)自適應(yīng)和高效的通信學(xué)習(xí)方法提供了支撐。近似多智能體擬合Q迭代(approximate multiagent fitted Q iteration, AMAFQI)[153]是一種多智能體批強(qiáng)化學(xué)習(xí)的有效逼近方法,其提出的迭代策略搜索對(duì)集中式標(biāo)準(zhǔn)Q函數(shù)的多個(gè)近似產(chǎn)生貪婪策略。

    圍繞聯(lián)網(wǎng)條件下合作性或競(jìng)爭(zhēng)性多智能體強(qiáng)化學(xué)習(xí)問(wèn)題,Zhang等[154]提出了利用值函數(shù)近似的分散式擬合Q迭代方法,合作場(chǎng)景中聯(lián)網(wǎng)智能體團(tuán)隊(duì)以最大化所有智能體獲得的累積折扣獎(jiǎng)勵(lì)的全局平均值為目標(biāo),對(duì)抗場(chǎng)景中兩個(gè)聯(lián)網(wǎng)團(tuán)隊(duì)以零和博弈的納什均衡為目標(biāo)。

    3.1.2 擴(kuò)展式博弈策略學(xué)習(xí)方法

    對(duì)于完美信息的擴(kuò)展式博弈,可以通過(guò)線性規(guī)劃等組合優(yōu)化方法來(lái)求解。近年來(lái),由于計(jì)算博弈論在非完美信息博弈領(lǐng)域取得的突破,基于后悔值的方法得到廣泛關(guān)注。當(dāng)前,面向納什均衡、相關(guān)均衡、粗相關(guān)均衡、擴(kuò)展形式相關(guān)均衡的相關(guān)求解方法如表5所示。其中,面向兩人零和博弈的組合優(yōu)化方法主要有線性規(guī)劃(linear programming, LP)[155]、過(guò)大間隙技術(shù)(excessive gap technique, EGT)[156]、鏡像近似(mirror prox, MP)[157]、投影次梯度下降(projected subgradient descent, PSD)[158]、可利用性下降(exploitability descent, ED)[159]等方法,后悔值最小化方法主要有后悔值匹配[160]、CFR[161]、Hedge[162]、乘性權(quán)重更新(multiplicative weight update, MWU)[163]、Hart后悔值匹配[164]等方法。面向兩人一般和博弈的組合優(yōu)化方法主要有Lemke-Howson[165]、支撐集枚舉混合整數(shù)線性規(guī)劃(support enumeration mixed-integer linear programming, SEMILP)[166]、混合方法[167]、列生成[168]和線性規(guī)劃方法[169],后悔值最小化方法主要有縮放延拓后悔最小化(scaled extension regret minimizer, SERM)[170]。面向多人一般和博弈的組合優(yōu)化方法主要有列生成方法[168]、反希望橢球法(ellipsoid against hope, EAH)[171],后悔值最小化方法主要有后悔值測(cè)試方法、基于采樣的CFR法(CFR-S)[172]和基于聯(lián)合策略重構(gòu)的CFR法(CFR-Jr)[172]等。基于后悔值的方法,其收斂速度一般為O(T^{-1/2}),一些研究借助在線凸優(yōu)化技術(shù)將收斂速度提升到O(T^{-3/4}),這類優(yōu)化方法,特別是一些加速一階優(yōu)化方法,理論上可以比后悔值方法更快收斂,但實(shí)際應(yīng)用中效果并不理想。
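    后悔值匹配是CFR類方法的基本構(gòu)件,下面給出其在正則式博弈(石頭剪刀布)中自對(duì)弈的極簡(jiǎn)示意(假設(shè)性實(shí)現(xiàn),迭代次數(shù)為示例取值):雙方的平均策略收斂于近似納什均衡。

```python
import numpy as np

A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])           # 石頭剪刀布:行玩家收益,零和

def regret_matching_selfplay(T=20000):
    """雙方均使用后悔值匹配自對(duì)弈,返回行玩家的平均策略。"""
    regret = [np.zeros(3), np.zeros(3)]
    strat_sum = [np.zeros(3), np.zeros(3)]
    for _ in range(T):
        strats = []
        for r in regret:
            pos = np.maximum(r, 0.0)       # 由累積正后悔值歸一化得到當(dāng)前策略
            strats.append(pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3)
        u0 = A @ strats[1]                 # 行玩家各動(dòng)作的期望收益
        u1 = -A.T @ strats[0]              # 列玩家各動(dòng)作的期望收益(零和)
        regret[0] += u0 - strats[0] @ u0   # 累積后悔值
        regret[1] += u1 - strats[1] @ u1
        strat_sum[0] += strats[0]
        strat_sum[1] += strats[1]
    return strat_sum[0] / T

print(regret_matching_selfplay())          # 接近納什均衡 (1/3, 1/3, 1/3)
```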

    在求解大規(guī)模非完全信息兩人零和擴(kuò)展式博弈問(wèn)題中,算法博弈論方法與深度強(qiáng)化學(xué)習(xí)方法成效顯著,形成了以Pluribus、DeepStack等為代表的高水平德州撲克人工智能,在人機(jī)對(duì)抗中超越人類職業(yè)選手水平。其中,CFR類方法通過(guò)計(jì)算累計(jì)后悔值并依據(jù)后悔值匹配方法更新策略。深度強(qiáng)化學(xué)習(xí)類方法通過(guò)學(xué)習(xí)信息集上的值函數(shù)來(lái)更新博弈策略并收斂于近似納什均衡。近年來(lái),一些研究利用Blackwell近似理論[175],構(gòu)建起了在線凸優(yōu)化類方法與后悔值類方法之間的橋梁,F(xiàn)arina等人[176]證明了后悔值最小化(RM)及其變體RM+分別與跟隨正則化領(lǐng)先者和在線鏡像下降等價(jià),收斂速度為O(T)。此外,一些研究表明后悔值與強(qiáng)化學(xué)習(xí)中的優(yōu)勢(shì)函數(shù)等價(jià)[177],現(xiàn)有強(qiáng)化學(xué)習(xí)方法通過(guò)引入“后悔值”概念或后悔值匹配更新方法,形成不同的強(qiáng)化學(xué)習(xí)類方法,在提高收斂速率的同時(shí),使得CFR方法的泛化性更強(qiáng)。三大類方法的緊密聯(lián)系為求解大規(guī)模兩人零和非完美信息博弈提供了新方向和新思路。非完美信息博弈求解方法主要有表格式、采樣類、函數(shù)近似和神經(jīng)網(wǎng)絡(luò)等CFR類方法,優(yōu)化方法和強(qiáng)化學(xué)習(xí)類方法,如表6所示。

    基礎(chǔ)的表格類CFR方法受限于后悔值和平均策略的存儲(chǔ)空間限制,只能求解狀態(tài)空間約為10^14的博弈問(wèn)題。CFR與抽象、剪枝、采樣、函數(shù)近似、神經(jīng)網(wǎng)絡(luò)估計(jì)等方法結(jié)合,衍生出一系列CFR類方法,試圖從加速收斂、減少內(nèi)存占用、縮減博弈樹規(guī)模等方面入手,為快速求解近似納什均衡解提供有效支撐。采樣類CFR方法中蒙特卡羅采樣是主流方法,蒙特卡羅CFR(Monte Carlo CFR, MCCFR)通過(guò)構(gòu)建生成式對(duì)手,大幅降低迭代時(shí)間、加快收斂速度。此外,并行計(jì)算、小批次、方差約減等技術(shù)被用于約束累積方差,如圖12所示,各類方法的采樣方式呈現(xiàn)出不同形態(tài)。

    函數(shù)近似與神經(jīng)網(wǎng)絡(luò)類CFR方法主要采用擬合的方法估計(jì)反事實(shí)后悔值、累積后悔值,求解當(dāng)前策略或平均策略,相較于表格類方法泛化性更強(qiáng)。優(yōu)化方法有效利用了數(shù)學(xué)優(yōu)化類工具,將非完美信息博弈問(wèn)題構(gòu)建成雙線性鞍點(diǎn)問(wèn)題,充分利用離線生成函數(shù)、在線凸優(yōu)化方法、梯度估計(jì)與策略探索等方法,在小規(guī)模博弈上收斂速度快,但無(wú)法適應(yīng)空間大的博弈求解,應(yīng)用場(chǎng)景受限。傳統(tǒng)的強(qiáng)化學(xué)習(xí)方法主要是利用自對(duì)弈的方式生成對(duì)戰(zhàn)經(jīng)驗(yàn)數(shù)據(jù)集,進(jìn)而學(xué)習(xí)魯棒的應(yīng)對(duì)策略;新型的強(qiáng)化學(xué)習(xí)方法將后悔值及可利用性作為強(qiáng)化學(xué)習(xí)的目標(biāo)函數(shù),面向大型博弈空間。由于策略空間的非傳遞性屬性和對(duì)手適變的非平穩(wěn)策略,兩類方法均面臨探索與利用難題。當(dāng)前,以CFR為代表的算法博弈論方法已經(jīng)取得了突破,優(yōu)化方法及強(qiáng)化學(xué)習(xí)方法的融合為設(shè)計(jì)更具泛化能力的方法提供了可能。

    對(duì)于多人博弈,一類對(duì)抗團(tuán)隊(duì)博弈[209]模型得到了廣泛研究,其中團(tuán)隊(duì)最大最小均衡(team-maxmin equilibrium, TME)描述了一個(gè)擁有相同效用的團(tuán)隊(duì)與一個(gè)對(duì)手博弈對(duì)抗的解概念。針對(duì)智能體之間有無(wú)通信、有無(wú)事先通信、可否事中通信等情形,近年來(lái)的一些研究探索了相關(guān)解概念,如相關(guān)TME(correlated TME, CTME)、帶協(xié)同設(shè)備的TME(TME with coordination device, TMECor)、帶通信設(shè)備的TME(TME with communication device, TMECom),以及相關(guān)均衡求解方法,如增量策略生成[210],其本質(zhì)是一類雙重預(yù)言機(jī)(double oracle, DO)方法,如表7所示。Zhang等結(jié)合網(wǎng)絡(luò)阻斷應(yīng)用場(chǎng)景設(shè)計(jì)了多種對(duì)抗團(tuán)隊(duì)博弈求解方法[119]。此外,團(tuán)隊(duì)-對(duì)手博弈模型也被用來(lái)建模多對(duì)一的博弈情形。

    3.1.3 元博弈種群策略學(xué)習(xí)方法

    對(duì)于多智能體博弈策略均衡學(xué)習(xí)問(wèn)題,近年來(lái)一些通用的框架相繼被提出,其中基于元博弈理論的學(xué)習(xí)框架為多智能體博弈策略的學(xué)習(xí)提供了指引。由于問(wèn)題的復(fù)雜性,多智能體博弈策略學(xué)習(xí)表現(xiàn)出基礎(chǔ)策略可以通過(guò)強(qiáng)化學(xué)習(xí)等方法很快生成,而較優(yōu)策略依靠在已生成的策略池中緩慢迭代產(chǎn)生。當(dāng)前,由強(qiáng)化學(xué)習(xí)支撐的策略快速生成“內(nèi)環(huán)學(xué)習(xí)器”和演化博弈理論支撐的種群策略緩慢迭代“外環(huán)學(xué)習(xí)器”組合成的“快與慢”雙環(huán)優(yōu)化方法[214],為多智能體博弈策略學(xué)習(xí)提供了基本參考框架。Lanctot等人[215]提出了面向多智能體強(qiáng)化學(xué)習(xí)的策略空間響應(yīng)預(yù)言機(jī)(policy space response oracle, PSRO)統(tǒng)一博弈學(xué)習(xí)框架,成功將雙重預(yù)言機(jī)這類迭代式增量式策略生成方法擴(kuò)展成基于元博弈的種群策略學(xué)習(xí)方法,其過(guò)程本質(zhì)上由“挑戰(zhàn)對(duì)手”和“響應(yīng)對(duì)手”兩個(gè)步驟組成。為了應(yīng)對(duì)一般和博弈,Muller等人[216]提出了基于α-排名和PSRO的通用學(xué)習(xí)方法框架。Sun等人[217]面向競(jìng)爭(zhēng)式自對(duì)弈多智能體強(qiáng)化學(xué)習(xí),提出了分布式聯(lián)賽學(xué)習(xí)框架TLeague,可以云服務(wù)架構(gòu)組織多智能體博弈策略學(xué)習(xí)。Zhou等人[218]基于種群多智能體強(qiáng)化學(xué)習(xí)提出了融合策略評(píng)估的多智能體庫(kù)(multi-agent library, MALib)并行學(xué)習(xí)框架。當(dāng)前多智能體博弈策略學(xué)習(xí)主要是通過(guò)算法驅(qū)動(dòng)仿真器快速生成博弈對(duì)抗樣本,如圖13所示,得到收益張量M,元博弈求解器計(jì)算策略組合分布,進(jìn)而輔助挑選下一輪對(duì)戰(zhàn)對(duì)手(末輪單個(gè)、最強(qiáng)k個(gè)、均勻采樣等),預(yù)言機(jī)主要負(fù)責(zé)生成最佳響應(yīng),為智能體的策略空間增加新策略。
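    下面給出雙重預(yù)言機(jī)/PSRO“挑戰(zhàn)對(duì)手—響應(yīng)對(duì)手”迭代過(guò)程在對(duì)稱零和矩陣博弈上的極簡(jiǎn)示意(假設(shè)性實(shí)現(xiàn)):元博弈求解器此處以虛擬對(duì)弈近似受限博弈的均衡分布,預(yù)言機(jī)以精確最佳響應(yīng)代替實(shí)際中的強(qiáng)化學(xué)習(xí)訓(xùn)練,收益矩陣為隨機(jī)生成的示例。

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 30))
M = M - M.T                                       # 對(duì)稱零和博弈的收益矩陣(行玩家視角)

def meta_nash(sub_M, iters=2000):
    """元博弈求解器:用虛擬對(duì)弈近似受限博弈(當(dāng)前策略種群)上的均衡分布。"""
    n = sub_M.shape[0]
    counts = np.ones(n)
    for _ in range(iters):
        p = counts / counts.sum()
        counts[np.argmax(sub_M @ p)] += 1         # 對(duì)經(jīng)驗(yàn)分布的最佳響應(yīng)計(jì)數(shù)加一
    return counts / counts.sum()

population = [0]                                  # 初始策略種群(以純策略索引表示)
for _ in range(10):                               # 外環(huán):種群迭代
    sub_M = M[np.ix_(population, population)]
    sigma = meta_nash(sub_M)                      # 元博弈分布,用于“挑戰(zhàn)對(duì)手”
    # 預(yù)言機(jī):在完整博弈中對(duì)種群混合策略求最佳響應(yīng)(實(shí)際中常以強(qiáng)化學(xué)習(xí)近似)
    mix_payoff = M[:, population] @ sigma
    br = int(np.argmax(mix_payoff))
    if br in population:                          # 無(wú)新最佳響應(yīng)則終止
        break
    population.append(br)

print(sorted(population))                         # 最終支撐種群的策略索引
```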

    (1) 策略評(píng)估方法

    多智能體博弈對(duì)抗過(guò)程中,由基礎(chǔ)“內(nèi)環(huán)學(xué)習(xí)器”快速生成的智能體模型池里,各類模型的能力水平各不相同,如何評(píng)估其能力用于外層的最優(yōu)博弈策略模型探索可以看作是一個(gè)多智能體交互機(jī)制設(shè)計(jì)問(wèn)題,即如何按能力挑選智能體用于“外環(huán)學(xué)習(xí)器”策略探索。當(dāng)前,衡量博弈策略模型絕對(duì)能力的評(píng)估方法主要有可利用性[219]、方差[220]和保真性[221]等。此外,Cloud等人[220]采用三支分解方法度量智能體的技能、運(yùn)氣與非平穩(wěn)性等。

    衡量相對(duì)能力的評(píng)估方法已經(jīng)成為當(dāng)前的主流[222]。由于博弈策略類型的不同,評(píng)估方法的適用也不盡相同。當(dāng)前策略評(píng)估方法主要分為傳遞性壓制博弈和循環(huán)性壓制博弈策略評(píng)估方法:面向傳遞性壓制博弈的Elo[223]、Glicko[224]和真實(shí)技能[225]方法;面向循環(huán)性壓制博弈的多維Elo(multidimensional Elo, mElo)[75]、納什平均[75]、α-排名[226-227]、響應(yīng)圖上置信界采樣(response graph-upper confidence bound, RG-UCB)[228]、基于信息增益的α排名(α information gain, αIG)[229]、最優(yōu)評(píng)估(optimal evaluation, OptEval)[230]等方法,各類方法相關(guān)特點(diǎn)如表8所示。
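    作為面向傳遞性壓制博弈評(píng)估方法的示意,下面給出Elo評(píng)分基本更新規(guī)則 E_A = 1/(1+10^((R_B−R_A)/400))、R_A ← R_A + K(S_A − E_A) 的簡(jiǎn)單實(shí)現(xiàn)(假設(shè)性示例,K取常用值32)。

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Elo 評(píng)分更新:score_a 取 1(勝)、0.5(平)、0(負(fù)),返回雙方新評(píng)分。"""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))   # A 的期望得分
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1500, 1600, 1.0))    # 低分方戰(zhàn)勝高分方,評(píng)分上升幅度更大
```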

    此外,最新的一些研究對(duì)智能體與任務(wù)的適配度[75]、游戲難度[231]、選手排名[232]、方法的性能[233]、在線評(píng)估[234]、大規(guī)模評(píng)估[235-236]、團(tuán)隊(duì)聚合技能評(píng)估[237]等問(wèn)題展開了探索。通過(guò)策略評(píng)估,可以掌握種群中對(duì)手能力情況及自身能力等級(jí),快速的評(píng)估方法可有效加快多樣性策略的探索速度。

    (2) 策略提升方法

    在“內(nèi)環(huán)學(xué)習(xí)器”完成了智能體博弈策略評(píng)估的基礎(chǔ)上,“外環(huán)學(xué)習(xí)器”需要通過(guò)與不同的“段位”的智能體進(jìn)行對(duì)抗,提升策略水平。傳統(tǒng)自對(duì)弈的方法對(duì)非傳遞壓制性博弈的策略探索作用不明顯。由于問(wèn)題的復(fù)雜性,多智能體博弈策略的迭代提升需要一些新的方法模型,特別是需要能滿足策略提升的種群訓(xùn)練方法。博弈策略提升的主要方法有自對(duì)弈(self-play, SP)[238]、協(xié)同對(duì)弈(co-play, CP)[239]、FSP[240]和種群對(duì)弈(population play, PP)[241]等多類方法,如圖14所示。但各類方法的適用有所區(qū)分,研究表明僅當(dāng)策略探索至種群數(shù)量足夠多、多樣性滿足條件后,這類迭代式學(xué)習(xí)過(guò)程才能產(chǎn)生相變,傳統(tǒng)的自對(duì)弈方法只有當(dāng)策略的“傳遞壓制維”上升到一定段位水平后才可能有作用,否則可能陷入循環(huán)壓制策略輪替生成。

    根據(jù)適用范圍分類,可以將方法劃分成自對(duì)弈、協(xié)同對(duì)弈、虛擬對(duì)弈和種群對(duì)弈共四大類,如表9所示。

    自對(duì)弈類方法主要有樸素自對(duì)弈方法[242],δ-Uniform自對(duì)弈[243]、非對(duì)稱自對(duì)弈[244]、雙重預(yù)言機(jī)[245]、極小極大后悔魯棒預(yù)言機(jī)[246]、無(wú)偏自對(duì)弈[247]等。這類方法主要利用與自身的歷史版本對(duì)抗生成訓(xùn)練樣本,對(duì)樣本的質(zhì)量要求高,適用范圍最小。

    虛擬對(duì)弈類方法主要有虛擬對(duì)弈[248]、虛擬自對(duì)弈[203]、廣義虛擬對(duì)弈[197]、擴(kuò)展虛擬對(duì)弈[249]、平滑虛擬對(duì)弈[250]、隨機(jī)虛擬對(duì)弈[251]、團(tuán)隊(duì)虛擬對(duì)弈[252]、神經(jīng)虛擬自對(duì)弈[253]、蒙特卡羅神經(jīng)虛擬自對(duì)弈[204]、優(yōu)先級(jí)虛擬自對(duì)弈[16]等。這類方法是自對(duì)弈方法的升級(jí)版本,由于樣本空間大,通常會(huì)與采樣或神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)類方法結(jié)合使用,可用于擴(kuò)展式博弈、團(tuán)隊(duì)博弈等場(chǎng)景。其中,AlphaStar采用的聯(lián)賽訓(xùn)練機(jī)制正是優(yōu)先級(jí)虛擬自對(duì)弈方法,智能體策略集中包含三大類:主智能體、主利用者和聯(lián)盟利用者。此外,星際爭(zhēng)霸在優(yōu)先級(jí)虛擬自對(duì)弈的基礎(chǔ)上增加了智能體分支模塊,TStarBot-X采用了多樣化聯(lián)賽訓(xùn)練。

    協(xié)同對(duì)弈方法主要有協(xié)同演化[255]、協(xié)同學(xué)習(xí)[29]等,這類方法主要依賴多個(gè)策略協(xié)同演化生成下一世代的優(yōu)化策略。

    種群對(duì)弈方法主要有種群訓(xùn)練自對(duì)弈[254]、雙重預(yù)言機(jī)-經(jīng)驗(yàn)博弈分析[256]、混合預(yù)言機(jī)/混合對(duì)手[257-258]、PSRO[215]、聯(lián)合PSRO[259]、行列式點(diǎn)過(guò)程PSRO[254]、管線PSRO[260]、在線PSRO[261]和自主PSRO[262]、任意時(shí)間最優(yōu)PSRO[263]、有效PSRO[264]、神經(jīng)種群學(xué)習(xí)[265]等多類方法,這類方法與分布式框架的組合為當(dāng)前絕大部分多智能體博弈問(wèn)題提供了通用解決方案,其關(guān)鍵在于如何提高探索樣本效率,確保快速的內(nèi)環(huán)能有效生成策略樣本,進(jìn)而加快慢外環(huán)的優(yōu)化迭代。

    (3) 自主學(xué)習(xí)方法

    近年來(lái),一些研究試圖從算法框架與分布式計(jì)算框架進(jìn)行創(chuàng)新,借助元學(xué)習(xí)方法,將策略評(píng)估與策略提升方法融合起來(lái)。Feng等人[262]基于元博弈理論、利用元學(xué)習(xí)方法探索了多樣性感知的自主課程學(xué)習(xí)方法,通過(guò)自主發(fā)掘多樣性課程用于難被利用策略的探索。Yang等人[266]指出多樣性自主課程學(xué)習(xí)對(duì)現(xiàn)實(shí)世界里的多智能體學(xué)習(xí)系統(tǒng)非常關(guān)鍵。Wu等人[267]利用元學(xué)習(xí)方法同時(shí)可以生成難被利用和多樣性對(duì)手,引導(dǎo)智能體自身策略迭代提升。Leibo等人[268]研究指出自主課程學(xué)習(xí)是研究多智能體智能的可行方法,課程可由外生和內(nèi)生挑戰(zhàn)自主生成。

    當(dāng)前自主學(xué)習(xí)類方法需要利用多樣性[269]策略來(lái)加速策略空間的探索,其中有質(zhì)多樣性[270]作為一類帕累托框架,因其同時(shí)確保了對(duì)結(jié)果空間的廣泛覆蓋和有效的回報(bào),為平衡處理“探索與利用”問(wèn)題提供了目標(biāo)導(dǎo)向。

    當(dāng)前對(duì)多樣性的研究主要區(qū)分為三大類:行為多樣性[271]、策略多樣性[269]、環(huán)境多樣性[271]。一些研究擬采用矩陣范數(shù)(如L1,1范數(shù)[67]、F范數(shù)和譜范數(shù)[67]、行列式值[254, 272])、有效測(cè)度[272]、最大平均差異[273]、占據(jù)測(cè)度[269]、期望基數(shù)[254]、凸包擴(kuò)張[269]等衡量多樣性,如表10所示。其中,行為多樣性可引導(dǎo)智能體更傾向于采取多樣化的行動(dòng),策略多樣性可引導(dǎo)智能體生成差異化的策略、擴(kuò)大種群規(guī)模、提高探索效率,環(huán)境多樣性可引導(dǎo)智能體適變更多不同的場(chǎng)景,增強(qiáng)智能體的適變能力。
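    以表10中的行列式類度量為例,下面給出由策略特征向量(如對(duì)固定對(duì)手集合的收益向量)構(gòu)造核矩陣、并以其對(duì)數(shù)行列式衡量種群多樣性的簡(jiǎn)單示意(假設(shè)性示例):策略特征越相似,行列式越接近0;越多樣,行列式越大。

```python
import numpy as np

def population_diversity(features):
    """用核矩陣的對(duì)數(shù)行列式衡量種群多樣性:features 每行為一個(gè)策略的特征向量
    (例如對(duì)固定對(duì)手集合的收益向量),行向量越相似,行列式越接近 0。"""
    K = features @ features.T                 # 線性核
    K = K + 1e-6 * np.eye(K.shape[0])         # 數(shù)值穩(wěn)定項(xiàng)
    sign, logdet = np.linalg.slogdet(K)
    return logdet

similar = np.array([[1.00, 0.00, 0.00],
                    [0.99, 0.01, 0.00],
                    [0.98, 0.02, 0.00]])      # 三個(gè)幾乎相同的策略
diverse = np.eye(3)                           # 三個(gè)彼此正交的策略
print(population_diversity(similar), population_diversity(diverse))  # 前者遠(yuǎn)小于后者
```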

    3.2 在線場(chǎng)景博弈策略學(xué)習(xí)方法

    由離線學(xué)習(xí)得到的博弈策略通常被稱作藍(lán)圖策略。在線對(duì)抗過(guò)程中,可完全依托離線藍(lán)圖策略進(jìn)行在線微調(diào),如即時(shí)策略游戲中依據(jù)情境元博弈選擇對(duì)抗策略[275]。棋牌類游戲中可以用兩種方式生成己方策略:一是從悲觀視角出發(fā)的博弈最優(yōu),即采用離線藍(lán)圖策略進(jìn)行對(duì)抗;二是從樂(lè)觀視角出發(fā)的剝削式對(duì)弈,即在線發(fā)掘?qū)κ挚赡艿娜觞c(diǎn),以最大化己方收益的方式利用對(duì)手。正是由于難以應(yīng)對(duì)非平穩(wěn)對(duì)手的策略動(dòng)態(tài)切換[276]、故意隱藏或欺騙,在線博弈過(guò)程中通常需要及時(shí)根據(jù)對(duì)手表現(xiàn)和所處情境進(jìn)行適應(yīng)性調(diào)整,其本質(zhì)是一個(gè)對(duì)手意圖識(shí)別與反制策略生成[275]問(wèn)題。當(dāng)前在線博弈策略學(xué)習(xí)的研究主要包括學(xué)會(huì)控制后悔值[277]、對(duì)手建模與利用[35]、智能體匹配及協(xié)作[278]。

    3.2.1 在線優(yōu)化與無(wú)悔學(xué)習(xí)

    在線決策過(guò)程的建模方法主要有在線MDP[279]、對(duì)抗MDP[280]、未知部分可觀MDP[281]、未知MG[282]等。在線優(yōu)化與無(wú)悔學(xué)習(xí)方法的融合是在線博弈策略學(xué)習(xí)的重點(diǎn)研究方向,其中無(wú)悔本是指隨著交互時(shí)長(zhǎng)趨近無(wú)窮大,平均后悔值呈亞線性遞減,即滿足O(T^{-1/2})。傳統(tǒng)的無(wú)悔學(xué)習(xí)方法主要依賴Hedge[162]和MWU[163]等,近來(lái)的一些研究利用在線凸優(yōu)化方法設(shè)計(jì)了基于跟隨正則化領(lǐng)先者[176]和在線鏡像下降[200]等樂(lè)觀后悔最小化算法。

    Dinh等人[261]利用Hedge方法和策略支撐集數(shù)量約束,證明了在線動(dòng)態(tài)后悔值的有界性。Kash等人[283]將無(wú)悔學(xué)習(xí)與Q值函數(shù)結(jié)合設(shè)計(jì)了一種局部無(wú)悔學(xué)習(xí)方法,無(wú)需考慮智能體的完美回憶條件仍可收斂。Lin[284]和Lee[285]等人對(duì)無(wú)悔學(xué)習(xí)的有限時(shí)間末輪迭代收斂問(wèn)題展開了研究,表明附加正則化項(xiàng)的樂(lè)觀后悔值最小化方法收斂速度更快。Daskalakis等人[286]研究了幾類面向一般和博弈的近似最優(yōu)無(wú)悔學(xué)習(xí)方法的后悔界。此外,事后理性[287]作為一個(gè)與后悔值等效的可替代學(xué)習(xí)目標(biāo),可用于引導(dǎo)在線學(xué)習(xí)與其他智能體關(guān)聯(lián)的最佳策略。

    3.2.2 對(duì)手建模與利用方法

    通過(guò)對(duì)手建??梢院侠淼仡A(yù)測(cè)對(duì)手的行動(dòng)、發(fā)掘?qū)κ值娜觞c(diǎn)以備利用。當(dāng)前,對(duì)手建模方法主要分兩大類:與博弈領(lǐng)域知識(shí)關(guān)聯(lián)比較密切的顯式建模方法和面向策略的隱式建模方法。面向在線策略學(xué)習(xí)的對(duì)手利用方法主要有以下三大類。

    (1) 對(duì)手判別式適變方法

    Li[288]提出利用模式識(shí)別樹顯式地構(gòu)建對(duì)手模型,估計(jì)對(duì)手策略與贏率,進(jìn)而生成己方反制策略。Ganzfried等人[289]設(shè)計(jì)了機(jī)會(huì)發(fā)掘方法,試圖利用對(duì)手暴露的弱點(diǎn)。Davis等人[290]通過(guò)估計(jì)對(duì)手信息,構(gòu)建限定性條件,加快約束策略生成。

    (2) 對(duì)手近似式學(xué)習(xí)方法

    Wu等人[267]利用元學(xué)習(xí)生成難被剝削對(duì)手和多樣性對(duì)手模型池來(lái)指引在線博弈策略學(xué)習(xí)。Kim等人[291]利用對(duì)手建模與元學(xué)習(xí)設(shè)計(jì)了面向多智能體的元策略優(yōu)化方法。Foerster等人[107]設(shè)計(jì)的對(duì)手察覺學(xué)習(xí)方法是一類考慮將對(duì)手納入己方策略學(xué)習(xí)過(guò)程中的學(xué)習(xí)方法。Silva等人[292]提出的在線自對(duì)弈課程方法通過(guò)在線構(gòu)建對(duì)抗課程引導(dǎo)博弈策略學(xué)習(xí)。

    (3) 對(duì)手生成式搜索方法

    Ganzfried等人[289]提出基于狄利克雷先驗(yàn)對(duì)手模型,利用貝葉斯優(yōu)化模型獲得對(duì)手模型的后驗(yàn)分布,輔助利用對(duì)手的反制策略生成。Sustr等人[293]提出利用基于信息集蒙特卡羅采樣的蒙特卡羅重解法生成反制策略。Brown等人[294]提出在對(duì)手建模時(shí)要平衡安全與可利用性,基于安全嵌套有限深度搜索的方法可以生成安全對(duì)手利用的反制策略。Tian[295]提出利用狄利克雷先驗(yàn), 基于餐館過(guò)程在博弈策略空間中生成安全利用對(duì)手的反制策略。

    3.2.3 角色匹配與臨機(jī)協(xié)調(diào)

    多智能體博弈通常是在多角色協(xié)調(diào)配合下完成的,通常同類角色可執(zhí)行相似的任務(wù),各類智能體之間的臨機(jī)協(xié)調(diào)是博弈對(duì)抗致勝的關(guān)鍵。Wang等人[296]設(shè)計(jì)了面向多類角色的多智能體強(qiáng)化學(xué)習(xí)框架,通過(guò)構(gòu)建一個(gè)隨機(jī)角色嵌入空間,可以學(xué)習(xí)特定角色、動(dòng)態(tài)角色和可分辨角色,相近角色的單元完成相似任務(wù),加快空間劃分與環(huán)境高效探索。Gong等人[297]利用角色(英雄及玩家)向量化方法分析了英雄之間的高階交互情況,圖嵌入的方式分析了協(xié)同與壓制關(guān)系,研究了多智能體匹配在線規(guī)劃問(wèn)題。

    臨機(jī)組隊(duì)可以看作是一個(gè)機(jī)制設(shè)計(jì)問(wèn)題[278]。Hu等人[298]提出了智能體首次合作的零樣本協(xié)調(diào)問(wèn)題,利用其對(duì)弈[299]方法(即基于學(xué)習(xí)的人工智能組隊(duì)方法)為無(wú)預(yù)先溝通的多智能體協(xié)調(diào)學(xué)習(xí)提供了有效支撐。此外,人與人工智能組隊(duì)作為臨機(jī)組隊(duì)問(wèn)題的子問(wèn)題,要求人工智能在不需要預(yù)先協(xié)調(diào)下可與人在線協(xié)同。Lucero等人[300]利用StarCraft平臺(tái)研究了如何利用人機(jī)組隊(duì)和可解釋人工智能技術(shù)幫助玩家理解系統(tǒng)推薦的行動(dòng)。Waytowich等人[301]研究了如何運(yùn)用自然語(yǔ)言指令驅(qū)動(dòng)智能體學(xué)習(xí),基于語(yǔ)言指令與狀態(tài)的互嵌入模型實(shí)現(xiàn)了人在環(huán)路強(qiáng)化學(xué)習(xí)方法的設(shè)計(jì)。Siu等人[302]利用一類合作博弈平臺(tái)Hanabi評(píng)估了各類人與人工智能組隊(duì)方法的效果。

    4 多智能體博弈學(xué)習(xí)前沿展望

    4.1 智能體認(rèn)知行為建模與協(xié)同

    4.1.1 多模態(tài)行為建模

    構(gòu)建智能體的認(rèn)知行為模型為一般性問(wèn)題提供求解方法,是獲得通用人工智能的一種探索。各類認(rèn)知行為模型框架[303]為智能體獲取知識(shí)提供了接口。對(duì)抗環(huán)境下,智能體的認(rèn)知能力主要包含博弈推理與反制策略生成[195]、對(duì)抗推理與對(duì)抗規(guī)劃[304]。認(rèn)知行為建??蔀榉治鰧?duì)手思維過(guò)程、決策行動(dòng)的動(dòng)態(tài)演化、欺騙與反欺騙等認(rèn)知對(duì)抗問(wèn)題提供支撐。智能體行為的多模態(tài)屬性[305],如合作場(chǎng)景下行為的“解釋性、明確性、透明性和預(yù)測(cè)性”,對(duì)抗場(chǎng)景下行為的“欺騙性、混淆性、含糊性、隱私性和安全性”,均是欺騙性和可解釋性認(rèn)知行為建模的重要研究?jī)?nèi)容,相關(guān)技術(shù)可應(yīng)用于智能人機(jī)交互、機(jī)器推理、協(xié)同規(guī)劃、具人類意識(shí)智能系統(tǒng)等領(lǐng)域問(wèn)題的求解。

    4.1.2 對(duì)手推理與適變

    傳統(tǒng)的對(duì)手建模方法一般會(huì)假設(shè)對(duì)手策略平穩(wěn)不變、固定策略動(dòng)態(tài)切換等簡(jiǎn)單情形,但對(duì)手建模仍面臨對(duì)手策略非平穩(wěn)、風(fēng)格驟變、對(duì)抗學(xué)習(xí)、有限理性、有限記憶、欺騙與詐唬等挑戰(zhàn)。當(dāng)前,具對(duì)手意識(shí)的學(xué)習(xí)[107]、基于心智理論(認(rèn)知層次理論)的遞歸推理[105]和基于策略蒸餾和修正信念的貝葉斯策略重用[306]等方法將對(duì)手推理模板嵌入對(duì)手建模流程中,可有效應(yīng)對(duì)非平穩(wěn)對(duì)手。此外,在線博弈對(duì)抗過(guò)程中,公共知識(shí)與完全理性等條件均可能無(wú)法滿足,對(duì)手缺點(diǎn)的暴露強(qiáng)化了智能體偏離均衡解的動(dòng)機(jī),基于納什均衡解采用安全適變策略可有效剝削對(duì)手[294]且不易被發(fā)覺[307]。

    4.1.3 人在環(huán)路協(xié)同

    “人機(jī)對(duì)抗”是當(dāng)前檢驗(yàn)人工智能(AI)的主流評(píng)測(cè)方式,而“人機(jī)協(xié)同”是人機(jī)混合智能的主要研究?jī)?nèi)容。人與AI的協(xié)同可區(qū)分為人在環(huán)路內(nèi)、人在環(huán)路上和人在環(huán)路外共3種模式,其中人在環(huán)路上(人可參與干預(yù),也可旁觀監(jiān)督)的相關(guān)研究是當(dāng)前的研究重點(diǎn),特別是基于自然語(yǔ)言指令的相關(guān)研究為人機(jī)交互預(yù)留了更為自然的交互方式[301]。此外,圍繞“人(博弈局中人)—機(jī)(機(jī)器人工智能)—環(huán)(博弈對(duì)抗環(huán)境)”協(xié)同演化的相關(guān)研究表明,人機(jī)協(xié)同面臨著應(yīng)用悖論:人機(jī)組隊(duì)后的能力將遠(yuǎn)超人類或機(jī)器,但過(guò)度依賴人工智能將會(huì)使人類的技能退化;若盲目樂(lè)觀地應(yīng)用、忽視缺陷和漏洞,對(duì)抗中被欺騙可致決策錯(cuò)誤,推薦的行動(dòng)方案也會(huì)受到質(zhì)疑,在某些人道主義應(yīng)用場(chǎng)景中還可能面臨倫理挑戰(zhàn)。

    4.2 通用博弈策略學(xué)習(xí)方法

    4.2.1 大規(guī)模智能體學(xué)習(xí)方法

    當(dāng)前多智能體博弈的相關(guān)研究正向多智能體集群對(duì)抗、異構(gòu)集群協(xié)同等高復(fù)雜現(xiàn)實(shí)及通用博弈場(chǎng)景聚焦。隨著智能體數(shù)量規(guī)模的增加,行動(dòng)和狀態(tài)空間將呈指數(shù)級(jí)增長(zhǎng),在很大程度上限制了多智能體學(xué)習(xí)方法的可擴(kuò)展性。傳統(tǒng)的博弈抽象[308]、狀態(tài)及行動(dòng)抽象[309]方法雖然可以對(duì)問(wèn)題空間做有效約減,但問(wèn)題的復(fù)雜度依然很高,在智能體數(shù)目N≥2時(shí),納什均衡通常很難計(jì)算,多人博弈均衡解的存在性和求解依然充滿挑戰(zhàn)。Yang等人[310]根據(jù)平均場(chǎng)思想提出的平均場(chǎng)Q學(xué)習(xí)和平均場(chǎng)AC方法,為解決大規(guī)模智能體學(xué)習(xí)問(wèn)題提供了參考。

    4.2.2 雙層優(yōu)化自對(duì)弈方法

    博弈策略學(xué)習(xí)的范式正從傳統(tǒng)的“高質(zhì)量樣本模仿學(xué)習(xí)+分布式強(qiáng)化學(xué)習(xí)”向“無(wú)先驗(yàn)知識(shí)+端到端競(jìng)爭(zhēng)式自對(duì)弈學(xué)習(xí)”轉(zhuǎn)變。此前,Muller等人[216]提出的α-Rank和PSRO學(xué)習(xí)方法是一類元博弈種群策略學(xué)習(xí)通用框架方法。Leibo等人[268]從“問(wèn)題的問(wèn)題”視角提出了面向多智能體的“自主課程學(xué)習(xí)”方法。傳統(tǒng)的強(qiáng)化學(xué)習(xí)和算法博弈論方法是多智能體博弈策略學(xué)習(xí)方法的通用基礎(chǔ)學(xué)習(xí)器,基于“快與慢”理念的雙層優(yōu)化類方法[311],其中元學(xué)習(xí)[262]、自主課程學(xué)習(xí)[268]和元演化學(xué)習(xí)[312]、支持并行分布式計(jì)算的無(wú)導(dǎo)數(shù)演化策略學(xué)習(xí)方法[313]、面向連續(xù)博弈的策略梯度優(yōu)化方法[314]、面向非平穩(wěn)環(huán)境的持續(xù)學(xué)習(xí)方法[315]、由易到難的自步學(xué)習(xí)方法[316]為自主策略探索學(xué)習(xí)程序算法設(shè)計(jì)提供了指引。

    4.2.3 知識(shí)與數(shù)據(jù)融合方法

    基于常識(shí)知識(shí)與領(lǐng)域?qū)<一驅(qū)I(yè)人類玩家經(jīng)驗(yàn)的知識(shí)驅(qū)動(dòng)型智能體策略具有較強(qiáng)的可解釋性,而基于大樣本采樣和神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)的數(shù)據(jù)驅(qū)動(dòng)型智能體策略通常具有很強(qiáng)的泛化性。相關(guān)研究從加性融合與主從融合[317]、知識(shí)牽引與數(shù)據(jù)驅(qū)動(dòng)[318]、層次化協(xié)同與組件化協(xié)同[319]等角度進(jìn)行了探索。此外,張馭龍等人[320]面向任務(wù)級(jí)兵棋提出了多智能體策略協(xié)同演進(jìn)框架,打通人類專家與智能算法之間的知識(shí)循環(huán)。

    4.2.4 離線預(yù)訓(xùn)練與在線微調(diào)方法

    基于海量數(shù)據(jù)樣本的大型預(yù)訓(xùn)練模型是通用人工智能的一種探索。相對(duì)于基于藍(lán)圖策略的在線探索方法,基于離線預(yù)訓(xùn)練模型的在線微調(diào)方法有著更廣泛的應(yīng)用前景。近來(lái),基于序貫決策Transformer[321]的離線[322]與在線[323]學(xué)習(xí)方法將注意力機(jī)制與強(qiáng)化學(xué)習(xí)方法融合,為大型預(yù)訓(xùn)練模型生成提供了思路,來(lái)自DeepMind的Mathieu等人[324]設(shè)計(jì)了面向星際爭(zhēng)霸的超大型離線強(qiáng)化學(xué)習(xí)模型。

    4.3 分布式博弈策略學(xué)習(xí)框架

    4.3.1 多智能體博弈基準(zhǔn)環(huán)境

    當(dāng)前,大多數(shù)博弈對(duì)抗平臺(tái)采用了游戲設(shè)計(jì)的思想,將玩家的參與度作為設(shè)計(jì)目標(biāo),通常會(huì)為了游戲的平衡性,將對(duì)抗多方的能力水平設(shè)計(jì)成相對(duì)均衡狀態(tài)(如星際爭(zhēng)霸中3個(gè)種族之間的相對(duì)均衡),這類環(huán)境可看成是近似對(duì)稱類環(huán)境。Hernandez等人[62]利用元博弈研究了競(jìng)爭(zhēng)性多玩家游戲的自平衡問(wèn)題。當(dāng)前,一些研究提供了星際爭(zhēng)霸多智能體挑戰(zhàn)(StarCraft multi-agent challenge, SMAC)[325]、OpenSpiel[326]等基準(zhǔn)環(huán)境,以及PettingZoo[327]、MAVA[328]等集成環(huán)境。兵棋推演作為一類典型的非對(duì)稱部分可觀異步多智能體協(xié)同對(duì)抗環(huán)境[329],紅藍(lán)雙方通常能力差異明顯,模擬真實(shí)環(huán)境的隨機(jī)性使得決策風(fēng)險(xiǎn)高[317],可以作為多智能體博弈學(xué)習(xí)的基準(zhǔn)測(cè)試環(huán)境。

    4.3.2 分布式強(qiáng)化學(xué)習(xí)框架

    由于學(xué)習(xí)類方法本質(zhì)上采用了試錯(cuò)機(jī)制,需要并行采樣大量多樣化樣本提升訓(xùn)練質(zhì)量,需要依賴強(qiáng)大的計(jì)算資源?;趩l(fā)式聯(lián)賽訓(xùn)練的AlphaStar需要訓(xùn)練多個(gè)種群才能有效引導(dǎo)策略提升、算法收斂;基于博弈分解的Pluribus,其藍(lán)圖策略的離線訓(xùn)練需要依靠超級(jí)計(jì)算機(jī)集群。當(dāng)前的一些研究提出利用Ray[330]、可擴(kuò)展高效深度強(qiáng)化學(xué)習(xí)(scalable efficient deep reinforcement learning, SEED)[331]、Flatland[332]等分布式強(qiáng)化學(xué)習(xí)框架。

    4.3.3 元博弈種群策略學(xué)習(xí)框架

    元博弈種群策略學(xué)習(xí)框架的設(shè)計(jì)需要將種群策略演化機(jī)制設(shè)計(jì)與分布式計(jì)算平臺(tái)資源調(diào)度協(xié)同考慮。當(dāng)前絕大多數(shù)機(jī)器博弈人工智能的實(shí)現(xiàn)均需要依靠強(qiáng)大的分布式算力支撐?;谠┺牡姆N群演化自主學(xué)習(xí)方法與分布式學(xué)習(xí)框架的結(jié)合可用于構(gòu)建通用的博弈策略學(xué)習(xí)框架。當(dāng)前,基于競(jìng)爭(zhēng)式自對(duì)弈的TLeague[333]和整體設(shè)計(jì)了策略評(píng)估的MALib[334]等為種群策略學(xué)習(xí)提供了分布式并行學(xué)習(xí)框架支撐。

    5 結(jié)束語(yǔ)

    本文從博弈論視角,分析了多智能體學(xué)習(xí)。首先,簡(jiǎn)要介紹了多智能體學(xué)習(xí),主要包括多智能體學(xué)習(xí)系統(tǒng)組成、概述、研究方法分類。其次,重點(diǎn)介紹了多智能體博弈學(xué)習(xí)框架,包括基礎(chǔ)模型和元博弈模型、博弈解概念及博弈動(dòng)力學(xué),多智能體博弈學(xué)習(xí)面臨的挑戰(zhàn)。圍繞多智能體博弈策略學(xué)習(xí)方法,重點(diǎn)剖析了策略學(xué)習(xí)框架、離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法。基于梳理的多智能體博弈學(xué)習(xí)方法,指出下一步工作可以著重從“智能體認(rèn)知行為建模、通用博弈策略學(xué)習(xí)方法、分布式策略學(xué)習(xí)框架”等三方面開展多智能體博弈學(xué)習(xí)前沿相關(guān)工作研究。

    參考文獻(xiàn)

    [1] 黃凱奇, 興軍亮, 張俊格, 等. 人機(jī)對(duì)抗智能技術(shù)[J]. 中國(guó)科學(xué): 信息科學(xué), 2020, 50(4): 540-550.

    HUANG K Q, XING J L, ZHANG J G, et al. Intelligent technologies of human-computer gaming[J]. Scientia Sinica Informationis, 2020, 50(4): 540-550.

    [2] 譚鐵牛. 人工智能: 用AI技術(shù)打造智能化未來(lái)[M]. 北京: 中國(guó)科學(xué)技術(shù)出版社, 2019.

    TAN T N. Artificial intelligence: building an intelligent future with AI technologies[M]. Beijing: China Science and Technology Press, 2019.

    [3] WOOLDRIDGE M. An introduction to multiagent systems[M]. Florida: John Wiley & Sons, 2009.

    [4] SHOHAM Y, LEYTON-BROWN K. Multiagent systems-algorithmic, game-theoretic, and logical foundations[M]. New York: Cambridge University Press, 2009.

    [5] MULLER J P, FISCHER K. Application impact of multi-agent systems and technologies: a survey[M]. SHEHORY O, STURM A. Agent-oriented software engineering. Heidelberg: Springer, 2014: 27-53.

    [6] TURING A M. Computing machinery and intelligence[M]. Berlin: Springer, 2009.

    [7] OMIDSHAFIEI S, TUYLS K, CZARNECKI W M, et al. Navigating the landscape of multiplayer games[J]. Nature Communications, 2020, 11(1): 5603.

    [8] TUYLS K, STONE P. Multiagent learning paradigms[C]∥Proc.of the European Conference on Multi-Agent Systems and Agreement Technologies, 2017: 3-21.

    [9] SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-359.

    [10] SCHRITTWIESER J, ANTONOGLOU I, HUBERT T, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model[J]. Nature, 2020, 588(7839): 604-609.

    [11] MORAVCIK M, SCHMID M, BURCH N, et al. DeepStack: expert-level artificial intelligence in heads-up no-limit poker[J]. Science, 2017, 356(6337): 508-513.

    [12] BROWN N, SANDHOLM T. Superhuman AI for multiplayer poker[J]. Science, 2019, 365(6456): 885-890.

    [13] JIANG Q Q, LI K Z, DU B Y, et al. DeltaDou: expert-level Doudizhu AI through self-play[C]∥Proc.of the 28th International Joint Conference on Artificial Intelligence, 2019: 1265-1271.

    [14] ZHAO D C, XIE J R, MA W Y, et al. DouZero: mastering Doudizhu with self-play deep reinforcement learning[C]∥Proc.of the 38th International Conference on Machine Learning, 2021: 12333-12344.

    [15] LI J J, KOYAMADA S, YE Q W, et al. Suphx: mastering mahjong with deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2003.13590.

    [16] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.

    [17] WANG X J, SONG J X, QI P H, et al. SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II[C]∥Proc.of the 38th International Conference on Machine Learning, 2021, 139: 10905-10915.

    [18] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1912.06680.

    [19] YE D H, CHEN G B, ZHAO P L, et al. Supervised learning achieves human-level performance in MOBA games: a case study of honor of kings[J]. IEEE Trans.on Neural Networks and Learning Systems, 2022, 33(3): 908-918.

    [20] 中國(guó)科學(xué)院自動(dòng)化研究所. 人機(jī)對(duì)抗智能技術(shù)[EB/OL]. [2021-08-01]. http:∥turingai.ia.ac.cn/.

    Institute of Automation, Chinese Academy of Science. Intelligent technologies of human-computer gaming[EB/OL]. [2021-08-01]. http:∥turingai.ia.ac.cn/.

    [21] 凡寧, 朱夢(mèng)瑩, 張強(qiáng). 遠(yuǎn)超阿爾法狗?“戰(zhàn)顱”成戰(zhàn)場(chǎng)輔助決策“最強(qiáng)大腦”[EB/OL]. [2021-08-01]. http:∥digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/html/2021-04/19/content_466128.htm?div=-1.

    FAN N, ZHU M Y, ZHANG Q. Way ahead of Alpha Go? “War brain” becomes the “strongest brain” for battlefield decision-making[EB/OL]. [2021-08-01]. http:∥digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/html/2021-04/19/content_466128.htm?div=-1.

    [22] ERNEST N. Genetic fuzzy trees for intelligent control of unmanned combat aerial vehicles[D]. Cincinnati: University of Cincinnati, 2015.

    [23] CLIFF D. Collaborative air combat autonomy program makes strides[J]. Microwave Journal, 2021, 64(5): 43-44.

    [24] STONE P, VELOSO M. Multiagent systems: a survey from a machine learning perspective[J]. Autonomous Robots, 2000, 8(3): 345-383.

    [25] GORDON G J. Agendas for multi-agent learning[J]. Artificial Intelligence, 2007, 171(7): 392-401.

    [26] SHOHAM Y, POWERS R, GRENAGER T. Multi-agent reinforcement learning: a critical survey[R]. San Francisco: Stanford University, 2003.

    [27] SHOHAM Y, POWERS R, GRENAGER T. If multi-agent learning is the answer, what is the question?[J]. Artificial Intelligence, 2006, 171(7): 365-377.

    [28] STONE P. Multiagent learning is not the answer. It is the question[J]. Artificial Intelligence, 2007, 171(7): 402-405.

    [29] TOSIC P, VILALTA R. A unified framework for reinforcement learning, co-learning and meta-learning how to coordinate in collaborative multi-agent systems[J]. Procedia Computer Science, 2010, 1(1): 2217-2226.

    [30] TUYLS K, WEISS G. Multiagent learning: basics, challenges, and prospects[J]. AI Magazine, 2012, 33(3): 41-52.

    [31] KENNEDY J. Swarm intelligence[M]. Handbook of nature-inspired and innovative computing. Boston: Springer, 2006: 187-219.

    [32] TUYLS K, PARSONS S. What evolutionary game theory tells us about multiagent learning[J]. Artificial Intelligence, 2007, 171(7): 406-416.

    [33] SILVA F, COSTA A. Transfer learning for multiagent reinforcement learning systems[C]∥Proc.of the 25th International Joint Conference on Artificial Intelligence, 2016: 3982-3983.

    [34] HERNANDEZ-LEAL P, KAISERS M, BAARSLAG T, et al. A survey of learning in multiagent environments: dealing with non-stationarity[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1707.09183v1.

    [35] ALBRECHT S V, STONE P. Autonomous agents modelling other agents: a comprehensive survey and open problems[J]. Artificial Intelligence, 2018, 258: 66-95.

    [36] JANT H P, TUYLS K, PANAIT L, et al. An overview of cooperative and competitive multiagent learning[C]∥Proc.of the International Workshop on Learning and Adaption in Multi-Agent Systems, 2005.

    [37] PANAIT L, LUKE S. Cooperative multi-agent learning: the state of the art[J]. Autonomous Agents and Multi-Agent Systems, 2005, 11(3): 387-434.

    [38] BUSONIU L, BABUSKA R, SCHUTTER B D. A comprehensive survey of multiagent reinforcement learning[J]. IEEE Trans.on Systems, Man & Cybernetics: Part C, 2008, 38(2): 156-172.

    [39] HERNANDEZ-LEAL P, KARTAL B, TAYLOR M E. A survey and critique of multiagent deep reinforcement learning[J]. Autonomous Agents and Multi-Agent Systems, 2019, 33(6): 750-797.

    [40] OROOJLOOY A, HAJINEZHAD D. A review of cooperative multi-agent deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1908.03963.

    [41] ZHANG K Q, YANG Z R, BAAR T. Multi-agent reinforcement learning: a selective overview of theories and algorithms[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1911.10635.

    [42] GRONAUER S, DIEPOLD K. Multi-agent deep reinforcement learning: a survey[J]. Artificial Intelligence Review, 2022, 55(2): 895-943.

    [43] DU W, DING S F. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications[J]. Artificial Intelligence Review, 2021, 54(5): 3215-3238.

    [44] 吳軍, 徐昕, 王健, 等. 面向多機(jī)器人系統(tǒng)的增強(qiáng)學(xué)習(xí)研究進(jìn)展綜述[J]. 控制與決策, 2011, 26(11): 1601-1610.

    WU J, XU X, WANG J, et al. Recent advances of reinforcement learning in multi-robot systems: a survey[J]. Control and Decision, 2011, 26(11): 1601-1610.

    [45] 杜威, 丁世飛. 多智能體強(qiáng)化學(xué)習(xí)綜述[J]. 計(jì)算機(jī)科學(xué), 2019, 46(8): 1-8.

    DU W, DING S F. Overview on multi-agent reinforcement learning[J]. Computer Science, 2019, 46(8): 1-8.

    [46] 殷昌盛, 楊若鵬, 朱巍, 等. 多智能體分層強(qiáng)化學(xué)習(xí)綜述[J]. 智能系統(tǒng)學(xué)報(bào), 2020, 15(4): 646-655.

    YIN C S, YANG R P, ZHU W, et al. A survey on multi-agent hierarchical reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 646-655.

    [47] 梁星星, 馮旸赫, 馬揚(yáng), 等. 多Agent深度強(qiáng)化學(xué)習(xí)綜述[J]. 自動(dòng)化學(xué)報(bào), 2020, 46(12): 2537-2557.

    LIANG X X, FENG Y H, MA Y, et al. Deep multi-agent reinforcement learning: a survey[J]. Acta Automatica Sinica, 2020, 46(12): 2537-2557.

    [48] 孫長(zhǎng)銀, 穆朝絮. 多智能體深度強(qiáng)化學(xué)習(xí)的若干關(guān)鍵科學(xué)問(wèn)題[J]. 自動(dòng)化學(xué)報(bào), 2020, 46(7): 1301-1312.

    SUN C Y, MU C X. Important scientific problems of multi-agent deep reinforcement learning[J]. Acta Automatica Sinica, 2020, 46(7): 1301-1312.

    [49] MATIGNON L, LAURENT G J, LE F P. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems[J]. The Knowledge Engineering Review, 2012, 27(1): 1-31.

    [50] NOWE A, VRANCX P, HAUWERE Y M D. Game theory and multi-agent reinforcement learning[M]. Berlin: Springer,2012.

    [51] LU Y L, YAN K. Algorithms in multi-agent systems: a holistic perspective from reinforcement learning and game theory[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2001.06487.

    [52] YANG Y D, WANG J. An overview of multi-agent reinforcement learning from game theoretical perspective[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.00583v3.

    [53] BLOEMBERGEN D, TUYLS K, HENNES D, et al. Evolutionary dynamics of multi-agent learning: a survey[J]. Artificial Intelligence, 2015, 53(1): 659-697.

    [54] WONG A, BACK T, ANNA V, et al. Multiagent deep reinforcement learning: challenges and directions towards human-like approaches[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.15691.

    [55] OLIEHOEK F A, AMATO C. A concise introduction to decentralized POMDPs[M]. Berlin: Springer, 2016.

    [56] DOSHI P, ZENG Y F, CHEN Q Y. Graphical models for interactive POMDPs: representations and solutions[J]. Autonomous Agents and Multi-Agent Systems, 2009, 18(3): 376-386.

    [57] SHAPLEY L S. Stochastic games[J]. National Academy of Sciences of the United States of America, 1953, 39(10): 1095-1100.

    [58] LITTMAN M L. Markov games as a framework for multi-agent reinforcement learning[C]∥Proc.of the 11th International Conference on International Conference on Machine Learning, 1994: 157-163.

    [59] KOVAIK V, SCHMID M, BURCH N, et al. Rethinking formal models of partially observable multiagent decision making[J]. Artificial Intelligence, 2022, 303: 103645.

    [60] LOCKHART E, LANCTOT M, PEROLAT J, et al. Computing approximate equilibria in sequential adversarial games by exploitability descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.05614.

    [61] CUI Q, YANG L F. Minimax sample complexity for turn-based stochastic game[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.14267.

    [62] HERNANDEZ D, GBADAMOSI C, GOODMAN J, et al. Metagame autobalancing for competitive multiplayer games[C]∥Proc.of the IEEE Conference on Games, 2020: 275-282.

    [63] WELLMAN M P. Methods for empirical game-theoretic analysis[C]∥Proc.of the 21st National Conference on Artificial Intelligence, 2006: 1552-1555.

    [64] JIANG X, LIM L H, YAO Y, et al. Statistical ranking and combinatorial Hodge theory[J]. Mathematical Programming, 2011, 127(1): 203-244.

    [65] CANDOGAN O, MENACHE I, OZDAGLAR A, et al. Flows and decompositions of games: harmonic and potential games[J]. Mathematics of Operations Research, 2011, 36(3): 474-503.

    [66] HWANG S H, REY-BELLET L. Strategic decompositions of normal form games: zero-sum games and potential games[J]. Games and Economic Behavior, 2020, 122: 370-390.

    [67] BALDUZZI D, GARNELO M, BACHRACH Y, et al. Open-ended learning in symmetric zero-sum games[C]∥Proc.of the International Conference on Machine Learning, 2019: 434-443.

    [68] CZARNECKI W M, GIDEL G, TRACEY B, et al. Real world games look like spinning tops[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020: 17443-17454.

    [69] SANJAYA R, WANG J, YANG Y D. Measuring the non-transitivity in chess [EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2110.11737.

    [70] TUYLS K, PEROLAT J, LANCTOT M, et al. Bounds and dynamics for empirical game theoretic analysis[J]. Autonomous Agents and Multi-Agent Systems, 2020, 34(1): 7.

    [71] VIQUEIRA E A, GREENWALD A, COUSINS C, et al. Learning simulation-based games from data[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi Agent Systems, 2019: 1778-1780.

    [72] ROUGHGARDEN T. Twenty lectures on algorithmic game theory[M]. New York: Cambridge University Press, 2016.

    [73] BLUM A, HAGHTALAB N, HAJIAGHAYI M T, et al. Computing Stackelberg equilibria of large general-sum games[C]∥Proc.of the International Symposium on Algorithmic Game Theory, 2019: 168-182.

    [74] MILEC D, CERNY J, LISY V, et al. Complexity and algorithms for exploiting quantal opponents in large two-player games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5575-5583.

    [75] BALDUZZI D, TUYLS K, PEROLAT J, et al. Re-evaluating evaluation[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 3272-3283.

    [76] LI S H, WU Y, CUI X Y, et al. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 4213-4220.

    [77] YABU Y, YOKOO M, IWASAKI A. Multiagent planning with trembling-hand perfect equilibrium in multiagent POMDPs[C]∥Proc.of the Pacific Rim International Conference on Multi-Agents, 2017: 13-24.

    [78] GHOROGHI A. Multi-games and Bayesian Nash equilibriums[D]. London: University of London, 2015.

    [79] XU X, ZHAO Q. Distributed no-regret learning in multi-agent systems: challenges and recent developments[J]. IEEE Signal Processing Magazine, 2020, 37(3):84-91.

    [80] SUN Y, WEI X, YAO Z H, et al. Analysis of network attack and defense strategies based on Pareto optimum[J]. Electronics, 2018, 7(3): 36.

    [81] DENG X T, LI N Y, MGUNI D, et al. On the complexity of computing Markov perfect equilibrium in general-sum stochastic games[EB/OL]. [2021-11-01]. http:∥arxiv.org/abs/2109.01795.

    [82] BASILICO N, CELLI A, GATTI N, et al. Computing the team-maxmin equilibrium in single-team single-adversary team games[J]. Intelligenza Artificiale, 2017, 11(1): 67-79.

    [83] CELLI A, GATTI N. Computational results for extensive-form adversarial team games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1711.06930.

    [84] ZHANG Y Z, AN B. Computing team-maxmin equilibria in zero-sum multiplayer extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 2318-2325.

    [85] LI S X, ZHANG Y Z, WANG X R, et al. CFR-MIX: solving imperfect information extensive-form games with combinatorial action space[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.08440.

    [86] PROBO G. Multi-team games in adversarial settings: ex-ante coordination and independent team members algorithms[D]. Milano: Politecnico Di Milano, 2019.

    [87] ORTIZ L E, SCHAPIRE R E, KAKADE S M. Maximum entropy correlated equilibria[C]∥Proc.of the 11th International Conference on Artificial Intelligence and Statistics, 2007: 347-354.

    [88] GEMP I, SAVANI R, LANCTOT M, et al. Sample-based approximation of Nash in large many-player games via gradient descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.01285.

    [89] FARINA G, BIANCHI T, SANDHOLM T. Coarse correlation in extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 1934-1941.

    [90] FARINA G, CELLI A, MARCHESI A, et al. Simple uncoupled no-regret learning dynamics for extensive-form correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.01520.

    [91] XIE Q M, CHEN Y D, WANG Z R, et al. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2002.07066.

    [92] HUANG S J, YI P. Distributed best response dynamics for Nash equilibrium seeking in potential games[J]. Control Theory and Technology, 2020, 18(3): 324-332.

    [93] BOSANSKY B, KIEKINTVELD C, LISY V, et al. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information[J]. Journal of Artificial Intelligence Research, 2014, 51(1): 829-866.

    [94] HEINRICH T, JANG Y J, MUNGO C. Best-response dynamics, playing sequences, and convergence to equilibrium in random games[J]. International Journal of Game Theory, 2023, 52: 703-735.

    [95] FARINA G, CELLI A, MARCHESI A, et al. Simple uncoupled no-regret learning dynamics for extensive-form correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.01520.

    [96] HU S Y, LEUNG C W, LEUNG H F, et al. The evolutionary dynamics of independent learning agents in population games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16068.

    [97] LEONARDOS S, PILIOURAS G. Exploration-exploitation in multi-agent learning: catastrophe theory meets game theory[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 11263-11271.

    [98] POWERS R, SHOHAM Y. New criteria and a new algorithm for learning in multi-agent systems[C]∥Proc.of the 17th International Conference on Neural Information Processing Systems, 2004: 1089-1096.

    [99] DIGIOVANNI A, ZELL E C. Survey of self-play in reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.02850.

    [100] BOWLING M. Multiagent learning in the presence of agents with limitations[D]. Pittsburgh: Carnegie Mellon University, 2003.

    [101] BOWLING M H, VELOSO M M. Multi-agent learning using a variable learning rate[J]. Artificial Intelligence, 2002, 136(2): 215-250.

    [102] BOWLING M. Convergence and no-regret in multiagent learning[C]∥Proc.of the 17th International Conference on Neural Information Processing Systems, 2004: 209-216.

    [103] KAPETANAKIS S, KUDENKO D. Reinforcement learning of coordination in heterogeneous cooperative multi-agent systems[C]∥Proc.of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems, 2004: 1258-1259.

    [104] DAI Z X, CHEN Y Z, LOW K H, et al. R2-B2: recursive reasoning-based Bayesian optimization for no-regret learning in games[C]∥Proc.of the International Conference on Machine Learning, 2020: 2291-2301.

    [105] FREEMAN R, PENNOCK D M, PODIMATA C, et al. No-regret and incentive-compatible online learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2002.08837.

    [106] LITTMAN M L. Value-function reinforcement learning in Markov games[J]. Journal of Cognitive Systems Research, 2001, 2(1): 55-66.

    [107] FOERSTER J N, CHEN R Y, AL-SHEDIVAT M, et al. Learning with opponent-learning awareness[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1709.04326.

    [108] RADULESCU R, VERSTRAETEN T, ZHANG Y, et al. Opponent learning awareness and modelling in multi-objective normal form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.07290.

    [109] RONEN I B, MOSHE T. R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning[J]. Journal of Machine Learning Research, 2002, 3(10): 213-231.

    [110] HIMABINDU L, ECE K, RICH C, et al. Identifying unknown unknowns in the open world: representations and policies for guided exploration[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017: 2124-2132.

    [111] PABLO H, MICHAEL K. Learning against sequential opponents in repeated stochastic games[C]∥Proc.of the 3rd Multi-Disciplinary Conference on Reinforcement Learning and Decision Making, 2017.

    [112] PABLO H, YUSEN Z, MATTHEW E, et al. Efficiently detecting switches against non-stationary opponents[J]. Autonomous Agents and Multi-Agent Systems, 2017, 31(4): 767-789.

    [113] FRIEDRICH V D O, MICHAEL K, TIM M. The minds of many: opponent modelling in a stochastic game[C]∥Proc.of the 26th International Joint Conference on Artificial Intelligence, 2017: 3845-3851.

    [114] BAKKES S, SPRONCK P, HERIK H. Opponent modelling for case-based adaptive game AI[J]. Entertainment Computing, 2010, 1(1): 27-37.

    [115] PAPOUDAKIS G, CHRISTIANOS F, RAHMAN A, et al. Dealing with non-stationarity in multi-agent deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1906.04737.

    [116] DASKALAKIS C, GOLDBERG P W, PAPADIMITRIOU C H. The complexity of computing a Nash equilibrium[J]. SIAM Journal on Computing, 2009, 39(1):195-259.

    [117] CONITZER V, SANDHOLM T. Complexity results about Nash equilibria[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/0205074.

    [118] CONITZER V, SANDHOLM T. New complexity results about Nash equilibria[J]. Games and Economic Behavior, 2008, 63(2): 621-641.

    [119] ZHANG Y Z. Computing team-maxmin equilibria in zero-sum multiplayer games[D]. Singapore: Nanyang Technological University, 2020.

    [120] LAUER M, RIEDMILLER M. An algorithm for distributed reinforcement learning in cooperative multi-agent systems[C]∥Proc.of the 17th International Conference on Machine Learning, 2000: 535-542.

    [121] CLAUS C, BOUTILIER C. The dynamics of reinforcement learning in cooperative multiagent system[C]∥Proc.of the 15th National/10th Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, 1998: 746-752.

    [122] WANG X F, SANDHOLM T. Reinforcement learning to play an optimal Nash equilibrium in team Markov games[C]∥Proc.of the 15th International Conference on Neural Information Processing Systems, 2002: 1603-1610.

    [123] ARSLAN G, YUKSEL S. Decentralized q-learning for stochastic teams and games[J]. IEEE Trans.on Automatic Control, 2016, 62(4): 1545-1558.

    [124] HU J L, WELLMAN M P. Nash Q-learning for general-sum stochastic games[J]. Journal of Machine Learning Research, 2003, 4(11): 1039-1069.

    [125] GREENWALD A, HALL L, SERRANO R. Correlated-q learning[C]∥Proc.of the 20th International Conference on Machine Learning, 2003: 242-249.

    [126] KONONEN V. Asymmetric multi-agent reinforcement learning[J]. Web Intelligence and Agent Systems, 2004, 2(2): 105-121.

    [127] LITTMAN M L. Friend-or-foe q-learning in general-sum games[C]∥Proc.of the 18th International Conference on Machine Learning, 2001: 322-328.

    [128] SINGH S, KEARNS M, MANSOUR Y. Nash convergence of gradient dynamics in iterated general-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1301.3892.

    [129] ZINKEVICH M. Online convex programming and generalized infinitesimal gradient ascent[C]∥Proc.of the 20th International Conference on Machine Learning, 2003: 928-935.

    [130] CONITZER V, SANDHOLM T. AWESOME: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents[J]. Machine Learning, 2007, 67: 23-43.

    [131] TAN M. Multi-agent reinforcement learning: independent vs. cooperative agents[C]∥Proc.of the 10th International Conference on Machine Learning, 1993: 330-337.

    [132] LAETITIA M, GUILLAUME L, NADINE L F. Hysteretic Q learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams[C]∥Proc.of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007: 64-69.

    [133] MATIGNON L, LAURENT G, LE F P N. A study of FMQ heuristic in cooperative multi-agent games[C]∥Proc.of the 7th International Conference on Autonomous Agents and Multiagent Systems, 2008: 77-91.

    [134] WEI E, LUKE S. Lenient learning in independent-learner stochastic cooperative games[J]. Journal Machine Learning Research, 2016, 17(1): 2914-2955.

    [135] PALMER G. Independent learning approaches: overcoming multi-agent learning pathologies in team-games[D]. Liverpool: University of Liverpool, 2020.

    [136] SUKHBAATAR S, FERGUS R. Learning multiagent communication with backpropagation[C]∥Proc.of the 30th International Conference on Neural Information Processing Systems, 2016: 2244-2252.

    [137] PENG P, WEN Y, YANG Y D, et al. Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1703.10069.

    [138] JAKOB N F, GREGORY F, TRIANTAFYLLOS A, et al. Counterfactual multi-agent policy gradients[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2018: 2974-2982.

    [139] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 6382-6393.

    [140] WEI E, WICKE D, FREELAN D, et al. Multiagent soft q-learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1804.09817.

    [141] SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward[C]∥Proc.of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, 2018: 2085-2087.

    [142] RASHID T, SAMVELYAN M, WITT C S, et al. Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2018: 4292-4301.

    [143] MAHAJAN A, RASHID T, SAMVELYAN M, et al. MAVEN: multi-agent variational exploration[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1910.07483.

    [144] SON K, KIM D, KANG W J, et al. Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2019: 5887-5896.

    [145] YANG Y D, WEN Y, CHEN L H, et al. Multi-agent determinantal q-learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.01482.

    [146] YU C, VELU A, VINITSKY E, et al. The surprising effectiveness of MAPPO in cooperative, multi-agent games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2103.01955.

    [147] WANG J H, ZHANG Y, KIM T K, et al. Shapley q-value: a local reward approach to solve global reward games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 7285-7292.

    [148] RIEDMILLER M. Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method[C]∥Proc.of the European Conference on Machine Learning, 2005: 317-328.

    [149] NEDIC A, OLSHEVSKY A, SHI W. Achieving geometric convergence for distributed optimization over time-varying graphs[J]. SIAM Journal on Optimization, 2017, 27(4): 2597-2633.

    [150] ZHANG K Q, YANG Z R, LIU H, et al. Fully decentralized multi-agent reinforcement learning with networked agents[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1802.08757.

    [151] QU G N, LIN Y H, WIERMAN A, et al. Scalable multi-agent reinforcement learning for networked systems with average reward[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.06626.

    [152] CHU T, CHINCHALI S, KATTI S. Multi-agent reinforcement learning for networked system control[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2004.01339.

    [153] LESAGE-LANDRY A, CALLAWAY D S. Approximate multi-agent fitted q iteration[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.09343.

    [154] ZHANG K Q, YANG Z R, LIU H, et al. Finite-sample analysis for decentralized batch multi-agent reinforcement learning with networked agents[J]. IEEE Trans.on Automatic Control, 2021, 66(12): 5925-5940.

    [155] SANDHOLM T, GILPIN A, CONITZER V. Mixed-integer programming methods for finding Nash equilibria[C]∥Proc.of the 20th National Conference on Artificial Intelligence, 2005: 495-501.

    [156] YU N. Excessive gap technique in nonsmooth convex minimization[J]. SIAM Journal on Optimization, 2005, 16(1): 235-249.

    [157] SUN Z F, NAKHAI M R. An online mirror-prox optimization approach to proactive resource allocation in MEC[C]∥Proc.of the IEEE International Conference on Communications, 2020.

    [158] AMIR B, MARC T. Mirror descent and nonlinear projected subgradient methods for convex optimization[J]. Operations Research Letters, 2003, 31(3): 167-175.

    [159] LOCKHART E, LANCTOT M, PEROLAT J, et al. Computing approximate equilibria in sequential adversarial games by exploitability descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.05614.

    [160] LAN S. Geometrical regret matching: a new dynamics to Nash equilibrium[J]. AIP Advances, 2020, 10(6): 065033.

    [161] VON S B, FORGES F. Extensive-form correlated equilibrium: definition and computational complexity[J]. Mathematics of Operations Research, 2008, 33(4): 1002-1022.

    [162] CESA-BIANCHI N, LUGOSI G. Prediction, learning, and games[M]. Cambridge: Cambridge University Press, 2006.

    [163] FREUND Y, SCHAPIRE R E. Adaptive game playing using multiplicative weights[J]. Games and Economic Behavior, 1999, 29(1/2): 79-103.

    [164] HART S, MAS-COLELL A. A general class of adaptive strategies[J]. Journal of Economic Theory, 2001, 98(1): 26-54.

    [165] LEMKE C E, HOWSON J T. Equilibrium points of bimatrix games[J]. Journal of the Society for Industrial and Applied Mathematics, 1964, 12 (2): 413-423.

    [166] PORTER R, NUDELMAN E, SHOHAM Y. Simple search methods for finding a Nash equilibrium[J]. Games and Economic Behavior, 2008, 63(2): 642-662.

    [167] CEPPI S, GATTI N, PATRINI G, et al. Local search techniques for computing equilibria in two-player general-sum strategic form games[C]∥Proc.of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010: 1469-1470.

    [168] CELLI A, CONIGLIO S, GATTI N. Computing optimal ex ante correlated equilibria in two-player sequential games[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multiagent Systems, 2019: 909-917.

    [169] VON S B, FORGES F. Extensive-form correlated equilibrium: definition and computational complexity[J]. Mathematics of Operations Research, 2008, 33(4): 1002-1022.

    [170] FARINA G, LING C K, FANG F, et al. Efficient regret minimization algorithm for extensive-form correlated equilibrium[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 5186-5196.

    [171] PAPADIMITRIOU C H, ROUGHGARDEN T. Computing correlated equilibria in multi-player games[J]. Journal of the ACM, 2008, 55(3): 14.

    [172] CELLI A, MARCHESI A, BIANCHI T, et al. Learning to correlate in multi-player general-sum sequential games[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 13076-13086.

    [173] JIANG A X, KEVIN L B. Polynomial-time computation of exact correlated equilibrium in compact games[J]. Games and Economic Behavior, 2015, 100(91): 119-126.

    [174] FOSTER D P, YOUNG H P. Regret testing: learning to play Nash equilibrium without knowing you have an opponent[J]. Theoretical Economics, 2006, 1(3): 341-367.

    [175] ABERNETHY J, BARTLETT P L, HAZAN E. Blackwell approachability and no-regret learning are equivalent[C]∥Proc.of the 24th Annual Conference on Learning Theory, 2011: 27-46.

    [176] FARINA G, KROER C, SANDHOLM T. Faster game solving via predictive Blackwell approachability: connecting regret matching and mirror descent[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5363-5371.

    [177] SRINIVASAN S, LANCTOT M, ZAMBALDI V, et al. Actor-critic policy optimization in partially observable multiagent environments[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 3426-3439.

    [178] ZINKEVICH M, JOHANSON M, BOWLING M, et al. Regret minimization in games with incomplete information[C]∥Proc.of the 20th International Conference on Neural Information Processing Systems, 2007: 1729-1736.

    [179] BOWLING M, BURCH N, JOHANSON M, et al. Heads-up limit hold’em poker is solved[J]. Science, 2015, 347(6218): 145-149.

    [180] BROWN N, LERER A, GROSS S, et al. Deep counterfactual regret minimization[C]∥Proc.of the International Conference on Machine Learning, 2019: 793-802.

    [181] BROWN N, SANDHOLM T. Solving imperfect-information games via discounted regret minimization[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1829-1836.

    [182] LI H L, WANG X, QI S H, et al. Solving imperfect-information games via exponential counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2008.02679.

    [183] LANCTOT M, WAUGH K, ZINKEVICH M, et al. Monte Carlo sampling for regret minimization in extensive games[C]∥Proc.of the 22nd International Conference on Neural Information Processing Systems, 2009: 1078-1086.

    [184] LI H, HU K L, ZHANG S H, et al. Double neural counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1812.10607.

    [185] JACKSON E G. Targeted CFR[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017.

    [186] SCHMID M, BURCH N, LANCTOT M, et al. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 2157-2164.

    [187] ZHOU Y C, REN T Z, LI J L, et al. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1810.04433.

    [188] WAUGH K, MORRILL D, BAGNELL J A, et al. Solving games with functional regret estimation[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2015: 2138-2144.

    [189] D’ORAZIO R, MORRILL D, WRIGHT J R, et al. Alternative function approximation parameterizations for solving games: an analysis of f-regression counterfactual regret minimization[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020: 339-347.

    [190] PILIOURAS G, ROWLAND M, OMIDSHAFIEI S, et al. Evolutionary dynamics and Φ-regret minimization in games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.14668v1.

    [191] STEINBERGER E. Single deep counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1901.07621.

    [192] LI H L, WANG X, GUO Z Y, et al. RLCFR: minimize counterfactual regret with neural networks[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.12328.

    [193] LI H L, WANG X, JIA F W, et al. RLCFR: minimize counterfactual regret by deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.06373.

    [194] LIU W M, LI B, TOGELIUS J. Model-free neural counterfactual regret minimization with bootstrap learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2012.01870.

    [195] SCHMID M, MORAVCIK M, BURCH N, et al. Player of games[EB/OL]. [2021-12-30]. http:∥arxiv.org/abs/2112.03178.

    [196] CHRISTIAN K, KEVIN W, FATMA K K, et al. Faster first-order methods for extensive-form game solving[C]∥Proc.of the 16th ACM Conference on Economics and Computation, 2015: 817-834.

    [197] LESLIE D S, COLLINS E J. Generalised weakened fictitious play[J]. Games and Economic Behavior, 2006, 56(2): 285-298.

    [198] KROER C, WAUGH K, KILINC-KARZAN F, et al. Faster algorithms for extensive-form game solving via improved smoothing functions[J]. Mathematical Programming, 2020, 179(1): 385-417.

    [199] FARINA G, KROER C, SANDHOLM T. Optimistic regret minimization for extensive-form games via dilated distance-generating functions[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 5221-5231.

    [200] LIU W M, JIANG H C, LI B, et al. Equivalence analysis between counterfactual regret minimization and online mirror descent[EB/OL]. [2021-12-11]. http:∥arxiv.org/abs/2110.04961.

    [201] PEROLAT J, MUNOS R, LESPIAU J B, et al. From Poincaré recurrence to convergence in imperfect information games: finding equilibrium via regularization[C]∥Proc.of the International Conference on Machine Learning, 2021: 8525-8535.

    [202] MUNOS R, PEROLAT J, LESPIAU J B, et al. Fast computation of Nash equilibria in imperfect information games[C]∥Proc.of the International Conference on Machine Learning, 2020: 7119-7129.

    [203] KAWAMURA K, MIZUKAMI N, TSURUOKA Y. Neural fictitious self-play in imperfect information games with many players[C]∥Proc.of the Workshop on Computer Games, 2017: 61-74.

    [204] ZHANG L, CHEN Y X, WANG W, et al. A Monte Carlo neural fictitious self-play approach to approximate Nash equilibrium in imperfect-information dynamic games[J]. Frontiers of Computer Science, 2021, 15(5): 155334.

    [205] STEINBERGER E, LERER A, BROWN N. DREAM: deep regret minimization with advantage baselines and model-free learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.10410.

    [206] BROWN N, BAKHTIN A, LERER A, et al. Combining deep reinforcement learning and search for imperfect-information games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2007.13544.

    [207] GRUSLYS A, LANCTOT M, MUNOS R, et al. The advantage regret-matching actor-critic[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2008.12234.

    [208] CHEN Y X, ZHANG L, LI S J, et al. Optimize neural fictitious self-play in regret minimization thinking[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.10845.

    [209] SONZOGNI S. Depth-limited approaches in adversarial team games[D]. Milano: Politecnico Di Milano, 2019.

    [210] ZHANG Y Z, AN B. Converging to team maxmin equilibria in zero-sum multiplayer games[C]∥Proc.of the International Conference on Machine Learning, 2020: 11033-11043.

    [211] ZHANG Y Z, AN B, LONG T T, et al. Optimal escape interdiction on transportation networks[C]∥Proc.of the 26th International Joint Conference on Artificial Intelligence, 2017: 3936-3944.

    [212] ZHANG Y Z, AN B. Computing ex ante coordinated team-maxmin equilibria in zero-sum multiplayer extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5813-5821.

    [213] ZHANG Y Z, GUO Q Y, AN B, et al. Optimal interdiction of urban criminals with the aid of real-time information[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1262-1269.

    [214] BOTVINICK M, RITTER S, WANG J X, et al. Reinforcement learning, fast and slow[J]. Trends in Cognitive Sciences, 2019, 23(5): 408-422.

    [215] LANCTOT M, ZAMBALDI V, GRUSLYS A, et al. A unified game-theoretic approach to multiagent reinforcement learning[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 4193-4206.

    [216] MULLER P, OMIDSHAFIEI S, ROWLAND M, et al. A generalized training approach for multiagent learning[C]∥Proc.of the 8th International Conference on Learning Representations, 2020.

    [217] SUN P, XIONG J C, HAN L, et al. TLeague: a framework for competitive self-play based distributed multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.12895.

    [218] ZHOU M, WAN Z Y, WANG H J, et al. MALib: a parallel framework for population-based multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.07551.

    [219] LISY V, BOWLING M. Equilibrium approximation quality of current no-limit poker bots[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017.

    [220] CLOUD A, LABER E. Variance decompositions for extensive-form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.04834.

    [221] SUSTR M, SCHMID M, MORAVCIK M. Sound algorithms in imperfect information games[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multiagent Systems, 2021: 1674-1676.

    [222] BREANNA M. Comparing Elo, Glicko, IRT, and Bayesian IRT statistical models for educational and gaming data[D]. Fayetteville: University of Arkansas, 2019.

    [223] PANKIEWICZ M, BATOR M. Elo rating algorithm for the purpose of measuring task difficulty in online learning environments[J]. E-Mentor, 2019, 82(5): 43-51.

    [224] GLICKMAN M E. The glicko system[M]. Boston: Boston University, 1995.

    [225] HERBRICH R, MINKA T, GRAEPEL T. TrueSkill?: a Bayesian skill rating system[C]∥Proc.of the 19th International Conference on Neural Information Processing Systems, 2006: 569-576.

    [226] OMIDSHAFIEI S, PAPADIMITRIOU C, PILIOURAS G, et al. α-Rank: multi-agent evaluation by evolution[J]. Scientific Reports, 2019, 9(1): 9937.

    [227] YANG Y D, TUTUNOV R, SAKULWONGTANA P, et al. αα-Rank: practically scaling α-rank through stochastic optimisation[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020: 1575-1583.

    [228] ROWLAND M, OMIDSHAFIEI S, TUYLS K, et al. Multiagent evaluation under incomplete information[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 12291-12303.

    [229] RASHID T, ZHANG C, CIOSEK K, et al. Estimating α-rank by maximizing information gain[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5673-5681.

    [230] DU Y L, YAN X, CHEN X, et al. Estimating α-rank from a few entries with low rank matrix completion[C]∥Proc.of the International Conference on Machine Learning, 2021: 2870-2879.

    [231] ROOHI S, GUCKELSBERGER C, RELAS A, et al. Predicting game engagement and difficulty using AI players[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.12061.

    [232] O'BRIEN J D, GLEESON J P. A complex networks approach to ranking professional Snooker players[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2010.08395.

    [233] JORDAN S M, CHANDAK Y, COHEN D, et al. Evaluating the performance of reinforcement learning algorithms[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16958.

    [234] DEHPANAH A, GHORI M F, GEMMELL J, et al. The evaluation of rating systems in online free-for-all games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16958.

    [235] LEIBO J Z, DUEEZ-GUZMAN E, VEZHNEVETS A S, et al. Scalable evaluation of multi-agent reinforcement learning with melting pot[C]∥Proc.of the International Conference on Machine Learning, 2021: 6187-6199.

    [236] EBTEKAR A, LIU P. Elo-MMR: a rating system for massive multiplayer competitions[C]∥Proc.of the Web Conference, 2021: 1772-1784.

    [237] DEHPANAH A, GHORI M F, GEMMELL J, et al. Evaluating team skill aggregation in online competitive games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.11397.

    [238] HERNANDEZ D, DENAMGANAI K, DEVLIN S, et al. A comparison of self-play algorithms under a generalized framework[J]. IEEE Trans.on Games, 2021, 14(2): 221-231.

    [239] LEIGH R, SCHONFELD J, LOUIS S J. Using coevolution to understand and validate game balance in continuous games[C]∥Proc.of the 10th Annual Conference on Genetic and Evolutionary Computation, 2008: 1563-1570.

    [240] SAYIN M O, PARISE F, OZDAGLAR A. Fictitious play in zero-sum stochastic games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2010.04223.

    [241] JADERBERG M, CZARNECKI W M, DUNNING I, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning[J]. Science, 2019, 364(6443): 859-865.

    [242] SAMUEL A L. Some studies in machine learning using the game of checkers[J]. IBM Journal of Research and Development, 2000, 44(1/2): 206-226.

    [243] BANSAL T, PACHOCKI J, SIDOR S, et al. Emergent complexity via multi-agent competition[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1710.03748.

    [244] SUKHBAATAR S, LIN Z, KOSTRIKOV I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1703.05407.

    [245] ADAM L, HORCIK R, KASL T, et al. Double oracle algorithm for computing equilibria in continuous games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5070-5077.

    [246] WANG Y Z, MA Q R, WELLMAN M P. Evaluating strategy exploration in empirical game-theoretic analysis[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.10423.

    [247] SHOHEI O. Unbiased self-play[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.03007.

    [248] HENDON E, JACOBSEN H J, SLOTH B. Fictitious play in extensive form games[J]. Games and Economic Behavior, 1996, 15(2): 177-202.

    [249] HEINRICH J, LANCTOT M, SILVER D. Fictitious self-play in extensive-form games[C]∥Proc.of the International Conference on Machine Learning, 2015: 805-813.

    [250] LIU B Y, YANG Z R, WANG Z R. Policy optimization in zero-sum Markov games: fictitious self-play provably attains Nash equilibria[EB/OL]. [2021-08-01]. https:∥openreview.net/forum?id=c3MWGN_cTf.

    [251] HOFBAUER J, SANDHOLM W H. On the global convergence of stochastic fictitious play[J]. Econometrica, 2002, 70(6): 2265-2294.

    [252] FARINA G, CELLI A, GATTI N, et al. Ex ante coordination and collusion in zero-sum multi-player extensive-form games[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 9661-9671.

    [253] HEINRICH J. Deep reinforcement learning from self-play in imperfect-information games[D]. London: University College London, 2016.

    [254] NIEVES N P, YANG Y, SLUMBERS O, et al. Modelling behavioural diversity for learning in open-ended games[C]∥Proc.of the International Conference on Machine Learning, 2021: 8514-8524.

    [255] KLIJN D, EIBEN A E. A coevolutionary approach to deep multi-agent reinforcement learning[C]∥Proc.of the Genetic and Evolutionary Computation Conference, 2021.

    [256] WRIGHT M, WANG Y, WELLMAN M P. Iterated deep reinforcement learning in games: history-aware training for improved stability[C]∥Proc.of the ACM Conference on Economics and Computation, 2019: 617-636.

    [257] SMITH M O, ANTHONY T, WANG Y, et al. Learning to play against any mixture of opponents[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.14180.

    [258] SMITH M O, ANTHONY T, WELLMAN M P. Iterative empirical game solving via single policy best response[C]∥Proc.of the International Conference on Learning Representations, 2020.

    [259] MARRIS L, MULLER P, LANCTOT M, et al. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.09435.

    [260] MCALEER S, LANIER J, FOX R, et al. Pipeline PSRO: a scalable approach for finding approximate Nash equilibria in large games[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020, 33: 20238-20248.

    [261] DINH L C, YANG Y, TIAN Z, et al. Online double oracle[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2103.07780.

    [262] FENG X D, SLUMBERS O, YANG Y D, et al. Discovering multi-agent auto-curricula in two-player zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.02745.

    [263] MCALEER S, WANG K, LANCTOT M, et al. Anytime optimal PSRO for two-player zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2201.07700.

    [264] ZHOU M, CHEN J X, WEN Y, et al. Efficient policy space response oracles[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.00633.

    [265] LIU S Q, MARRIS L, HENNES D, et al. NeuPL: neural population learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.07415.

    [266] YANG Y D, LUO J, WEN Y, et al. Diverse auto-curriculum is critical for successful real-world multiagent learning systems[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multiagent Systems, 2021: 51-56.

    [267] WU Z, LI K, ZHAO E M, et al. L2E: learning to exploit your opponent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.09381.

    [268] LEIBO J Z, HUGHES E, LANCTOT M, et al. Autocurricula and the emergence of innovation from social interaction: a manifesto for multi-agent intelligence research[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.00742.

    [269] LIU X Y, JIA H T, WEN Y, et al. Unifying behavioral and response diversity for open-ended learning in zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.04958.

    [270] MOURET J B. Evolving the behavior of machines: from micro to macroevolution[J]. Iscience, 2020, 23(11): 101731.

    [271] MCKEE K R, LEIBO J Z, BEATTIE C, et al. Quantifying environment and population diversity in multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.08370.

    [272] PACCHIANO A, HOLDER J P, CHOROMANSKI K M, et al. Effective diversity in population-based reinforcement learning[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020: 18050-18062.

    [273] MASOOD M A, FINALE D V. Diversity-inducing policy gradient: using maximum mean discrepancy to find a set of diverse policies[C]∥Proc.of the 28th International Joint Conference on Artificial Intelligence, 2019: 5923-5929.

    [274] GARNELO M, CZARNECKI W M, LIU S, et al. Pick your battles: interaction graphs as population-level objectives for strategic diversity[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multi-Agent Systems, 2021: 1501-1503.

    [275] TAVARES A, AZPURUA H, SANTOS A, et al. Rock, paper, StarCraft: strategy selection in real-time strategy games[C]∥Proc.of the 12th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2016: 93-99.

    [276] PABLO H L, ENRIQUE M C, SUCAR L E. A framework for learning and planning against switching strategies in repeated games[J]. Connection Science, 2014, 26(2): 103-122.

    [277] FEI Y J, YANG Z R, WANG Z R, et al. Dynamic regret of policy optimization in non-stationary environments[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2020: 6743-6754.

    [278] WRIGHT M, VOROBEYCHIK Y. Mechanism design for team formation[C]∥Proc.of the 29th AAAI Conference on Artificial Intelligence, 2015: 1050-1056.

    [279] AUER P, JAKSCH T, ORTNER R, et al. Near-optimal regret bounds for reinforcement learning[C]∥Proc.of the 21st International Conference on Neural Information Processing Systems, 2008: 89-96.

    [280] HE J F, ZHOU D R, GU Q Q, et al. Nearly optimal regret for learning adversarial MDPs with linear function approximation[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.08940.

    [281] MEHDI J J, RAHUL J, ASHUTOSH N. Online learning for unknown partially observable MDPs[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.12661.

    [282] TIAN Y, WANG Y H, YU T C, et al. Online learning in unknown Markov games[C]∥Proc.of the International Conference on Machine Learning, 2021: 10279-10288.

    [283] KASH I A, SULLINS M, HOFMANN K. Combining no-regret and q-learning[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multi-Agent Systems, 2020: 593-601.

    [284] LIN T Y, ZHOU Z Y, MERTIKOPOULOS P, et al. Finite-time last-iterate convergence for multi-agent learning in games[C]∥Proc.of the International Conference on Machine Learning, 2020: 6161-6171.

    [285] LEE C W, KROER C, LUO H P. Last-iterate convergence in extensive-form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.14326.

    [286] DASKALAKIS C, FISHELSON M, GOLOWICH N. Near-optimal no-regret learning in general games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2108.06924.

    [287] MORRILL D, D’ORAZIO R, SARFATI R, et al. Hindsight and sequential rationality of correlated play[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5584-5594.

    [288] LI X. Opponent modeling and exploitation in poker using evolved recurrent neural networks[D]. Austin: University of Texas at Austin, 2018.

    [289] GANZFRIED S. Computing strong game-theoretic strategies and exploiting suboptimal opponents in large games[D]. Pittsburgh: Carnegie Mellon University, 2015.

    [290] DAVIS T, WAUGH K, BOWLING M. Solving large extensive-form games with strategy constraints[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1861-1868.

    [291] KIM D K, LIU M, RIEMER M, et al. A policy gradient algorithm for learning to learn in multiagent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2021: 5541-5550.

    [292] SILVA F, COSTA A, STONE P. Building self-play curricula online by playing with expert agents in adversarial games[C]∥Proc.of the 8th Brazilian Conference on Intelligent Systems, 2019: 479-484.

    [293] SUSTR M, KOVARIK V, LISY V. Monte Carlo continual resolving for online strategy computation in imperfect information games[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi-Agent Systems, 2019: 224-232.

    [294] BROWN N, SANDHOLM T. Safe and nested subgame solving for imperfect-information games[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 689-699.

    [295] TIAN Z. Opponent modelling in multi-agent systems[D]. London: University College London, 2021.

    [296] WANG T H, DONG H, LESSER V, et al. ROMA: multi-agent reinforcement learning with emergent roles[C]∥Proc.of the International Conference on Machine Learning, 2020: 9876-9886.

    [297] GONG L X, FENG X C, YE D Z, et al. OptMatch: optimized matchmaking via modeling the high-order interactions on the arena[C]∥Proc.of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020: 2300-2310.

    [298] HU H Y, LERER A, PEYSAKHOVICH A, et al. “Other-play” for zero-shot coordination[C]∥Proc.of the International Conference on Machine Learning, 2020: 4399-4410.

    [299] TREUTLEIN J, DENNIS M, OESTERHELD C, et al. A new formalism, method and open issues for zero-shot coordination[C]∥Proc.of the International Conference on Machine Learning, 2021: 10413-10423.

    [300] LUCERO C, IZUMIGAWA C, FREDERIKSEN K, et al. Human-autonomy teaming and explainable AI capabilities in RTS games[C]∥Proc.of the International Conference on Human-Computer Interaction, 2020: 161-171.

    [301] WAYTOWICH N, BARTON S L, LAWHERN V, et al. Grounding natural language commands to StarCraft II game states for narration-guided reinforcement learning[J]. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019, 11006: 267-276.

    [302] SIU H C, PENA J D, CHANG K C, et al. Evaluation of human-AI teams for learned and rule-based agents in Hanabi[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.07630.

    [303] KOTSERUBA I, TSOTSOS J K. 40 years of cognitive architectures: core cognitive abilities and practical applications[J]. Artificial Intelligence Review, 2020, 53(1): 17-94.

    [304] ALEXANDER K. Adversarial reasoning: computational approaches to reading the opponent’s mind[M]. Boca Raton: Chapman & Hall/CRC, 2006.

    [305] KULKARNI A. Synthesis of interpretable and obfuscatory behaviors in human-aware AI systems[D]. Arizona: Arizona State University, 2020.

    [306] ZHENG Y, HAO J Y, ZHANG Z Z, et al. Efficient policy detecting and reusing for non-stationarity in Markov games[J]. Autonomous Agents and Multi-Agent Systems, 2021, 35(1): 1-29.

    [307] SHEN M, HOW J P. Safe adaptation in multiagent competition[EB/OL]. [2022-03-12]. http:∥arxiv.org/abs/2203.07562.

    [308] HAWKIN J. Automated abstraction of large action spaces in imperfect information extensive-form games[D]. Edmonton: University of Alberta, 2014.

    [309] ABEL D. A theory of abstraction in reinforcement learning[D]. Providence: Brown University, 2020.

    [310] YANG Y D, RUI L, LI M N, et al. Mean field multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2018: 5571-5580.

    [311] JI K Y. Bilevel optimization for machine learning: algorithm design and convergence analysis[D]. Columbus: Ohio State University, 2020.

    [312] BOSSENS D M, TARAPORE D. Quality-diversity meta-evolution: customising behaviour spaces to a meta-objective[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2109.03918v1.

    [313] MAJID A Y, SAAYBI S, RIETBERGEN T, et al. Deep reinforcement learning versus evolution strategies: a comparative survey[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2110.01411.

    [314] RAMPONI G. Challenges and opportunities in multi-agent reinforcement learning[D]. Milano: Politecnico Di Milano, 2021.

    [315] KHETARPAL K, RIEMER M, RISH I, et al. Towards continual reinforcement learning: a review and perspectives[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2012.13490.

    [316] MENG D Y, ZHAO Q, JIANG L. A theoretical understanding of self-paced learning[J]. Information Sciences, 2017, 414: 319-328.

    [317] 尹奇跃, 赵美静, 倪晚成, 等. 兵棋推演的智能决策技术与挑战[J]. 自动化学报, 2021, 47(5): 913-928.

    YIN Q Y, ZHAO M J, NI W C, et al. Intelligent decision making technology and challenge of wargame[J]. Acta Automatica Sinica, 2021, 47(5): 913-928.

    [318] 程恺, 陈刚, 余晓晗, 等. 知识牵引与数据驱动的兵棋AI设计及关键技术[J]. 系统工程与电子技术, 2021, 43(10): 2911-2917.

    CHENG K, CHEN G, YU X H, et al. Knowledge traction and data-driven wargame AI design and key technologies[J]. Systems Engineering and Electronics, 2021, 43(10): 2911-2917.

    [319] 蒲志强, 易建强, 刘振, 等. 知识和数据协同驱动的群体智能决策方法研究综述[J]. 自动化学报, 2022, 48(3): 627-643.

    PU Z Q, YI J Q, LIU Z, et al. Knowledge-based and data-driven integrating methodologies for collective intelligence decision making: a survey[J]. Acta Automatica Sinica, 2022, 48(3): 627-643.

    [320] 张驭龙, 范长俊, 冯旸赫, 等. 任务级兵棋智能决策技术框架设计与关键问题分析[J]. 指挥与控制学报, 2024, 10(1): 19-25.

    ZHANG Y L, FAN C J, FENG Y H, et al. Technical framework design and key issues analysis in task-level wargame intelligent decision making[J]. Journal of Command and Control, 2024, 10(1): 19-25.

    [321] CHEN L L, LU K, RAJESWARAN A, et al. Decision transformer: reinforcement learning via sequence modeling[C]∥Proc.of the 35th Conference on Neural Information Processing Systems, 2021: 15084-15097.

    [322] MENG L H, WEN M N, YANG Y D, et al. Offline pre-trained multi-agent decision transformer: one big sequence model conquers all StarCraft II tasks[EB/OL]. [2022-01-01]. http:∥arxiv.org/abs/2112.02845.

    [323] ZHENG Q Q, ZHANG A, GROVER A. Online decision transformer [EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.05607.

    [324] MATHIEU M, OZAIR S, SRINIVASAN S, et al. StarCraft II unplugged: large scale offline reinforcement learning[C]∥Proc.of the Deep RL Workshop NeurIPS 2021, 2021.

    [325] SAMVELYAN M, RASHID T, SCHROEDER D W C, et al. The StarCraft multi-agent challenge[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi-agent Systems, 2019: 2186-2188.

    [326] LANCTOT M, LOCKHART E, LESPIAU J B, et al. OpenSpiel: a framework for reinforcement learning in games[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/1908.09453.

    [327] TERRY J K, BLACK B, GRAMMEL N, et al. PettingZoo: gym for multi-agent reinforcement learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2009.14471.

    [328] PRETORIUS A, TESSERA K, SMIT A P, et al. MAVA: a research framework for distributed multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.01460.

    [329] YAO M, YIN Q Y, YANG J, et al. The partially observable asynchronous multi-agent cooperation challenge[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2112.03809.

    [330] MORITZ P, NISHIHARA R, WANG S, et al. Ray: a distributed framework for emerging AI applications[C]∥Proc.of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018: 561-577.

    [331] ESPEHOLT L, MARINIER R, STANCZYK P, et al. SEED RL: scalable and efficient deep-RL with accelerated central inference[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/1910.06591.

    [332] MOHANTY S, NYGREN E, LAURENT F, et al. Flatland-RL: multi-agent reinforcement learning on trains[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2012.05893.

    [333] SUN P, XIONG J C, HAN L, et al. Tleague: a framework for competitive self-play based distributed multi-agent reinforcement learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2011.12895.

    [334] ZHOU M, WAN Z Y, WANG H J, et al. MALib: a parallel framework for population-based multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.07551.

    作者简介

    罗俊仁(1989—),男,博士研究生,主要研究方向为多智能体学习、智能博弈。

    张万鹏(1981—),男,研究员,博士,主要研究方向为大数据智能、智能演进。

    苏炯铭(1984—),男,副研究员,博士,主要研究方向为可解释人工智能、智能博弈。

    袁唯淋(1994—),男,博士研究生,主要研究方向为智能博弈、多智能体强化学习。

    陈 璟(1972—),男,教授,博士,主要研究方向为认知决策博弈、分布式智能。
