CLC number: O29  Document code: A  DOI: 10.19907/j.0490-6756.2025.240168
Intelligent path planning for small modular reactors based on improved reinforcement learning
DONG Yun-Feng, ZHOU Wei-Zheng, WANG Zhe-Zheng, ZHANG Xiao (School of Mathematics, Sichuan University, Chengdu 610065, China)
Abstract: The small modular reactor (SMR) is at the research forefront of nuclear reactor technology. Nowadays, the advancement of intelligent control technologies paves a new way to the design and construction of unmanned SMRs. The autonomous control process of an SMR can be divided into three stages, namely, state diagnosis, autonomous decision making and coordinated control. In this paper, the autonomous state recognition and task planning of an unmanned SMR are investigated. An operating condition recognition method based on the knowledge base of SMR operation is proposed by using artificial neural network (ANN) technology, which constructs a basis for the state judgment of intelligent reactor control path planning. An improved reinforcement learning path planning algorithm is utilized to implement the path transfer decision making; this algorithm performs condition transitions with minimal cost under specified modes. In summary, the full-range control path intelligent decision-planning technology of the SMR is realized, which provides a theoretical basis for the design and construction of unmanned SMRs in the future.
Keywords: Small modular reactor; Operating condition recognition; Path planning; Reinforcement learning
1 Introduction
Small modular reactors (SMRs) have many advantages such as low power density, small size, short construction period and high safety performance [1]. In some long-term unmanned environments, the stability of the SMR is crucial. On the other hand, many operations in SMRs currently still rely heavily on manual labor, which inevitably leads to errors and degraded stability. When it comes to condition monitoring or operating condition recognition of an SMR, subtle or rapid changes often make it difficult for the operator to interpret the trends in interacting variables. As a result, erroneous decisions can be made due to the lack of knowledge of the real operational status of the SMR [2].
Unmanned SMRs (USMRs) can be deployed in remote areas rarely visited by humans, or even in deep-sea and deep-space environments. Nowadays, the SMR needs to evolve toward higher levels of automation and intelligence, and the industry urgently requires breakthroughs in the key technologies that allow SMRs to operate safely and stably across various environments [3].
The basic operations adopted for reactor condition monitoring in SMRs, both domestically and internationally, follow a human-machine collaboration mode. That is to say, machines autonomously control the system according to predefined paths, while humans supervise and analyze the control system's performance in real time. When deviations from expected outcomes or a loss of the system's ability to maintain automatic operation are detected, manual intervention is needed to ensure that the system resumes normal operation or that the device enters a relatively safe state. It is widely believed that the core role of humans throughout the operation process is information analysis and decision making, which leads to the idea of replacing humans with machines.
In recent years, with the revolution of artificial intelligence (AI), the industry has introduced various control methods into nuclear reactor control systems [1], such as PID control [4-6], intelligent control [7-12] and composite control [13-15], which creates partial conditions for the unmanned operation of SMRs. To achieve the stable operation of SMRs, it is crucial to possess capabilities for cognitive analysis of the operating condition of the system and decision making for task planning.
The operating condition refers to the current working state of the SMR. In the intelligent path planning decision across the entire spectrum of SMR operation, the capability for cognitive analysis of the operating condition refers to the measurement and monitoring of equipment operational status information and characteristic parameters, including data such as temperature, cycle, power and pressure, to determine whether the system is operating normally and to identify the current operating condition of the SMR. This can be viewed as an operating condition recognition problem.
Alternatively, path planning problems are essentially involved in the task planning and decision-making capabilities of the SMR. Specifically, this refers to the problem of path planning in a finite directed graph and the associated combinatorial optimization problem. In the SMR, every operating condition can be regarded as a node in the graph, and the transition between any two operating conditions can be regarded as a connection between nodes. The ultimate goal is to solve the path planning problem for given requirements.
In this paper, to address the challenges of operating condition recognition and path planning in the intelligent planning of full-range control paths for SMRs, we utilize a feedforward neural network (FNN) for operating condition recognition and propose an improved reinforcement learning path planning algorithm based on lattice theory. This approach enables the intelligent planning of the SMR.
2 Algorithms and models
There are both safe and accident operating conditions in an SMR. It is necessary to categorize specific types of operating conditions, such as "startup", "power operation", "power transition" and "shutdown", as typical operating conditions [16], and accident operating conditions such as "breaching accident" [17, 18] and "loss of flow" [19]. There are hundreds of specific operating conditions [20]. To enable the intelligent operation of the SMR, real-time operating condition recognition is imperative [20-22].
The operating condition is not static during the operation of the SMR. Depending on the changes in operational requirements, it may be necessary to transition the reactor from its current state to a target state. When operational requirements change, the first step is to assess the current state, and then to plan a transition path based on the relationship between different operating conditions and the transition requirement. The transition process has three key features.
(i) Not all operating condition nodes are necessarily connected.
(ii) There may be multiple directed paths between two nodes.
(iii) Each directed path may have various weight parameters such as time, energy and overshoot.
Minimizing one of these weights corresponds to one of three transition modes, namely the economic mode, the maneuver mode and the stable mode, as sketched below.
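To make the transfer environment concrete, the following minimal Python sketch (ours, not taken from the paper; all node numbers and weight values are hypothetical, and the pairing of modes to weights is our reading of the description) represents the operating conditions as nodes of a directed graph whose edges each carry the three weights, with the transition mode selecting which weight to minimize.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    target: int       # operating condition node reached by this transition
    time: float       # transfer time
    energy: float     # energy consumption
    overshoot: float  # overshoot of key parameters

# Directed graph of operating conditions: not every pair of nodes is connected,
# and two nodes may be linked by several alternative directed paths.
graph: dict[int, list[Edge]] = {
    1: [Edge(2, time=10.0, energy=5.0, overshoot=0.02),
        Edge(4, time=25.0, energy=3.0, overshoot=0.01)],
    2: [Edge(3, time=8.0,  energy=4.0, overshoot=0.03),
        Edge(4, time=6.0,  energy=6.0, overshoot=0.05)],
    4: [Edge(3, time=12.0, energy=2.0, overshoot=0.01)],
}

# Each transition mode minimizes one of the three weights.
MODE_WEIGHT = {"maneuver": "time", "economic": "energy", "stable": "overshoot"}

def edge_cost(edge: Edge, mode: str) -> float:
    """Cost of traversing an edge under the selected transition mode."""
    return getattr(edge, MODE_WEIGHT[mode])
```

With this representation, changing the transition mode simply changes which edge attribute the planner minimizes, while the graph topology stays the same.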
2.1 Operating condition recognition
To address the issue of operating condition recognition in the intelligent planning of full-range control paths for a reactor, we utilize an artificial neural network (ANN) algorithm. An ANN constructs its network by using neurons as nodes. In particular, the feedforward neural network (FNN) possesses a relatively straightforward topology [23], which consists of an input layer, several hidden layers and an output layer, as shown in Fig. 1. Each layer comprises multiple neuron nodes. During the training process, the FNN adjusts its connection weights by using the backpropagation algorithm. By adjusting the parameters within the neurons, the FNN repeatedly fits the input data to the output data. Nowadays, FNNs have been applied in diverse areas such as function approximation, pattern recognition and time series prediction.
In the SMR, various sensors constantly generate data reflecting the current state of the system, including parameters such as temperature, cycle, power and pressure, amounting to dozens of parameters. In the state recognition model of the SMR, the input layer receives parameters from the various sensors. Subsequently, by setting appropriate numbers of layers and neurons in the hidden layers and determining the activation function of the neurons, the output layer provides the predicted operating condition of the system.
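As an illustration of this topology only (our sketch, not the trained network: the weights below are random, and the input/output sizes of 38 sensor parameters and 7 operating conditions are taken from the simulation setup in Section 3), the following shows how such an FNN maps one standardized sensor snapshot to a vector of operating condition scores.

```python
import numpy as np

def fnn_forward(x, weights, biases):
    """Forward pass of a simple feedforward network:
    tanh hidden layers followed by a linear output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)              # hidden layers
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ a + b_out                # linear output (condition scores)

# Hypothetical, untrained parameters: 38 sensor inputs, three hidden layers
# of 50 neurons, 7 output scores (one per operating condition).
rng = np.random.default_rng(0)
sizes = [38, 50, 50, 50, 7]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.standard_normal(38)                 # one standardized sensor snapshot
scores = fnn_forward(x, weights, biases)
predicted_condition = int(np.argmax(scores)) + 1   # label in 1..7
```

In the actual model these weights would of course be obtained by backpropagation training on labeled sensor data, as described above.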
2.2 Path planning
According to the research status of path planning, the reinforcement learning algorithm [24], the genetic algorithm [25, 26] or the Dijkstra algorithm [27] can be used to solve path planning and combinatorial optimization in a finite directed graph. However, the complexity of the operational parameters in a reactor means that the known path planning or combinatorial optimization algorithms cannot cover the full range of needed control path planning decisions. Moreover, the large volume of computation leads to low computational efficiency. In this paper, to implement the intelligent planning decision algorithm for control paths and the optimal path control for operational status transitions efficiently, we choose a reinforcement learning path planning algorithm.
Reactor operation planning can be described as follows: when the system receives a planning request toward a target operation state, it first identifies the current state, then finds the path with the smallest weight of the planned operation mode in the transition environment, and finally transforms to the target state through the intermediate states.
Reinforcement learning is a machine learning method that maps environmental states to actions. It autonomously learns how to make correct decisions through interaction with the environment so as to maximize the expected cumulative reward. Traditionally, a reinforcement learning path planning algorithm refers to a class of algorithms that use the reinforcement learning method to solve the path planning problem, i.e., to find the optimal path from a starting point to an endpoint in a given environment.
Q-learning is a value-function-based reinforcement learning method that selects the best action by learning the value function of state-action pairs, where the state can be a position on a map and the action can be a movement such as up, down, left or right. The Q-value update formula is
$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\Big[r_{t+1}+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big],$$
where $s_t$ and $a_t$ are the current state and action, $s_{t+1}$ is the next state, $r_{t+1}$ is the immediate reward, $\alpha$ is the learning rate and $\gamma$ is the discount factor.
By interacting with the environment to seek the maximum cumulative reward, the algorithm ultimately finds the shortest path from the current node to the target node.
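The sketch below shows a tabular Q-learning planner of this kind over the condition-transition graph introduced earlier; it reuses the hypothetical Edge/edge_cost helpers from that sketch. The exploration rate and discount factor follow the values reported in Section 3.2, while the reward shaping (negative edge weight plus a goal bonus) and the larger default learning rate are our own assumptions for a self-contained toy example.

```python
import random

def q_learning_path(graph, start, goal, mode,
                    episodes=10_000, alpha=0.1, gamma=0.98,
                    epsilon=0.2, max_steps=50):
    """Tabular Q-learning over the condition-transition graph.
    Exploration rate and discount factor follow Section 3.2; the learning
    rate there is 0.001, a larger value is used so this toy example settles."""
    Q = {(s, e.target): 0.0 for s, edges in graph.items() for e in edges}

    def actions(s):
        return graph.get(s, [])

    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if s == goal or not actions(s):
                break
            # epsilon-greedy action selection
            e = (random.choice(actions(s)) if random.random() < epsilon
                 else max(actions(s), key=lambda a: Q[(s, a.target)]))
            # reward: negative edge weight under the chosen mode, plus a goal bonus
            r = -edge_cost(e, mode) + (100.0 if e.target == goal else 0.0)
            best_next = max((Q[(e.target, a.target)] for a in actions(e.target)),
                            default=0.0)
            # the Q-value update formula given above
            Q[(s, e.target)] += alpha * (r + gamma * best_next - Q[(s, e.target)])
            s = e.target

    path, s = [start], start               # greedy rollout of the learned policy
    while s != goal and actions(s):
        s = max(actions(s), key=lambda a: Q[(s, a.target)]).target
        if s in path:
            break                          # guard against an under-trained table
        path.append(s)
    return path
```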
The reinforcement learning path planning algorithm based on lattice theory requires the definition of a partially ordered set and applies it to the path graph. The algorithm draws on the definition of a lattice by treating each optimal path as a lattice. Through the definition of a sub-lattice, the algorithm decomposes the optimal paths obtained from each reinforcement learning iteration into a collection of sub-lattices for storage.
A partial order is a binary relation on a set that allows elements within the set to be compared. However, it is not required that all elements in the set be comparable in terms of magnitude. The definitions of partially ordered set, lattice and sub-lattice can be found in Ref. [28]. Given a graph or path, with the starting point as the source and the endpoint as the sink, if the relation of proximity to the endpoint satisfies reflexivity, anti-symmetry and transitivity, then the set of all nodes in this path together with this relation forms a partially ordered set.
Each path obtained from the reinforcement-learning-based path planning algorithm can be described as a lattice. For example, consider the set of nodes X = {1, 2, 3, 4, 5}. Assume that the starting point is node 1 and the endpoint is node 5. By the definition of partial order, the least upper bound and the greatest lower bound of any two nodes belong to the set, and this structure can be regarded as a lattice. Meanwhile, node sets such as X1 = {1, 2, 3, 4}, X2 = {2, 3, 4, 5} and X3 = {2, 3, 4} are sub-lattices.
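The following toy check (ours, not from the paper) illustrates the claim for this example: ordering the nodes of the path 1-2-3-4-5 by proximity to the endpoint gives a chain, so the join and meet of any two nodes are simply the later and earlier node along the path, and contiguous sub-paths are closed under both operations, i.e., they are sub-lattices.

```python
# Toy check for the path 1-2-3-4-5 with endpoint 5.
path = [1, 2, 3, 4, 5]
rank = {node: i for i, node in enumerate(path)}   # position along the path

def join(a, b):   # least upper bound: the node closer to the endpoint
    return a if rank[a] >= rank[b] else b

def meet(a, b):   # greatest lower bound: the node farther from the endpoint
    return a if rank[a] <= rank[b] else b

# Contiguous sub-paths are closed under join and meet, i.e. they are sub-lattices.
for sub in ([1, 2, 3, 4], [2, 3, 4, 5], [2, 3, 4]):
    assert all(join(a, b) in sub and meet(a, b) in sub for a in sub for b in sub)
```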
When a path is obtained in the reinforcement learning process within the action space, the corresponding sub-lattice set is decomposed, and each sub-lattice is stored in accordance with specific requirements. When a corresponding requirement arises in a later task, reinforcement learning is no longer repeated and the path with the smallest weight is directly output from the stored path set.
According to lattice theory, reinforcement-learning-based path planning can be improved. Here we propose a reinforcement learning path planning algorithm based on lattice theory, referred to below as Algorithm 2.1.
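Based on this description, a hedged sketch of how Algorithm 2.1 might operate is given below: cached sub-lattice paths are looked up first, and reinforcement learning is invoked only when no stored sub-path covers the demand. The class, method and cache names are ours, not the authors', and the sketch builds on the q_learning_path and edge_cost helpers introduced earlier.

```python
from collections import defaultdict

class LatticePlanner:
    """Sketch of a lattice-based reinforcement learning path planner."""

    def __init__(self, graph):
        self.graph = graph
        self.cache = defaultdict(list)   # (start, goal) -> known paths (sub-lattices)

    def _store_sublattices(self, path):
        """Decompose a learned path into all contiguous sub-paths and store them."""
        for i in range(len(path) - 1):
            for j in range(i + 1, len(path)):
                sub = path[i:j + 1]
                if sub not in self.cache[(sub[0], sub[-1])]:
                    self.cache[(sub[0], sub[-1])].append(sub)

    def _path_weight(self, path, mode):
        """Total weight of a path under the requested transition mode."""
        return sum(min(edge_cost(e, mode) for e in self.graph[s] if e.target == t)
                   for s, t in zip(path, path[1:]))

    def plan(self, start, goal, mode):
        if not self.cache[(start, goal)]:
            # No stored sub-lattice covers this demand: invoke reinforcement
            # learning once, then decompose the learned path and store all of
            # its contiguous sub-paths for later reuse.
            self._store_sublattices(q_learning_path(self.graph, start, goal, mode))
        candidates = self.cache[(start, goal)]
        if not candidates:                  # RL rollout failed to reach the goal
            raise RuntimeError(f"no path found from {start} to {goal}")
        # Subsequent demands hitting a stored (start, goal) pair skip the RL
        # stage and directly return the smallest-weight path for the mode.
        return min(candidates, key=lambda p: self._path_weight(p, mode))
```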
Consider, for example, an environmental map with four nodes. If there is a requirement from node 1 to node 3, the algorithm generates the paths [1,2,3], [1,2,4,3] and [1,4,3]; in addition, the sub-path [2,4], which lies outside the initial requirement, is obtained. If a path from node 2 to node 4 is required in the future, path planning based on the traditional algorithm no longer needs to be invoked. If the next demand is from node 1 to node 4, reinforcement learning generates the paths [1,4], [1,2,4] and [1,2,3,4], and the path [2,3,4], which is outside the requirement, is also generated. This enriches the path collection from node 2 to node 4, gradually approaching the optimal solution.
A traditional path planning algorithm cannot always generate the optimal solution, especially when the number of training episodes is lower than the number of possible actions in the action space. On the other hand, conducting reinforcement learning on other demands while simultaneously generating sub-lattice collections for existing demands can gradually enrich the path results for all demands.
In comparison to the traditional path planning algorithm, the reinforcement learning path planning algorithm based on lattice theory has two advantages.
(i) Faster calculation speed. After a certain requirement has been completed, its result can be directly accessed at any time, without the need to run the traditional path planning algorithm to find the optimal path every time.
(ii) Self-learning capability. When a specific demand is processed, the algorithm also plans paths for all sub-lattices corresponding to that demand.
3 Simulation analysis
3.1 Operating condition recognition
The simulation uses MATLAB as the basic framework and is developed based on MATLAB 2021a. The simulation data for operating condition recognition were provided by the project (SCU&DRSI-LHCX-6) of this fund (Joint Innovation Fund of Sichuan University and Nuclear Power Institute of China). The simulation system generates 10 data points per second, resulting in a total of 291 084 time-series data points. Each data point includes 38 sensor parameters, including temperature, pressure, power and cycle. Additionally, each data point is associated with a label derived from expert experience. These operating condition labels represent seven actual operating conditions in an SMR, such as "startup", "self-sustaining" and "low power", denoted as "operating condition 1" to "operating condition 7", respectively. In the original data, "operating condition 1" to "operating condition 7" contain 1853, 1546, 18 184, 154 251, 28 200, 35 799 and 51 251 labeled data points, respectively. These simulated data are used for neural network training and testing.
In this paper, the following operations are carried out on the data in sequence.
(i) The parameter data are standardized and randomly shuffled.
(ii) 80% of the data samples are used as the training set and 20% as the test set.
(iii) The SMOTE algorithm is used to increase the data of "operating condition 1" and "operating condition 2" in the training set to the average level.
(iv) Setting of the FNN. The network has three hidden layers of 50 neurons each (50×50×50); the hidden layer activation function is "tansig" and the output layer activation function is "purelin". The training algorithm is SCG (scaled conjugate gradient) and the performance function is MSE. The number of iterations is 1500 epochs, and the computation is performed on a GPU.
(v) The test set is fed into the trained FNN for testing (an illustrative sketch of steps (i)-(v) is given below).
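The following Python analogue of steps (i)-(v) is given for illustration only; the authors' experiments were carried out in MATLAB 2021a, so the scikit-learn/imbalanced-learn calls and the sensor_data/labels variables here are stand-ins rather than the actual implementation (in particular, scikit-learn's MLPClassifier does not provide the scaled conjugate gradient trainer, and default SMOTE balances every minority class to the majority class rather than to the average level).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

def train_condition_recognizer(sensor_data: np.ndarray, labels: np.ndarray):
    """sensor_data: (n_samples, 38) sensor readings; labels: condition 1..7."""
    # (i) standardize, (ii) shuffled 80/20 train-test split
    X = StandardScaler().fit_transform(sensor_data)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, shuffle=True, random_state=0)

    # (iii) oversample the rare classes in the training set with SMOTE
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

    # (iv) three hidden layers of 50 neurons, tanh activations
    #      (stand-in for MATLAB's 'tansig'/'purelin' with SCG training)
    net = MLPClassifier(hidden_layer_sizes=(50, 50, 50), activation='tanh',
                        max_iter=1500, random_state=0)
    net.fit(X_tr, y_tr)

    # (v) evaluate on the held-out test set
    return net, net.score(X_te, y_te)
```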
According to the simulation, the total recognition accuracy is 99.9811%, which is a very good recognition effect. The convergence results are shown in Fig. 3. The mean square error (MSE) of the training set gradually decreases as the training progresses. With the gradual learning and optimization of the algorithm, the best MSE performance is reached after 1500 training epochs, which shows a good convergence effect.
The recognition results are shown in Fig. 4, which is a multi-classification confusion matrix representing the recognition effect of the FNN. The horizontal and vertical coordinates represent the 7 operating condition types, with rows representing predicted labels and columns representing real labels. In Fig. 4, the values are primarily concentrated along the diagonal of the confusion matrix, with only a few scattered values elsewhere, indicating good operating condition recognition ability.
Tab. 1 lists the evaluation indicators of the various operating conditions in terms of precision, accuracy, recall and F1-score. The closer an indicator is to 1, the better the recognition effect. As shown in Tab. 1, all indicators are greater than 0.97. In particular, every evaluation index of "operating condition 5" and "operating condition 6" equals 1, indicating a strong recognition ability.
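For reference, the per-class indicators reported in Tab. 1 follow the usual definitions in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):
$$\mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},$$
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad F_1=\frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.$$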
In conclusion, using the FNN to identify the current operating condition is quite effective and offers a basis for the subsequent operating condition transfer.
3.2 Path planning simulation
In nuclear reactor control, the complex working environment, long transfer times and variable transfer conditions can cause the demands to change in real time during the engineering conversion process. Based on the operating condition recognition, we resort to the reinforcement learning path planning algorithm based on lattice theory to plan the operating condition transitions.
The simulation is also developed based on MATLAB 2021a. Five typical operating conditions are set, as shown in Fig. 5. The figure contains 5 nodes and 20 paths; each connection corresponds to three weights, and the weights in the simulation are random numbers.
In the simulation, we set the exploration rate in reinforcement learning to 0.2, the discount factor to 0.98, the learning rate to 0.001, and the number of training episodes to 10 000. Assume that we have several operating condition transfer tasks in succession, that is,
Task 1: Node 1 is converted to node 5;
Task 2: Node 4 is converted to node 2;
Task 3: Node 3 is converted to node 5.
The reinforcement learning path planning algorithm based on lattice theory and the traditional reinforcement learning path planning algorithm are simulated under the same settings; a usage sketch based on the hypothetical planner from Section 2.2 is given below.
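A hypothetical replay of these three tasks with the LatticePlanner sketch could look as follows. The five-node, twenty-connection environment with random weights stands in for Fig. 5, and the mode chosen for each call is illustrative rather than taken from the paper.

```python
import random

# Hypothetical stand-in for the Fig. 5 environment: 5 nodes, 20 directed
# connections, each carrying three random weights (time, energy, overshoot).
random.seed(0)
graph5 = {s: [Edge(t, *(random.uniform(1, 10) for _ in range(3)))
              for t in range(1, 6) if t != s]
          for s in range(1, 6)}

planner = LatticePlanner(graph5)
path1 = planner.plan(1, 5, mode="maneuver")   # Task 1: node 1 -> node 5
path2 = planner.plan(4, 2, mode="economic")   # Task 2: node 4 -> node 2
path3 = planner.plan(3, 5, mode="economic")   # Task 3: node 3 -> node 5

# Sub-paths stored while solving an earlier task (e.g. a 2 -> 5 segment found
# during Task 1) are reused directly, so later calls can skip the Q-learning stage.
```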
According to Algorithm 2.1, when the new map is run for the first time, the reinforcement learning algorithm is invoked to find the path, so the running times of the two algorithms are basically the same. In Tab. 2, the next planned node is node 2, and when the demand changes to the economic mode, the running time of the reinforcement learning path planning algorithm based on lattice theory is much lower than that of the traditional reinforcement learning path planning algorithm, because the path from node 2 to node 5 was already planned by the lattice-based algorithm in the previous demand.
In Tab. 3, Task 2 is executed after Task 1. Since Task 1 does not generate the path required by Task 2, Algorithm 2.1 must invoke reinforcement learning to find the required path, and the running times of the two algorithms are basically the same.
In Tab. 4, Task 3 is executed after Task 2: the system first switches to node 4 in maneuver mode and then switches to node 5 in economic mode. The simulation results show that the running time of Algorithm 2.1 is much lower than that of the traditional reinforcement learning path planning algorithm.
The above simulation results indicate that, compared with the traditional reinforcement learning path planning algorithm, Algorithm 2.1 does not require invoking the reinforcement learning algorithm for each demand, which results in lower computational complexity, reduced running time and better alignment with the path planning requirements of engineering nuclear reactors. When the number of nodes is particularly large and the number of training episodes in reinforcement learning is much lower than the size of the action space, the path results converge closer and closer to the global optimum as the algorithm runs more times, and the self-learning advantage of Algorithm 2.1 becomes increasingly evident.
4 Conclusions
The development of SMRs that can adapt to various unmanned environments is of great strategic significance, and further research aimed at the safety, reliability and intelligence of SMRs is urgently needed. In this paper, an FNN is used to endow the reactor with state cognitive analysis ability, and the reinforcement learning path planning algorithm based on lattice theory is used to endow the reactor with task planning and decision-making ability. According to the simulation results, the data generated by the reactor sensors and the expert knowledge base labels are used to train and simulate the FNN, which achieves strong recognition capability even in the reactor's complex system with strong coupling and nonlinearity.
The operating condition conversion relation is described as a directed graph. By combining lattice theory with the traditional reinforcement learning path planning algorithm, a reinforcement learning path planning algorithm based on lattice theory is proposed. The simulation results indicate that, in comparison with the traditional reinforcement learning path planning algorithm, the new algorithm has lower complexity and better meets the technical requirements of nuclear reactor engineering.
References:
[1] Zhang W W, He Z X, Wan X S, et al. A review on the control methods in small modular reactors [J]. J Sichuan Univ: Nat Sci Ed, 2024, 61: 020001. [張薇薇, 何正熙, 萬雪松, 等. 小型模塊化反應(yīng)堆控制方法綜述 [J]. 四川大學(xué)學(xué)報(bào)(自然科學(xué)版), 2024, 61: 020001.]
[2] Chen Z, Wang P F, Liao L T, et al. Intelligent control of small pressurized water reactor [M]. Shanghai: Shanghai Jiaotong University Press, 2022. [陳智, 王鵬飛, 廖龍濤, 等. 小型壓水反應(yīng)堆智能控制 [M]. 上海: 上海交通大學(xué)出版社, 2022.]
[3] Zhang B W. Research on key technologies of autonomous control for small modular reactor [D]. Harbin: Harbin Engineering University, 2020. [張博文. 小型模塊化反應(yīng)堆自主控制關(guān)鍵技術(shù)研究 [D]. 哈爾濱: 哈爾濱工程大學(xué), 2020.]
[4] Wang Q Q, Yin C C, Sun X J, et al. PID design and simulation of TMSR nuclear power control system [J]. Nucl Tech, 2015, 38: 58. [汪全全, 尹聰聰, 孫雪靜, 等. TMSR核功率控制系統(tǒng)的PID設(shè)計(jì)與仿真 [J]. 核技術(shù), 2015, 38: 58.]
[5] Wang X K, Yang X H, Liu G, et al. Adaptive neuro-fuzzy inference system PID controller for SG water level of nuclear power plant [C]//2009 International Conference on Machine Learning and Cybernetics. Piscataway: IEEE, 2009.
[6] Yong E L. Robust H∞ control approach to steam generator water level of CPR1000 nuclear power plant [D]. Harbin: Harbin Engineering University, 2014. [雍二磊. CPR1000核電廠SG水位魯棒H∞控制方法研究 [D]. 哈爾濱: 哈爾濱工程大學(xué), 2014.]
[7] Bartlett E B, Uhrig R E. Nuclear power plant status diagnostics using an artificial neural network [J]. Nucl Technol, 1992, 97: 272.
[8] Gernoth K A, Clark J W, Prater J S, et al. Neural network models of nuclear systematics [J]. Phys Lett B, 1993, 300: 1.
[9] El-Sefy M, Yosri A, El-Dakhakhni W, et al. Artificial neural network for predicting nuclear power plant dynamic behaviors [J]. Nucl Eng Technol, 2021, 53: 3275.
[10] Santosh T V, Vinod G, Saraf R K, et al. Application of artificial neural networks to nuclear power plant transient diagnosis [J]. Reliab Eng Syst Safe, 2007, 92: 1468.
[11] Ruan D, van der Wal A J. Controlling the power output of a nuclear reactor with fuzzy logic [J]. Inform Sciences, 1998, 110: 151.
[12] Nelson W R. REACTOR: An expert system for diagnosis and treatment of nuclear reactor accidents [J]. AAAI, 1982, 82: 296.
[13] Zeng W, Jiang Q, Liu Y, et al. Core power control of a space nuclear reactor based on a nonlinear model and fuzzy-PID controller [J]. Prog Nucl Energ, 2021, 132: 103564.
[14] Liu C, Peng J F, Zhao F Y, et al. Design and optimization of fuzzy-PID controller for the nuclear reactor power control [J]. Nucl Eng Des, 2009, 239: 2311.
[15] Wu P, Liu D C, Zhao J, et al. Fuzzy adaptive PID control-based load following of pressurized water reactor [J]. Power Sys Techno, 2011, 35: 76. [吳萍, 劉滌塵, 趙潔, 等. 基于模糊自適應(yīng)PID控制的壓水堆負(fù)荷跟蹤 [J]. 電網(wǎng)技術(shù), 2011, 35: 76.]
[16] Du Z Y, Ma Y G, Zhong R C, et al. Analysis of start-up characteristics of heat pipe reactor [J]. Nucl Power Eng, 2023, 44: 67. [杜政璃, 馬譽(yù)高, 鐘睿誠, 等. 熱管反應(yīng)堆啟堆特性分析 [J]. 核動(dòng)力工程, 2023, 44: 67.]
[17] Yuan X B, Peng J Q, Zhang B H, et al. Analysis of fission product source term release and decay heat under pressurized water reactor rupture accident [J]. Nucl Tech, 2024, 47: 149. [袁顯寶, 彭玨欽, 張彬航, 等. 壓水堆破口事故下裂變產(chǎn)物源項(xiàng)釋放及衰變熱分析 [J]. 核技術(shù), 2024, 47: 149.]
[18] Zhan J X, Zheng Y T, Huang S L, et al. Study on feed and bleed based on SBLOCA of HPR1000 [J]. Nucl Sci Eng, 2024, 44: 142. [詹經(jīng)祥, 鄭云濤, 黃樹亮, 等. "華龍一號"小破口事故充排研究 [J]. 核科學(xué)與工程, 2024, 44: 142.]
[19] Wang K, Yang J K, Zhao P C. Uncertainty analysis of loss of coolant flow accident in lead-bismuth reactor using subchannel code [J]. Nucl Tech, 2024, 47: 121. [王凱, 楊俊康, 趙鵬程. 基于子通道程序的鉛鉍反應(yīng)堆失流事故不確定性分析 [J]. 核技術(shù), 2024, 47: 121.]
[20] Mena P, Borrell K A, Kerdy L. Expanded analysis of machine learning models for nuclear transient identification using TPOT [J]. Nucl Eng Des, 2022, 390: 111694.
[21] Zubair M, Akram Y. Utilizing MATLAB machine learning models to categorize transient events in a nuclear power plant using generic pressurized water reactor simulator [J]. Nucl Eng Des, 2023, 415: 112698.
[22] Moshkbar-Bakhshayesh K, Ghofrani M B. Transient identification in nuclear power plants: A review [J]. Prog Nucl Energ, 2013, 67: 23.
[23] Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks [J]. Chemometr Intell Lab, 1997, 39: 43.
[24] Xiong W B, Guo L, Jiao T Y. A multi-agent path planning algorithm based on game theory and reinforcement learning [J]. J Shenzhen Univ Sci Eng, 2024, 41: 274. [熊文博, 郭磊, 焦彤宇. 基于博弈論與強(qiáng)化學(xué)習(xí)的多智能體路徑規(guī)劃算法 [J]. 深圳大學(xué)學(xué)報(bào)(理工版), 2024, 41: 274.]
[25] Zhou R, Long W, Li Y Y, et al. Research on AGV path planning for green remanufacturing systems [J]. J Sichuan Univ: Nat Sci Ed, 2019, 56: 883. [周潤, 龍偉, 李炎炎, 等. 面向綠色再制造系統(tǒng)的AGV路徑規(guī)劃研究 [J]. 四川大學(xué)學(xué)報(bào)(自然科學(xué)版), 2019, 56: 883.]
[26] Xia Q, Lei Y, Ye X Y. The application of genetic algorithm in global optimal path searching of AGV [J]. J Sichuan Univ: Nat Sci Ed, 2008, 45: 1129. [夏謙, 雷勇, 葉小勇. 遺傳算法在AGV全局路徑優(yōu)化中的應(yīng)用 [J]. 四川大學(xué)學(xué)報(bào)(自然科學(xué)版), 2008, 45: 1129.]
[27] Zhou Z, Long H, Li S, et al. Electric vehicles charging path planning method based on road network under multi parameters [J]. J Sichuan Univ: Nat Sci Ed, 2024, 61: 229. [周箏, 龍華, 李帥, 等. 多因素下基于路網(wǎng)拓?fù)涞碾妱?dòng)汽車充電路徑規(guī)劃策略 [J]. 四川大學(xué)學(xué)報(bào)(自然科學(xué)版), 2024, 61: 229.]
[28] Zhang Z L, Zhang J L, Ying Y Q, et al. Fuzzy set theory and methods [M]. Wuhan: Wuhan University Press, 2010. [張振良, 張金玲, 殷允強(qiáng), 等. 模糊集理論與方法 [M]. 武漢: 武漢大學(xué)出版社, 2010.]
(Editor in charge: Zhou Xingwang)