Gang Li, Jingbo Miao, Zihou Wang, Yanni Han, Hongyan Tan, Yanwei Liu, Kun Zhai
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China
3 National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
4 Sunwise Intelligent Technology Company, Beijing 100190, China
* Corresponding author, email: wangzh@cert.org.cn
Abstract: Mobile edge computing (MEC) runs cloud servers at the edge of a mobile network, which can effectively reduce network communication delay. However, because of the numerous edge servers and devices in MEC, multiple servers and devices may be able to serve the same user simultaneously. This paper proposes a user-side adaptive service deployment algorithm, ASD (Adaptive Service Deployment), based on reinforcement learning. Without relying on complex system information, and knowing only a few task and user attributes, it can make effective service deployment decisions; it analyzes and redefines the key parameters of existing algorithms and dynamically adjusts its strategy according to task types and available node types to optimize user-experienced delay. Experiments show that the ASD algorithm can realize user-side decision-making for service deployment. While effectively improving the parameter settings of the traditional Multi-Armed Bandit algorithm, it reduces user-perceived delay and enhances service quality compared with other strategies.
Keywords: edge computing; adaptive algorithm; reinforcement learning; computing offloading; service deployment
With the widespread use of smart devices and the development of Internet of Things technology, the complexity of the computing tasks that mobile devices need to handle continues to increase [1]. In the traditional cloud computing framework, users only consume data delivered from the cloud. However, with the rapid increase in the amount of data generated by edge terminals and in their computing requirements, network bandwidth has become a bottleneck that limits the efficiency of the entire computing network [2]. Mobile edge computing (MEC) runs cloud servers at the edge of a mobile network. It has high computing power and decentralizes the service provision capabilities originally concentrated in the network center to the edge network closer to users, relying on the combination of small edge computing platforms and the mobile edge network to enhance the user experience [3]. However, because of the large number of edge servers and devices in MEC, multiple servers and devices may be able to serve the same user simultaneously. In addition, when the user moves, the computing resources used by related applications may switch between multiple edge nodes.

Computing offloading is one of MEC's key technologies [4, 5]; it migrates computing-intensive tasks from mobile devices to MEC servers. Existing research can be roughly divided into two categories: strategies based on global system information and designs based on reinforcement learning. Table 1 lists the related classic works with their corresponding environmental settings and concerns.

In service strategies based on global system information, each node in the MEC network has accurate system-level information. Wang et al. [6] used user location distribution, preferences, system load, database location, and other information to predict the future data transfer, data processing, and service session transfer overhead. Yang L et al. [7] predicted the load that the user will request in the future based on the user's mobility pattern and other conditions. Nadembega A et al. [8] proposed a mobility-based service migration prediction framework, MSMP, and planned the data transmission sequence from the mobile data center to the user according to the user's mobility pattern. Ouyang T et al. [15] expressed the decision status of each node in different time slots as graph nodes and the service migration cost as edge weights. These works require mastering or predicting accurate future user mobility from historical information; however, in real environments users often do not follow a fixed mobility pattern, and it is challenging to predict their movement trends. Taleb T [9] used a two-dimensional Markov decision process to analyze the service migration overhead. Ouyang T et al. [10] used Lyapunov optimization to transform the long-term optimization problem into a series of real-time optimization problems. These methods rely on global system information, but edge base stations and ordinary users cannot obtain complete system information at the edge of the network and execute user-side decisions for service deployment.
Table 1. Summary of two types of classic works for computing offloading.
Strategies based on reinforcement learning transform the MEC service deployment problem into a scenario similar to the Multi-Armed Bandit (MAB) problem. Kao YH et al. [11] made assumptions about the device and channel conditions and quickly learned the unpredictable resource availability and channel information in the task's dynamic environment. Dai P et al. [13] proposed a method called UL (Utility-table based Learning), in which the Multi-Armed Bandit formulation treats the MEC servers in the transportation network as both the players and the arms to be selected. Sun Y et al. [12] combined MAB theory and the UCB algorithm to develop a learning-based task offloading framework. These works ignore that the status of the selected objects and the users' needs change over time.

Li L et al. [16] proposed the contextual bandit algorithm LinUCB to solve Yahoo!'s personalized news recommendation problem. In this algorithm, the expected return of each object is a linear function of the object's feature vector; after each selection, the parameters of the linear function are updated according to the observed return, so the selection strategy is adjusted dynamically. G. Nikolov et al. [14] proposed an improved version of the LinUCB algorithm and applied it to wireless interface selection, converting channel quality parameters into estimated data rates before performing the selection. Li T et al. [17] proposed a privacy-protecting task offloading scheme in MEC that formulates the task assignment and privacy protection problems as semi-parametric contextual Multi-Armed Bandit problems and then designs PAOTO (Privacy-Aware Online Task Offloading) based on the Thompson sampling architecture. The algorithm improves delay and energy consumption toward the optimum without requiring system-level information, thereby protecting privacy.
Figure 1. ASD user service deployment structure.
This paper proposes a user-side adaptive service deployment strategy. Based on the Multi-Armed Bandit algorithm in reinforcement learning, it can execute effective service deployment decisions.
The continuous timeline is discretized into time slots $t \in \{1, 2, 3, \ldots, T\}$. The user chooses an MEC server to provide computing services; that is, a computing offloading task occurs. The process can be divided into four steps: 1) the mobile user makes a task offloading or migration decision based on its environment; 2) the task-related data is sent to the MEC server; 3) the MEC server completes the calculation task; 4) the MEC server returns the calculation result to the mobile user.
The vector $w_t=[w_t^1,\ldots,w_t^M,w_t^c,w_t^l]$ represents the dynamic service deployment decision of time slot t. Here $w_t^i$ ($i=1,\ldots,M$), $w_t^c$, and $w_t^l$ respectively indicate whether the i-th MEC node, the cloud computing center, or the user's own device is selected to perform the computing task of slot t. $\mathcal{M}=\{1,\ldots,M\}$ denotes all MEC servers that can provide computing services, c denotes the cloud computing center, l denotes the user's local device, and $\mathcal{N}=\mathcal{M}\cup\{c,l\}$ denotes all computing nodes that can perform the current task. Because the user can select only one node in each time slot, the decision vector is subject to the following restriction:
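A natural formalization of this restriction (the notation here is assumed, consistent with the definitions above) is:
\[ \sum_{i=1}^{M} w_t^i + w_t^c + w_t^l = 1, \qquad w_t^i,\, w_t^c,\, w_t^l \in \{0,1\}. \]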
In the mobile edge computing network architecture, it is generally believed that the user experience delay depends on the calculation delay and the communication delay [10, 18]. Considering that this research scenario involves task migration, the additional overhead of task migration also needs to be counted. Figure 1 shows the stage at which each delay considered in the service deployment scenario occurs. We redefine these parameters to improve the classic LinUCB algorithm. Note that this way of solving the problem through parameterization is not limited to this scenario.
2.2.1 Calculation Delay
$\lambda_t$ represents the computation amount of the offloaded task in time slot t, and $c_t^i$ represents the computing capacity of node i available in time slot t, that is, the number of basic instruction operations it can complete per second. Given the service deployment decision $w_t$ of time slot t, the calculation delay can be expressed as:
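A form consistent with the definitions above, where the delay is incurred only at the selected node, is:
\[ d_t^{\mathrm{comp}} = \sum_{i\in\mathcal{N}} w_t^i\,\frac{\lambda_t}{c_t^i}. \]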
Here $\lambda_t = v_t n_t$, where $v_t$ indicates the amount of data involved in the user's task and $n_t$ indicates its computational complexity.
2.2.2 Communication Delay
The communication delay consists of two parts: when the user decides to offload the task to an external computing node, an access delay is incurred; and if the service is not deployed on the server the user accesses, an additional transmission delay is incurred by communication between servers.
Given the service deployment decision $w_t$ of time slot t, the communication delay perceived by the user can be further expressed as:
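A form consistent with the surrounding definitions, assuming local execution incurs no communication delay, is:
\[ d_t^{\mathrm{comm}} = \sum_{i\in\mathcal{N}} w_t^i\, g_{i,l_t}^t, \qquad g_{l,l_t}^t = 0. \]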
$g_{i,l_t}^t$ represents the communication delay perceived by the user, where $l_t$ denotes the base station the user connects to in time slot t, which depends on which base station's coverage area contains the user's real-time location.
2.2.3 Migration Overhead Delay
Users are mobile, and their connected base stations change as they move, which may be accompanied by the migration of computing tasks. The migration overhead delay can be expressed as:
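A form consistent with the description, assuming no overhead when the serving node does not change, is:
\[ d_t^{\mathrm{mig}} = \sum_{j\in\mathcal{N}}\sum_{i\in\mathcal{N}} w_{t-1}^j\, w_t^i\, f_{j,i}^t, \qquad f_{i,i}^t = 0. \]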
Here the comprehensive parameter $f_{j,i}^t$ represents the migration overhead delay incurred when the user selected node j in time slot t-1 and selects node i in time slot t, i.e., the comprehensive overhead of migrating the task from node j to node i.
2.2.4 Optimization Goal
The final optimization problem can be seen as finding the minimum value of the weighted sum of three delays within a given limited time range T:
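Written out under the per-slot selection constraint, a consistent form of the objective is:
\[ \min_{\{w_t\}_{t=1}^{T}} \;\sum_{t=1}^{T}\Big(\omega_t^1\, d_t^{\mathrm{comp}} + \omega_t^2\, d_t^{\mathrm{comm}} + \omega_t^3\, d_t^{\mathrm{mig}}\Big), \]
where $d_t^{\mathrm{comp}}$, $d_t^{\mathrm{comm}}$, and $d_t^{\mathrm{mig}}$ denote the three delay components defined above (the component symbols are assumed notation).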
Here $\omega_t^1$, $\omega_t^2$, and $\omega_t^3$ are the dynamic weights of the calculation delay, communication delay, and migration overhead respectively; they can be adjusted according to the user's optimization preferences and the task's operating requirements.
The expectation of the delay feedback $r_t^i$ generated by each node is a linear function of the context feature vector $x_t^i$ associated with executing the task, namely:
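This is the standard linearity premise of contextual bandit algorithms such as LinUCB:
\[ \mathbb{E}\big[r_t^i \mid x_t^i\big] = (\theta_i^*)^{\mathsf T} x_t^i. \]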
Here $\theta_i^*$ is the parameter vector of the node, a quantity that the algorithm cannot observe and must estimate. The ASD algorithm is an improvement of the LinUCB algorithm, so this premise also needs to hold. To satisfy this assumption, we first need to define $\theta_i^*$ and $x_t^i$ according to the experimental scenario; this process can also be understood as feature selection over the contextual information. From the user experience delay formula in the previous section, the user experience delay $d_t$ caused by selecting node i in time slot t can be expressed as:
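Combining the three delay components for the single selected node i, a plausible form (with the weights defined in the previous section) is:
\[ d_t = \omega_t^1\,\frac{\lambda_t}{c_t^i} + \omega_t^2\, g_{i,l_t}^t + \omega_t^3\, f_{j,i}^t. \]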
$c_t^i$ represents the computing power of node i in time slot t; $g_{i,l_t}^t$ represents the communication delay incurred when the user, located in the coverage area of base station $l_t$ in time slot t, accesses MEC server i; and $f_{j,i}^t$ represents the service migration overhead incurred when the user selected node j in time slot t-1 and selects node i in time slot t.
Using the quantities the algorithm can observe accurately, namely $\lambda_t$, the position of the user in time slot t, and the node j selected in time slot t-1, the context feature vector $x_t^i$ is defined as follows:
Here $\lambda_t$ indicates the computation requirement of time slot t, $l_t$ represents the user's real-time location, and $k_t$ represents the service node selected by the user in the previous time slot. Transforming the context information into these parameters satisfies the theoretical premise for applying the contextual Multi-Armed Bandit algorithm to the service deployment scenario.
It is worth noting that the decisions made by the Multi-Armed Bandit algorithm can only approach the optimum but cannot reach the theoretical optimum, because the user experience delay $r_t^i$ caused by selecting node i in time slot t cannot be predicted exactly: although a theoretically optimal node exists, the algorithm cannot obtain the true value of $\theta_i^*$ and can only estimate it as $\theta_t^i$, yielding the estimated user experience delay $\hat{r}_t^i = (\theta_t^i)^{\mathsf T} x_t^i$ for each node. The difference between the selected node n and the actual best node after each decision is defined as the regret value $\Delta_t$ of this selection:
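One plausible form, with $i_t^*$ denoting (under assumed notation) the theoretically optimal node of slot t, is:
\[ \Delta_t = r_t^{\,n} - r_t^{\,i_t^*}. \]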
$\Delta_t$ measures the gap between this choice and the actual optimal choice; the smaller the value, the better the choice. For a long-term decision sequence of T time slots, the cumulative regret value $R_T$ is used to measure the effectiveness of the entire decision sequence.
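Consistent with this description, the cumulative regret is the sum of the per-slot regrets:
\[ R_T = \sum_{t=1}^{T} \Delta_t. \]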
On the premise that $r_t^i$ and $x_t^i$ satisfy the linear relationship above, the ASD algorithm calculates the index value $P_t^i$ of each available node in each time slot:
In $F_i$, we use the estimated node parameter vector $\theta_t^i$ and the current context feature vector $x_t^i$ of time slot t, namely:
Here, $\theta_t^i$ is calculated from the historical information matrix $D_i$ and the historical environment feedback vector $c_i$:
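Consistent with the LinUCB recipe, this is the ridge-regression estimate:
\[ \theta_t^i = A_i^{-1} b_i. \]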
Here $A_i = D_i^{\mathsf T} D_i + I$ and $b_i = D_i^{\mathsf T} c_i$, and the historical information matrix $D_i$ consists of the context vectors $x_t^i$ of the previous m time slots in which node i was selected as the task execution node.
The term $S_i$ aims to reduce the long-term cumulative user experience delay and uses the node type $y_i$, which is positively related to the node's computing power. From the context vector $x_t^i$, we can obtain the amount of calculation $\lambda_t$ required by the current task. To evaluate the level of the current calculation amount, the historical information matrix $D_i$ can be used to calculate the historical average calculation amount $\bar{\lambda}$ by Equation (14). When $\lambda_t$ exceeds $\bar{\lambda}$, the calculation amount of the current task is relatively high; at this time, the probability of selecting a node with strong computing capability should be increased.
The parameter $\alpha$ is replaced with the type impact factor $u_i$, which is defined as:
Because the goal of the strategy is to obtain the lowest environmental feedback value, the node with the smallest index value is selected in each time slot t; therefore the sign of the original exploration term is reversed, giving $S_i$:
On this basis, the algorithm also considers the influence of the total number of time slots T and the total number of context information types Z on the decision result. Suppose the context information contains n types of features and each feature type i has $Z_i$ possible values. For example, if there are $Z_1$ kinds of computing tasks, $Z_2$ geographic locations accessible to users, and $Z_3$ kinds of service nodes, then the total number of context information types Z is defined as:
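A plausible definition consistent with this example, treating Z as the number of distinct context combinations, is:
\[ Z = \prod_{i=1}^{n} Z_i = Z_1 \cdot Z_2 \cdot Z_3. \]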
The ASD algorithm (shown in Algorithm 1) takes into account the influence of node types and the total number of time slots on service deployment decisions, achieving the design goal of adaptively and dynamically adjusting the service deployment strategy with context information such as task type. At the same time, there is no need to specify algorithm parameters manually, which avoids unreasonable parameter values adversely affecting the algorithm's results.
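To make the above concrete, the following Python sketch illustrates an ASD-style selection step built on the LinUCB recipe with the two modifications described in the text: the fixed exploration parameter $\alpha$ is replaced by a type impact factor, and the node with the smallest index is chosen because the learned feedback is a delay. This is a minimal illustration, not the authors' implementation; the exact form of $u_i$ is assumed here as a simple threshold rule on the historical average task size, and all names are hypothetical.

import numpy as np

class ASDNode:
    """Per-node bandit state for an ASD-style contextual strategy (hypothetical sketch)."""

    def __init__(self, dim, node_type):
        self.A = np.eye(dim)        # A_i = D_i^T D_i + I, starts as the identity
        self.b = np.zeros(dim)      # b_i = D_i^T c_i, starts at zero
        self.loads = []             # historical task sizes lambda_t seen by this node
        self.node_type = node_type  # y_i: positively related to computing power

    def index(self, x, load):
        """Return the ASD index P_t^i; the smallest index is selected."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                  # ridge-regression estimate of theta_i
        estimate = float(theta @ x)             # F_i: estimated experience delay
        avg_load = np.mean(self.loads) if self.loads else load
        u = self.node_type if load > avg_load else 1.0   # assumed form of u_i
        width = float(np.sqrt(x @ A_inv @ x))   # confidence width around the estimate
        return estimate - u * width             # S_i enters with reversed sign

    def update(self, x, delay, load):
        """Record the observed delay feedback for context x."""
        self.A += np.outer(x, x)
        self.b += delay * x
        self.loads.append(load)

def choose_node(nodes, x, load):
    """Pick the node with the smallest index value for the current task."""
    return int(np.argmin([n.index(x, load) for n in nodes]))

In use, one ASDNode would be kept per candidate node in $\mathcal{N}$; in each slot the context vector $x_t^i$ (built from $\lambda_t$, $l_t$, and $k_t$ as defined above) is passed to choose_node, and the observed delay is fed back through update.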
The operating system is Windows 10 64-bit, the processor is an Intel Core i7-6700HQ @ 2.60 GHz, the memory is 16 GB, and the Python version used is 3.6.5.
The user makes an independent service deployment decision in each time slot $t\in\{0,1,2,\ldots,T\}$ within a continuous time range T. In order to compare the effects of different strategies more intuitively, this article imposes the following basic constraints on the experimental scenario:
Constraint 1: All MEC service nodes are distributed in a 3×3 grid network; the user moves to an adjacent grid cell every 20 time slots, and the direction of movement is random.
Constraint 2: The length of a time slot is greater than the minimum time interval between two tasks; that is, the user only needs to make a task deployment decision once per time slot.
Constraint 3: The communication delay part of the user experience delay is related to the distance between the user's geographic location and the location of the service deployment node, as well as to the data volume of the task.
Figure 2. Comparison of the average delay of ε-greedy strategies with different ε values in the simplified scenario.
Constraint 4: The computing power of computing nodes changes over time; the frequency of a single node follows a certain distribution, and the frequencies of nodes of the same type are independent and identically distributed.
5.3.1 Accumulated Regret (ACR)
In a single time slot t, the theoretically best node is determined from the estimated user experience delay $\hat{r}_t^i$ generated after selecting node i, and the regret value of the single decision is calculated as:
$\Delta_t$ measures the correctness of this choice; the smaller the value, the better the choice. Accumulating the decision regret values of the first T time slots gives the accumulated regret ACR:
5.3.2 Average Delay (AVD)
The average delay is the average value of the user experience delay over the first T time slots, defined as follows:
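Consistent with this definition:
\[ \mathrm{AVD}(T) = \frac{1}{T}\sum_{t=1}^{T} d_t, \]
where $d_t$ is the user experience delay observed in time slot t.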
It expresses how the decision-making effect varies dynamically as the time slot index and the number of decisions increase. The lower the value, the closer the strategy is to the theoretically optimal strategy and the better the user experience brought by the service deployment.
5.3.3 Regret Optimization Ratio (ROR)
For all n strategies, we calculate the average regret value AVR (average regret) over the T time slot decisions of each strategy, select the largest average regret value among them as the optimization benchmark, and then quantitatively compare how much each strategy reduces the regret value of a single decision. Assuming that strategy $w'$ has the largest average regret value, $AVR(w')$, the regret optimization ratio of a strategy w is defined as:
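A plausible form, consistent with the statement below that the offline optimal strategy (whose regret is always 0) attains 100%, is:
\[ \mathrm{ROR}(w) = \frac{\mathrm{AVR}(w') - \mathrm{AVR}(w)}{\mathrm{AVR}(w')} \times 100\%. \]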
The decision regret value of the theoretically optimal offline service deployment strategy is always 0, so its optimization ratio is 100%. Therefore, the higher the value, the better the strategy's optimization effect.
We compare ASD with several existing methods: the service deployment strategy based on ε-greedy [19], the strategy based on UCB [20], the strategy based on LinUCB [16], the AUSP strategy [15], the random strategy, the follow strategy, the all-local strategy, and the all-cloud strategy. The UCB and ε-greedy algorithms represent the ideas of the classic Multi-Armed Bandit algorithm and are compared to verify the performance of the improved contextual algorithm. Comparison with the LinUCB algorithm shows that the ASD algorithm solves the problem of parameter setting. The AUSP algorithm is a similar work chosen to show our optimization performance.
Figure 3. Average delay comparison of LinUCB strategies with different α values in the complex scenario.
The setting of the α value has an important effect on the decision-making result, but the α value required to obtain the optimal decision differs across experimental environment settings. In the ε-greedy algorithm, the random exploration threshold ε is likewise not easy to specify before the experiment. In order to observe the best achievable effect of the LinUCB strategy and the ε-greedy strategy in a specific experimental scenario, it is necessary to test different α and ε values.
It can be seen from Figure 2 that as the nodes' rewards are gradually learned in the later stage of the experiment, a lower average delay can be obtained. In the algorithm comparison experiment, the ε value is set to 0.03; with 1000 time slots, this ε-greedy strategy obtains the lowest long-term average delay.
We compare the three strategies that use contextual information, using the average delay as the indicator for the parameter test; the results are shown in Figure 3. It can be seen that when the number of time slots is 1000 and α = 1.5, the LinUCB strategy obtains a lower long-term average delay and accumulated regret.
5.5.1 Simplified Scenario
In the simplified scenario, the experimental results of the Bandit-based strategies and the other strategies are shown in Figure 4.
The simple service deployment strategies have no learning or dynamic adjustment capability, so their decision feedback values remain relatively stable: the average regret value and average delay basically do not change with the number of time slots, and the accumulated regret therefore grows approximately linearly with the number of time slots.
In the simplified scenario, the three service deployment strategies using contextual information all achieve faster convergence and lower average delay and accumulated regret.
5.5.2 Complex Scenario
Figure 4. Comparison of experimental results between the ASD strategy and other strategies in the simplified scenario (top) and the complex scenario (bottom).
In the complex scenario, both the user location and the task type change over time, and the same node provides different service capabilities for users in different locations and for different task types. In this scenario, the comparison results of the ASD strategy, the UCB strategy, the ε-greedy strategy, and the four simple strategies are shown in Figure 4.
The optimization effect of the UCB strategy and the ε-greedy strategy becomes very limited, only slightly better than the random strategy. The simple strategies likewise ignore changes in the user's situation and cannot adjust dynamically. Therefore, these strategies are no longer optimal, and their average delay and accumulated regret remain at relatively high levels.
The ASD strategy has a higher average delay in the initial stage of the experiment, even higher than the random strategy, which has the worst overall effect; it requires a certain number of explorations to gradually adjust the strategy using the context information and system feedback values. After about 400 time slots, its average delay becomes the lowest and keeps a downward trend, and its overall optimization effect is the best.
Figure 5. Comparison of experimental results of the ASD strategy, LinUCB strategy, and AUSP strategy in the complex scenario.
It can be seen that the simple Multi-Armed Bandit strategies, which achieve a certain optimization effect in the simple scenario, perform much worse in a complex environment that introduces changing user and node attributes. The ASD strategy effectively avoids the adverse impact of environmental complexity on the decision-making effect, makes effective use of context information, and still maintains a good delay optimization effect, indicating that this strategy does achieve adaptive user-side service deployment. Note that the "follow" strategy is close to optimal in these experiments because of the environmental parameters. More specifically, since the overall communication delay is set relatively high, the "follow" strategy, which avoids communication delay, naturally becomes a good strategy, whereas when calculation delay dominates, the cloud strategy performs better. The key point is that their effects are entirely determined by the environment and cannot be adjusted dynamically.
The comparison results of the ASD strategy, the AUSP strategy, and the LinUCB strategy are shown in Figure 5. The LinUCB strategy with α = 1.5 has the highest average delay in the initial 100 time slots, but its average delay drops the fastest within the first 300 time slots, and its final accumulated regret and average delay are the lowest among the three strategies. The final accumulated regret and average delay of the ASD strategy are higher than those of the parameter-tuned LinUCB strategy and lower than those of the AUSP strategy, so its decision optimization effect lies between the two. Its final average delay is about 6.8% higher than the theoretical minimum average delay, which is within an acceptable range.
Table 2. Regret optimization ratio results of different strategies.
To quantitatively compare the actual regret optimization effects of the various strategies in the complex scenario, taking the theoretically optimal offline strategy and the random strategy with the worst overall effect as references, the regret optimization ratio results of the various service deployment strategies are obtained as shown in Table 2. It can be seen from the table that, in the preset complex experimental scenario, the ASD strategy based on environmental context information achieves 69.48% of the optimization effect of the offline optimal service deployment strategy based on global system information, higher than LinUCB, UCB, and the other simple service deployment strategies. Since parameter comparison tests cannot be carried out before decisions are made in a real service deployment environment, where appropriate algorithm parameters are hard to specify, the ASD strategy has practical value.
Aiming at the service deployment problem in mobile edge computing networks, this paper models the problem in terms of the parameter types of the contextual Multi-Armed Bandit problem and uses a reinforcement learning algorithm to propose an adaptive service deployment strategy, ASD, that reduces long-term user experience delay and regret. Building on the effective use of the context information and historical decision information of users and nodes in the scenario, the algorithm parameters are redefined using quantities such as node type, task type, and the total number of time slots, which removes the impact of manual parameter specification on the optimization effect. In the experimental results, indicators such as accumulated regret, average delay, and regret optimization ratio are better than those of the other algorithms. The ASD strategy handles the features of various complex scenarios and does not require additional system information. Computing the theoretically optimal schedule in larger experiments requires too much time and data storage, which provides a good starting point for discussion and further research.
ACKNOWLEDGEMENT
This work is supported in part by the Industrial Internet Innovation and Development Project "Industrial robot external safety enhancement device" (TC200H030) and the cooperation project between Chongqing municipal undergraduate universities and institutes affiliated to CAS (HZ2021015).