Dehui Wei, Jiao Zhang, Xuan Zhang, Chengyuan Huang
State Key Laboratory of Networking and Switching Technology, BUPT, 100876, China
Abstract: Congestion control (CC) has always been an important issue in networking, and research enthusiasm for it has never diminished in either academia or industry. In recent years, driven by the rapid development of machine learning (ML), the combination of reinforcement learning (RL) and CC has produced striking results. However, these complicated schemes lack generalization and are too heavyweight in storage and computation to be deployed directly on mobile devices. To address these problems, we propose Plume, a high-performance, lightweight and generalized RL-CC scheme. Plume introduces a lightweight framework that reduces overhead while preserving the original performance. Besides, Plume innovatively modifies the parameters of the reward function during retraining, so that the algorithm can be adapted to a variety of scenarios. Evaluation results show that Plume retains almost all the performance of the original model while reducing the model size and decision latency by more than 50% and 20%, respectively. Moreover, Plume performs better in some special scenarios.
Keywords: congestion control; deep reinforcement learning; lightweight; generalization
Researchers' enthusiasm for congestion control (CC) [1-8] has never faded, and papers on it are published every year. With the rapid growth of user traffic, this issue attracts widespread attention from both industry and academia. Under dynamic and complex network conditions, dynamically adjusting the transmission rate to make full use of network resources is both the function of CC and its challenge. To meet this requirement, a good CC scheme first needs to perceive the network conditions. It should consider the impact of various factors, including throughput, dynamic latency and packet loss rate, and quickly select the optimal sending rate. Second, it should apply to a wide variety of scenarios, such as highly dynamic networks, latency-sensitive applications and highly competitive links. Traditional CC schemes fail to meet these requirements. They often rely on assumptions about the network and judge congestion based on certain signals. When these signals are misleading, performance degrades sharply and convergence takes a long time. They also often require manual parameter tuning for specific scenarios to achieve good efficiency.
In recent years, with the rapid development of machine learning (ML), learning-based CC schemes have sprung up. ML is well suited to learning the complex behavior of the network and has sufficient capacity to consider the influence of multiple factors on decision-making. Learning-based algorithms can learn the mapping from historical feedback to actions directly from the environment, without manual tuning for special scenarios. Learning-based solutions can be divided into online learning and offline learning. Because they need to explore the sending rate, online learning schemes [2, 4] usually have a longer convergence time. Offline methods use pre-trained models, so their decision time is short. Among offline schemes, reinforcement learning (RL) is usually chosen because network feedback is unlabeled and CC is a sequential decision-making problem. RL actively learns the relationship between input and output and seeks a policy that maximizes the long-term cumulative reward, so it can respond quickly to changes in network conditions without being confused by noisy signals.
Existing prominent RL-based CC schemes include Remy [8], Aurora [9] and Indigo [10]. Despite the advantages described above, they are still difficult to deploy and apply in practice because of the following two problems. The first is overhead. Researchers' pursuit of performance makes RL-CC models increasingly complicated. Training and running them requires computing and memory resources that mobile devices (such as mobile phones, smart TVs and drones) cannot provide, yet recent statistics show that these mobile terminals account for a large part of client demand. The second is the lack of flexibility and generalization. Generalization here means that the algorithm can be custom designed for a scenario without excessive cost. Since the goal of RL-CC is to maximize the long-term cumulative discounted reward, current algorithms can only trade off generalizability against performance, which may cause excessive bandwidth preemption or insufficient bandwidth utilization. Moreover, they depend strongly on the training environment: when the deployment conditions deviate too far from training, performance drops considerably. And because of the overhead problem, it is difficult to retrain in the application environment. Therefore, a new adaptive scheme is still needed to address these problems.
To address the above problems, we design Plume, a high-performance, lightweight and generalized RL-based CC scheme. Inspired by the Lottery Ticket Hypothesis [11], we still use a DRL algorithm to output a CC policy. At the same time, Plume designs a pruning policy that faithfully preserves the performance of the complex model while shrinking it. Moreover, we analyze the model structure of RL-based CC in detail and provide a method for replicating the network into a smaller one, so that the model is truly reduced and the training process is accelerated. In the retraining stage, Plume innovatively explores the idea of customizing the reward function for specific scenarios to achieve generalization. Our experiments verify that Plume faithfully retains the performance of the complex scheme while reducing the model size by more than 50%, to a size that can be deployed on low-power devices, and reducing the decision latency by about 20%. Plume not only saves storage resources but also reduces operating expenses. Moreover, retraining the pruned model structure with a modified reward function takes only a small number of steps to achieve the desired tendency.
Contributions: In summary, our key contributions are:
• We analyze the advantages of RL-based CC and illustrate the obstacles to using RL-based CC in practical applications, which motivates the design of Plume, a high-performance, lightweight and generalized CC algorithm using RL.
• In Plume, we first propose a pruning method to lighten the RL model and provide a theoretical analysis of how to prune. We then make Plume generalized through customized reward functions and simple retraining.
• We evaluate Plume in a simulation framework. Results show that the pruned model retains almost all the performance of the original model while the size and decision latency are reduced by more than 50% and about 20%, respectively. Moreover, customized retraining exhibits better performance than the original model in some specific scenarios.
In what follows, we first summarize the ideas of related work and their shortcomings in §II, and present the motivation for designing Plume in the same section. In §III, we describe the challenges encountered in achieving lightweighting and generalization. In §IV, we first give an overview of Plume's design and then dig into its details. In §V, we evaluate the performance of Plume and report the results. We conclude the paper in §VI.
Table 1. Representative congestion control algorithms.
2.1.1 Existing CC Algorithms
The research of CC has a long history. We select representative algorithms from the many CC schemes and divide them into the following three categories for a brief summary.
Heuristic schemes: As summarized in table 1, non-learning-based algorithms determine network conditions from locally collected signals such as packet loss and round-trip time (RTT). The early TCP family, including TCP Tahoe, TCP Reno [19] and TCP NewReno [20], uses packet loss as a congestion signal and controls the congestion window (CWND) with the AIMD (additive increase, multiplicative decrease) heuristic. When non-congestion packet loss occurs, they face severe performance degradation. BIC and Cubic later tried new increment functions for the CWND. However, algorithms relying on packet loss induce high latency, so latency-aware algorithms such as TCP Vegas and Copa were derived. The recent proposal BBR treats the network as a white box and adjusts the sending rate based on its own assumptions about network conditions. None of them can accurately perceive changes in network conditions and respond promptly; we elaborate on this in §2.2.1.
Learning-based: Existing learning-based solutions can be divided into online and offline learning. Online learning-based schemes include PCC Allegro and PCC Vivace. They treat the Internet as a black box and adjust the sending rate toward the most favorable utility. But they do not learn network utilization from experience, which leads to over-utilization and slow convergence. Among offline schemes, TCP Remy models CC as a decision problem under uncertainty and uses network assumptions and prior knowledge to map observations to the CWND and pacing rate. It helps improve transmission efficiency but has poor adaptability: when the environment violates its network assumptions, performance is greatly reduced. Remy also needs to store a lot of information, which limits its practicality. More recently, due to the unlabeled nature of CC, advanced learning algorithms generally use RL to train a model. QTCP combines Q-learning with TCP to design CC. Aurora generates a DRL-based policy that maps observed network statistics to a sending rate, while Indigo employs imitation learning. However, both suffer from over-utilization, and their practicality is limited to the network conditions seen in training.
Environment customization: Other algorithms are designed for specific environments, such as DCTCP, Timely and pFabric for data centers, and Sprout for cellular networks. They often need the cooperation of the underlying hardware, which makes them less universal and costly to update.
2.1.2 Existing Lightweight Methods
Popular network compression methods include knowledge distillation [21], network pruning [22-24] and weight quantization [25-27]. Knowledge distillation uses a teacher model to guide the learning of a student model. The student is trained with supervised information provided by the teacher, so its precision is hard to guarantee, and because of the teacher's limitations the generalization of the algorithm is not easy to improve. Weight quantization hashes the network weights into groups before training, and the weights within each group are shared. Because only the shared weights and hash indices need to be stored, storage resources are saved. However, at inference time the shared weights still need to be restored to their original positions, so computing resources are not saved. Some works use binary/ternary storage, with moderate loss in accuracy. Network pruning removes unimportant parameters or neurons to compress the model. In contrast to the other methods, network pruning has more practical implications. We describe the reasons for this choice in more detail in section 3.2.
Recent research that applies lightweight methods includes PiTree [28], which transforms adaptive bitrate streaming (ABR) algorithms into decision trees to save operating expenses. To the best of our knowledge, ours is the first work that applies lightweight methods to RL-based CC algorithms.
2.2.1 Advantages of RL-Based CC
We will explain the benefits of using RL-based CC from the following two aspects.
Rapid response to changes in network conditions. RL-based algorithms are able to learn complex patterns of behavior, and a trained agent can find intricate relationships between the environment and its actions. Consider a simple network configuration: the bottleneck link bandwidth changes every 5 seconds, as shown in figure 1, with a minimum latency of 10 ms. The shaded part is the actual bandwidth, which is also the ideal sending rate when there is only one flow. We take several representative and widely used CC schemes, TCP Cubic (a TCP variant widely used in Linux and other systems), BBR (released by Google), Copa (released by Microsoft), Sprout (for cellular networks) and PCC-Vivace (a representative online-learning algorithm), emulate them over this network, and report their sending rates. As shown in figure 1, they fail to perceive the network condition or react quickly to changes, and behave too aggressively or too sluggishly, causing congestion or insufficient bandwidth utilization. The sending decision is affected by many factors, and it is difficult for traditional algorithms to capture the complicated relationship between input and output. Therefore, Plume employs an RL framework that can consider many factors such as packet loss, RTT and throughput, and generates a policy from long-term experience that keeps the sending rate close to the expected behavior.
Distinguishing noisy information. As described in the related work, traditional heuristic algorithms determine the network condition based on locally collected signals. But this information is sometimes disorienting: its appearance does not reflect the real condition of the network, as with random packet loss. Relying on such noisy information leads to false hypotheses about congestion, and thus to wrong actions. For instance, the well-known TCP family (e.g., Reno, BIC, Cubic) uses packet loss as a congestion signal. Yet there are various reasons for packet loss, and traditional loss-based algorithms cannot identify non-congestion packet loss. As a result, they suffer a great reduction in performance (10x away in [4]) on highly dynamic links with non-congestion loss (e.g., satellite links, cellular networks). Companies and research institutions have launched many "variant" protocols to compensate in different scenarios, but a single piece of bad information can still deal a devastating blow to performance. As shown in figure 2, we add 1% random packet loss to a dynamic link, and even the robust Cubic fails to fully utilize the link bandwidth. In contrast, Plume effectively identifies non-congestion packet loss. Given the long-term cumulative reward, RL-based algorithms break through the limitations of traditional models and learn empirical network rules, so they can filter noisy information and maintain high link utilization.
2.2.2 Problems with Existing RL-Based CC Algorithms
Although RL-based CC has the above advantages, it is difficult to deploy in practice. Existing approaches suffer from two general issues: 1) they are too heavyweight, and 2) they lack generalization. Next, we analyze these two issues in detail; they are also the driving force behind Plume.
Make the model lightweight. Extensive previous work has shown that increasing the complexity of a model can improve its expressiveness, but each extra dimension significantly slows down the training process. The long training process and the high demand for computational and storage resources greatly hinder the application of RL-based CC on low-power devices such as mobile phones, drones and smart TVs. Meanwhile, the number of mobile-device users is growing rapidly, and more and more enterprises attach importance to lightweight design, constantly pushing the limits of deep learning computation in the hope of realizing its advantages on low-power devices. RL-based CCs show overwhelming strengths but also leave many limitations, which motivates us to propose a solution that delivers the performance of RL-based CC at low cost. [29] shows that the model needs to be smaller than 2 MB for low-power devices, which is exactly the lightweight goal we pursue.
Make the scheme general. As discussed in the introduction, existing RL-based algorithms make a trade-off between performance and generalization. Although they consider many factors, they can only balance among various metrics. As shown in figure 3, we run Aurora in a simulator where the link bandwidth varies between 1 Mbps and 2 Mbps every 2 s. Because its reward can only balance throughput against latency, the algorithm over-utilizes the link. Remy likewise states that it compromises between generality and performance and does not clearly reflect the user's true preferences. This lack of generalization greatly limits the practical scope of RL-based CC and motivates us to make the algorithm customizable for different training needs.
Achieving high performance. A prerequisite for achieving lightweighting without losing precision is having an optimal pre-trained model. For CC as a decision problem, the first step is to determine the sending rate based on historical information so as to achieve the expected performance, which can be higher link utilization, lower latency or less packet loss. In RL algorithms, this expectation is expressed by the reward function, so the design of the reward function is critical, and making the initial model broadly applicable is our challenge. Once the reward function is determined, the goal becomes maximizing its total score. In theory we could find a policy that maximizes it, but the search can be difficult: the history observed so far contains too much information, and we need to pick its most crucial parts. Another issue is when to make decisions.
Table 2. The statistics of input.
How to select and apply a lightweight method? The challenges of lightweighting lie mainly in the choice of method and its application. While a variety of model compression techniques were introduced in §2.1.2, there are many limitations to applying them to CC. First, retaining the behavior of the original model is the basic requirement. If too much accuracy is lost during compression and Plume cannot outperform traditional heuristics, lightweighting makes no sense. However, no prior work applies lightweight methods to RL, so choosing and designing an algorithm that achieves lightweighting without losing precision is a challenge. Second, Plume must reduce the overhead, both shrinking the model and accelerating the computation, without any external support; requiring new or modified hardware would contradict the original intent of generality. Yet most existing lightweight papers focus on theoretical analysis, do not actually realize the model reduction, and rarely mention how to implement compression in practice. How to really reduce the model size and speed up the computation is what we need to think about. Moreover, existing works analyze underlying networks such as multilayer perceptrons (MLP) and convolutional neural networks (CNN), whereas an RL model is composed of multiple interrelated underlying networks. Achieving a real reduction of the model without affecting its performance is therefore a challenge.
Keep high performance and generalization. In pre-training, we can maximize the value of an expected reward function, but the model cannot accommodate multiple objectives. In [8], Remy states that it makes a trade-off between generalization and performance. Remy uses equation (1) to represent a flow's score, where x and y represent the throughput and average round-trip delay of the connection, respectively, and Uβ(y) is obtained by substituting β and y into the form of Uα(x). α and β express the fairness-vs.-efficiency trade-offs in throughput and latency, respectively, and δ expresses the balance between latency and throughput. Remy is an offline algorithm that allows users to select only one preference. How to keep the algorithm's performance high under different objectives is therefore a challenge.
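For concreteness, the sketch below computes a Remy-style flow score in Python, assuming the alpha-fair utility form U_γ(v) = v^(1-γ)/(1-γ) (log v when γ = 1) used in the Remy paper [8]; the function names and example values are illustrative and not taken from equation (1) itself.

```python
import math

def alpha_fair_utility(value, gamma):
    """Alpha-fair utility U_gamma(v): log(v) when gamma == 1, else v^(1-gamma)/(1-gamma)."""
    if gamma == 1:
        return math.log(value)
    return value ** (1 - gamma) / (1 - gamma)

def remy_style_score(throughput, avg_rtt, alpha=1.0, beta=1.0, delta=1.0):
    """Flow score U_alpha(throughput) - delta * U_beta(avg_rtt).

    alpha and beta set the fairness-vs.-efficiency trade-off for throughput and delay,
    and delta weights latency against throughput.
    """
    return alpha_fair_utility(throughput, alpha) - delta * alpha_fair_utility(avg_rtt, beta)

# Example: a flow with 1.5 Mbps throughput and an 80 ms average RTT.
print(remy_style_score(throughput=1.5, avg_rtt=0.08, delta=0.5))
```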
Due to the challenges presented in §III, Plume makes the following choices:
Input variables: We synthesize a large number of experiments and analyze the essence of CC to pick the states in table 2 as input. These follow the idea of [9]; further, we add the packet loss rate. Many existing state-of-the-art learning-based CCs optimize the reward function toward high throughput, low latency and low packet loss, and we likewise reward and penalize these three variables, so Plume's input is mainly calculated and collected around them. These parameters cover a range of statistics that represent the network condition and greatly shorten the state the sender must track.
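Since table 2 only lists the statistics, a rough sketch of how such an input vector could be assembled is given below; the feature names follow Aurora's history-based state [9] plus the loss rate we add, and the history length and padding strategy are illustrative assumptions rather than Plume's exact implementation.

```python
from collections import deque

HISTORY_LEN = 10  # number of recent monitor intervals kept in the state (illustrative choice)

def mi_features(latency_inflation, latency_ratio, send_ratio, loss_rate):
    """Per-monitor-interval statistics: three Aurora-style ratios plus the loss rate we add."""
    return [latency_inflation, latency_ratio, send_ratio, loss_rate]

class StateBuilder:
    """Keeps a sliding window of per-MI features and flattens it into the model input."""

    def __init__(self):
        self.history = deque(maxlen=HISTORY_LEN)

    def push(self, features):
        self.history.append(features)

    def vector(self):
        # Zero-pad until enough monitor intervals have been observed.
        padding = [[0.0] * 4 for _ in range(HISTORY_LEN - len(self.history))]
        return [value for mi in list(self.history) + padding for value in mi]
```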
Compression method: As described in the related work, we choose the network pruning technique, considering both precision preservation and practicability. Pruning can solve the second challenge described in the above subsection (§3.1) for two reasons:
• Over-parameterization: During the training stage a large number of parameters are needed to capture the tiny details in the data, but once training is complete and the model enters the inference stage, far fewer parameters are required. However, if the model is too small from the start, its expression space is limited, which results in underfitting. Neural network pruning can remove a large fraction of parameters (over 90% [11]).
• Lottery ticket hypothesis: The paper [11], published in 2019, shows that a sparse network derived by pruning an initial network can be trained better than a small network initialized directly. By training a complex deep network we make the model more expressive, and by pruning it the performance of the model can be retained more completely while achieving lightweighting.
Figure 4 shows the overview of Plume. We introduce the algorithm through the following three parts:
• Training the initialization model (§4.2.1): In order to fully learn the rules of the network, Plume first trains a complex DRL model based on previous experience. In this way, the complex model can capture every detail in the environment that influences proper actions.
• Model pruning (§4.2.2): Plume lightens the deep network using structured pruning [30] while preserving most of the performance, and we propose a network replication approach to obtain the compressed model.
• Retraining the model structure (§4.2.3): In the retraining process, Plume modifies the reward function. Having obtained the "winning ticket" (the effective model structure) in the second step, our experiments prove that modifying the reward function yields the desired behavior, and the result is obtained faster than training another complex model and pruning it.
Next we will give a detailed introduction to these three modules.
4.2.1 Training the Initialization Universal Model
Plume's operation is to find the best mapping from states to actions; this mapping is called the policy in RL. Plume's job at this stage is to pre-compute the policy adopted in the model. Training the model requires the following elements, for which we follow the idea of [9]:
Input and output: In §3.2, we illustrate the input to the model. At each monitor interval (MI), the agent collects observations; the algorithm then calculates the corresponding parameters and feeds them into the model to select the sending rate. Since the CC problem has a huge action space, the output of the agent is a parameter a_t that corrects the current sending rate x_t: each modification applies a_t · x_t, and the sign of a_t determines whether the rate increases or decreases.
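A minimal sketch of one reading of this update rule is shown below; the clipping bounds and helper name are our own assumptions, not values from the paper.

```python
def apply_action(current_rate, a_t, min_rate=0.01, max_rate=1000.0):
    """Adjust the sending rate by a_t * x_t: positive a_t increases the rate, negative decreases it.

    Rates are in Mbps; the bounds are illustrative safety limits.
    """
    new_rate = current_rate + a_t * current_rate
    return max(min_rate, min(max_rate, new_rate))

# Example: a_t = 0.1 grows the rate by 10%, a_t = -0.2 shrinks it by 20%.
rate = 5.0
rate = apply_action(rate, 0.1)   # 5.5 Mbps
rate = apply_action(rate, -0.2)  # 4.4 Mbps
```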
Reward: The reward defines the goal of the RL problem and guides the tendency of policy changes. Different applications place different demands on the network. We first design Plume for the most common objective of CC, namely high bandwidth and low latency. However, random packet loss is severe on high-loss networks, and packet loss is prone to occur when the sender is greedy for bandwidth, so we also treat packet loss as a penalty item. Therefore, following Aurora, we formalize the reward as
reward = α · throughput − β · latency − c · loss,    (3)
where α, β and c are positive weighting coefficients.
That is, high throughput is used as a reward item, while latency and packet loss are used as penalty items to train the policy.
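A sketch of this linear reward is given below; the default weights follow the baseline setting reported in the evaluation (10·throughput − 1000·latency − 2000·loss), and other (α, β, c) combinations are substituted when retraining for specific scenarios (§4.2.3).

```python
def reward(throughput, latency, loss, alpha=10.0, beta=1000.0, c=2000.0):
    """Linear reward: throughput is rewarded, latency and packet loss are penalized.

    The defaults match the baseline configuration used in the experiments; alpha, beta
    and c are the knobs adjusted during customized retraining.
    """
    return alpha * throughput - beta * latency - c * loss
```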
The training structure and algorithm: The model uses a fully connected neural network architecture and the Proximal Policy Optimization (PPO) [31] algorithm to learn the policy. PPO is a policy-gradient method built on the Actor-Critic framework: in each iteration Plume computes a new policy while ensuring that its deviation from the previous iteration's policy remains small, which suits a complex and huge exploration environment like the network. In the algorithm, the agent plays the role of the actor and trains its policy π with network parameters θ. The agent continuously interacts with the environment to form a trajectory
τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T},
where s denotes states and a denotes actions. The reward obtained along the sequence τ is the sum of the rewards obtained at each step, denoted R(τ). Therefore, when the actor's policy is π, the expected reward is E_{τ∼p_θ(τ)}[R(τ)].
To maximize the expected reward, gradient-based updates are applied to the network parameters θ. To speed up training, PPO uses importance sampling: another actor with parameters θ′ collects the trajectories, the previous policy π_old serves as the behavior record, and π is the target policy being updated. The expected likelihood objective then becomes
J^θ′(θ) = E_{(s_t, a_t)∼π_θ′}[ (p_θ(a_t | s_t) / p_θ′(a_t | s_t)) · A^θ′(s_t, a_t) ],    (5)
where A^θ′(s_t, a_t) is the advantage of (s_t, a_t). If the probability distributions of π and π_old differ too much and the learning step is too large, the policy will fluctuate greatly. The algorithm therefore adds the Kullback-Leibler (KL) divergence between the two distributions as a penalty that limits the scope of the update, so the update process maximizes J^θ′(θ) in equation (5) minus a weighted KL term.
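As a schematic of this KL-penalized surrogate (written with NumPy for clarity; the penalty coefficient and batch layout are our own assumptions, not the exact training code):

```python
import numpy as np

def ppo_kl_surrogate(logp_new, logp_old, advantages, kl_divergence, kl_coef=0.2):
    """KL-penalized PPO surrogate: importance-weighted advantage minus a KL penalty.

    logp_new/logp_old are log pi_theta(a_t|s_t) and log pi_theta'(a_t|s_t) over a batch,
    advantages are A^theta'(s_t, a_t), and kl_divergence measures the gap between the
    two policies. kl_coef is an illustrative penalty weight.
    """
    ratio = np.exp(logp_new - logp_old)          # p_theta / p_theta' via importance sampling
    surrogate = np.mean(ratio * advantages)      # J^theta'(theta) in equation (5)
    return surrogate - kl_coef * kl_divergence   # the quantity to maximize
```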
4.2.2 Model Pruning
After the discussion in §3.2, the structured pruning scheme stands out. To apply it to the CC model we need to solve two problems: 1) how to decide which neurons to prune, and 2) how to make the model truly smaller and reusable. Next, we explain our design around these two issues.
Theoretical foundation of pruning. The basic method is to prune the initial model according to the importance of its neurons. The closer a weight is to 0, the less impact it has on the result, and pruning a neuron is equivalent to cutting all the connections associated with it. However, it is unlikely that all connections of a neuron are close to 0, so we use the L2 norm to measure the sparsity of a neuron. Why the L2 norm? For an n-dimensional feature X = (x_1, x_2, x_3, ..., x_n), it is given by equation (7):
||X||_2 = sqrt(x_1^2 + x_2^2 + ... + x_n^2).    (7)
The L2 norm amplifies the effect of the values, especially small ones, and its minimum is reached when as many weights of a neuron as possible are close to zero. In this way, neurons with lower scores have less influence on the model.
Our goal is then to find the remaining neurons in each layer so as to maximize the pruning objective defined in equation (8): the sum, over every layer l and every retained neuron n, of ||W_l^{n,g}||, where g is the group of weights related to neuron n and ||W_l^{n,g}|| is the norm calculated from this group. We use the L2 norm to calculate the sparsity of W_n, so the original problem becomes maximizing the L2 norm of the remaining neurons, and the score of each neuron is calculated, as in equation (9), as the L2 norm of its weight group.
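A sketch of this scoring step for a single fully connected layer is shown below, assuming the weight matrix is stored as (inputs × neurons); this is a simplification of the multi-network PPO model handled in §4.2.2.

```python
import numpy as np

def neuron_scores(weight_matrix):
    """Score each neuron by the L2 norm of its weight group (one column per neuron)."""
    return np.linalg.norm(weight_matrix, axis=0)

def select_kept_neurons(weight_matrix, prune_ratio):
    """Return the sorted indices of neurons kept after one-shot pruning by prune_ratio."""
    scores = neuron_scores(weight_matrix)
    n_keep = int(round(weight_matrix.shape[1] * (1.0 - prune_ratio)))
    return np.sort(np.argsort(scores)[::-1][:n_keep])  # keep the highest-scoring neurons

# Example: prune 50% of an 8-neuron layer that has 16 inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
kept = select_kept_neurons(W, prune_ratio=0.5)
```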
Reduce the model. The previous part describes how to evaluate the importance of neurons; this part explains how to perform the pruning. The existing work referenced in §III does not give an application method and simply sets the relevant weights to 0. To actually achieve lightweighting, we designed and tried several approaches. 1) Express the tensor in a sparse format, storing only the non-zero positions and values. But then the tensors need to be parsed whenever another application reuses the model and cannot be used directly, and the restored matrix still contains zero values, so computation is not accelerated. 2) Directly modify the stored weight matrices to remove the zero weights. But neither TensorFlow nor PyTorch supports this operation well, and it destroys the original model, causing a loss of reusability. To solve these problems, we use network reproduction: according to the pruning percentage, we build a new model of the same size as the remaining network and copy the unpruned parameters into it. In this way the original model is not broken, yet we obtain a genuinely reduced model. The difficulty lies in how to build the small model: because the RL algorithm we use is complicated, the resulting model network structure is also complicated. To clarify the architecture, we print out each layer of the model and analyze it carefully, as shown in figure 5.
Neural Architecture Learning. As shown in figure 5, the solid black lines represent information transmission, the solid blue lines represent back-propagation and weight updates, and the red dotted line is part of the loop update. The model saved by the PPO algorithm stores two policies, π and π_old. Each policy is divided into a critic network (vf network) and an actor network (pol network), and there is also a q-value network; every network is fully connected. The pruning process only considers the lower connections of a neuron, but truly reducing the model requires removing all connections related to that neuron. The stored weight matrix of each layer represents the connections between the neurons in this layer and the neurons in the lower layer, and the pol layers and vf layers are stored alternately. We therefore first separate the pol layers from the vf layers to find the matrix elements related to the neurons pruned in the lower layer. Plume then copies the remaining elements into the new network to form a new model, which waits to be retrained for different scenarios.
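Once the pol and vf matrices are separated, the copy step reduces to index selection on the stored weights. The following simplified sketch handles a plain chain of fully connected layers (the interleaved pol/vf storage and the q-value network are omitted; the helper names are our own):

```python
import numpy as np

def replicate_layer(W, b, kept_in, kept_out):
    """Copy the surviving weights of one dense layer into a smaller matrix.

    W has shape (n_in, n_out); kept_in and kept_out are the retained neuron indices of
    the previous and current layer, so both the lower and upper connections of pruned
    neurons are removed.
    """
    return W[np.ix_(kept_in, kept_out)], b[kept_out]

def replicate_network(weights, biases, kept_per_layer, n_inputs):
    """Build the reduced parameter list layer by layer from the original matrices."""
    small_weights, small_biases = [], []
    kept_in = np.arange(n_inputs)  # input features themselves are never pruned
    for W, b, kept_out in zip(weights, biases, kept_per_layer):
        W_small, b_small = replicate_layer(W, b, kept_in, kept_out)
        small_weights.append(W_small)
        small_biases.append(b_small)
        kept_in = kept_out  # the next layer's inputs are this layer's surviving neurons
    return small_weights, small_biases
```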
For pruning, we use a "one-shot" approach rather than the common iterative one: we cut off the surplus neurons at once, in proportion. Previous work uses a prune, then fine-tune cycle to preserve performance; for example, to prune 50%, it would first prune 10%, then retrain, and so on iteratively. Plume, on the other hand, directly prunes 50% of the neurons.
We reserve retraining for when the model is deployed in different scenarios. The lightweight model can run at the remote end and establishes the foundation for subsequent customized training. The experiments in §5.2 demonstrate that pruning in this way still preserves most of the performance of the original model.
4.2.3 Customized Retraining
Another disadvantage of RL-based CC, shared with traditional heuristic algorithms, is the lack of generalization. As shown in figure 3, in order to maintain high throughput, some latency has to be sacrificed, causing packet loss. We also tried other clean-slate reward functions. [32] designs its reward around an important metric called Power, defined as Power = throughput / latency [33]. The results show that the same problem still exists; detailed results are given in §5.5. Although such rewards meet the general requirements of high throughput and low latency, pursuing more extreme performance in different scenarios requires customization for each network.
To solve the above problems, we innovatively propose customizing the reward function for the environment in the retraining phase. Retraining usually relies on iterative pruning, but that is better suited to unstructured pruning and is not what we need. With the one-shot pruning adopted earlier, it is more important to keep the network structure after pruning, since this structure preserves the connection between input and output. When retraining, the framework of the reward function is still based on equation (3); we train the small model with different (α, β, c) combinations according to the scenario's needs, and the detailed results are described in §V.
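For instance, the two customizations evaluated later (figures 11 and 12) amount to rescaling the baseline weights; the constant names below are illustrative, with the scaling values taken from §V.

```python
BASELINE = dict(alpha=10.0, beta=1000.0, c=2000.0)

# Low-latency scenario (figure 11): throughput weight x0.3, latency weight x2.
LOW_LATENCY = dict(alpha=0.3 * BASELINE["alpha"], beta=2 * BASELINE["beta"], c=BASELINE["c"])

# Competitive-link scenario (figure 12): throughput weight x4, latency weight x2.
AGGRESSIVE = dict(alpha=4 * BASELINE["alpha"], beta=2 * BASELINE["beta"], c=BASELINE["c"])
```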
Why does this method work? First, the model has become lightweight. Previous RL-based CC training consumed a lot of computing resources and took a long time to retrain, so those schemes could only be used after offline training. After our lightweight processing, retraining becomes shorter and occupies fewer resources, so simple retraining can be performed in the application. Second, we obtain an effective model: §5.2 shows that the reduced model retains the performance of the large model, and only a small amount of retraining is needed to achieve the tendency expressed by the same-structured reward function. Plume can therefore be customized for different scenarios to achieve optimal performance.
We verify Plume's performance from multiple perspectives through simulation experiments, covering performance retention, overhead and generalization capability. The evaluation results show that:
• Performance. Plume maintains the performance of the original model in terms of sending rate, throughput and latency. Moreover, after retraining, the latency can be reduced by over 50% and the packet loss rate by over 80%. In an environment competing with TCP Cubic, Plume converges faster.
• Overhead. Plume reduces the model size by over 22% and the decision latency by over 19%.
• Necessity. The experimental results show that directly training a model of the post-pruning size does not perform well, and that environment customization is indeed necessary.
We use Python to build a network simulation environment and an open-source gym to train the agent. We implement an interface in the framework to trim the model at different scales using TensorFlow. To facilitate modification and testing, we also expose an interface for the parameters of the reward function.
The trained model is saved in the ckpt format. Next, the model is re-read and another network is constructed according to the pruning size. We then set the pruned columns to 0 in the weight matrices. At this point only the lower connections of a neuron are deleted while the upper connections remain; this does not affect the computation result but would confuse the process of copying into the small network. Therefore, we disassemble the model according to the analysis in §4.2.2, pruning all the connections related to each removed neuron. A truly reduced model is then formed and retrained.
The network topology tested is the same end-to-end structure used in most CC papers. The network's packet loss rate fluctuates between 0 and 0.05, the link latency between 0.05 and 5 ms, and the buffer size between 0 and 8 orders of magnitude of the packet size. We test Plume under various link bandwidths and have run experiments in a large number of different environments. The following mainly shows the results in two environments; others can be found in the appendix. Environment one: the link bandwidth fluctuates regularly every 2 s. Environment two: the link bandwidth fluctuates randomly and irregularly.
Setup: We first train the initial model. We test a variety of neural network architectures under different network conditions, with different numbers of layers and neurons. More complex network conditions require a more complex model to learn the rules; if the model is not large enough, it cannot fully learn them, and performance under irregular network conditions is poor. We finally choose four hidden layers of 32→32→32→16 neurons for the simulation experiments. The reward function, following [9], is 10 · throughput − 1000 · latency − 2000 · loss. We prune the model in different proportions, such as 25%, 50% and 75%, and compare the performance in terms of sending rate, throughput and latency under different network conditions. Results in environments one and two are shown in figure 6.
Result: The results are shown in figure 6. In the three subfigures, the orange line represents the performance of the initial model. We focus on how well the differently colored lines fit each other; a better fit means the performance is retained more completely. Figure 6a indicates how well the sending rate is retained. The experiments show that performance is still well preserved when the pruning ratio reaches 70%, and the sending rate is even more reasonable than that of the initial model. The same holds for throughput. However, as can be seen in figure 6c, the latency starts to increase when the pruning ratio reaches 70%. This proves that the model still has a lot of room for lightweighting and that it maintains the original model's functionality after reduction.
Setup: To prove the necessity of lightweighting, we directly train a model with 50% of the network used in §5.2, in the same training environment, and test it in environment two. The results are shown in figure 7.
Result: Figure 7 shows that this model has not learned the connection between network information and sending rate; it cannot respond effectively to bandwidth changes and wastes network resources. In contrast, the pruned model, which has the same network size as the small model, performs well, as also shown in figure 7. This demonstrates the necessity of training complex models and the effectiveness of pruning for lightweighting.
5.4.1 Decision Latency and Model Size
Setup: In §5.2 we confirmed that the pruned model still maintains proper performance. Next, we discuss the overhead of Plume, measuring both the model size and the decision latency. To measure the decision latency, we run each model in environment two. The CPU used in the simulation is an Intel Xeon(R) Gold 6230 CPU @ 2.10 GHz × 32, and the memory is 16 GB. We record the time the model takes from obtaining information from the environment to making a decision, and average over multiple tests. The decision latency is shown in figure 8. We also compare the model size before and after pruning to check whether the model is suitable for deployment.
Result: The figure shows that the reduction in decision latency is proportional to the pruning ratio; the pruned model reduces the decision latency by over 19%. A shorter decision time means the scheme can react to network conditions more quickly and use network resources more effectively. After many measurements, the reduction in model size is also proportional to the pruning ratio, at least 22%. The size drops to about 800 KB, below the 2 MB threshold, and is thus suitable for low-power devices.
5.4.2 CPU Consumption
Setup: Since the experiment is conducted in simulation, events are executed in discrete form and the CPU is fully occupied while running, so it is difficult to measure the relationship between model pruning and CPU consumption directly. Instead, we estimate the effective CPU load as (decision latency / decision interval) × measured CPU load. The result is shown in figure 9.
Result: As the figure shows, the CPU utilization gradually decreases as the pruning ratio increases, with a maximum reduction of about 16%. This means that fewer computing resources are required when the scheme is deployed, which is very friendly to mobile devices.
Setup: As mentioned in §2.2.2, existing RL-based CC has a monotonous reward function, so the schemes lack generalization. To show that this problem is not unique to Aurora, we train another model with the Power-based reward function mentioned in §4.2.3 and test both in environment two. Their sending rate and throughput are reported in figure 10.
Result: In the figure, Reward 1 represents the linear function we use, and Reward 2 is the Power function in §4.2.3. Reward 2 can quickly change the action according to the bandwidth, but with large fluctuations; after many tests this problem persists. It also shares the problem of [9]: it can only balance among different performance metrics. Therefore, we need to customize the training for the environment so that the scheme achieves extreme performance in different scenarios.
Setup: To achieve low latency, we redesign the reward function by multiplying the throughput parameter α by 0.3 and the latency parameter β by 2, and retrain. The link latency varies from 0.1 ms to 0.3 ms. We put each model into environment two and record the sending rate, latency and packet loss rate of the models with pruning ratios of 25% and 75%, compared with the original model. The results are shown in figure 11.
Result: Figure 11a shows the latency results. Aurora's average latency is 0.7 ms, while the average latency after pruning and retraining is 0.34 ms, a reduction of about 50%. Moreover, no matter how the bandwidth changes, the latency of the retrained model fluctuates very little, unlike Aurora. Figure 11b shows the packet loss rate under the same conditions. For universality, Aurora sends as many packets as possible into the network, causing serious packet loss, whereas the pruned model quickly reduces the packet loss rate to close to 0 after only a small amount of retraining. This is very important for links with high random loss.
Figure 1. Sending rate of traditional and online-learning representative CC algorithms over an emulated network whose bottleneck link bandwidth changes every 5 seconds.
Figure 2. A trace of the throughput of Plume and TCP Cubic on a link with random packet loss.
Figure 3. A trace of Aurora on a link which alternates between 1 and 2 Mbps every 2 s.
Figure 4. The overview of Plume.
Figure 5. Model architecture analysis.
Figure 6. Performance comparison of models before and after pruning under different network conditions.
Figure 7. Training directly with the same network size as the pruned model yields very poor performance.
Figure 8. Overhead of Plume: the time it takes for the model to obtain information from the environment to make a decision.
Figure 9. Overhead of Plume: CPU consumption.
Figure 10. Results under different reward function training.
Figure 11. Latency and packet loss rate comparison after modifying the reward function (throughput parameter α multiplied by 0.3, latency parameter β multiplied by 2) and retraining.
Figure 12. Throughput standard deviation comparison after modifying the reward function (throughput parameter α multiplied by 4, latency parameter β multiplied by 2) and retraining.
Figure 13. Overall normalized avg. delay and avg. throughput in several real traces.
Figure 14. Performance comparison of models before and after pruning under different network conditions.
Setup: To improve the aggressiveness of the algorithm, we redesign the reward function by multiplying the throughput parameter α by 4 and the latency parameter β by 2, and retrain. We also implement TCP Cubic in the framework; Cubic is a representative loss-based CC algorithm with strong preemptiveness. We put Cubic and Plume into a network environment where the link bandwidth gradually decreases and then gradually increases. The standard deviation of the bandwidth obtained by each scheme is shown in figure 12.
Result: The blue solid line in the figure represents the ideal bandwidth allocation. As can be seen, Plume and Cubic eventually converge to a fair allocation. This experiment shows Plume's ability to adapt quickly to environmental changes, converging faster than Cubic. What's more, even against such a "strong opponent" as Cubic, Plume is not disadvantaged and maintains fairness. This proves the advantage of customizing for the environment and Plume's flexibility and generalization.
Setup: To demonstrate the performance of Plume under real traces, we acquire real traffic traces from the mahimahi website [34]. These traces represent time-varying cellular network traffic for US users. We select several sets of scenarios and apply them under the framework described in §5.1. In figure 13, each ellipse indicates the standard deviations of the delay and the throughput and their covariance, and the center of each ellipse shows the average of the results.
Result: We select a representative traditional algorithm, Cubic, and a representative learning-based algorithm, Aurora, for comparison with Plume. As can be seen in figure 13, Cubic's aggressiveness causes longer latency, and random packet loss degrades its throughput severely. Compared with Cubic, Aurora performs better. However, Plume retains consistently high performance under every trace.
In this paper, we propose Plume, a high-performance, lightweight and generalized congestion control scheme based on deep reinforcement learning. Plume uses pruning to faithfully retain the performance of the original model while reducing computing and storage resources, and modifies the reward function to enhance the generalization of the scheme. Extensive experiments show that Plume achieves high performance and low overhead at the same time. We therefore believe that Plume improves the friendliness of mobile-device deployment and is universal across different network conditions.
ACKNOWLEDGEMENT
This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61872401, the National Natural Science Foundation of China (NSFC) under Grant No. 62132022, the Fok Ying Tung Education Foundation (No. 171059), and the BUPT Excellent Ph.D. Students Foundation (No. CX2021102).
APPENDIX
To reveal the generality of Plume, we supplement the two environments presented in §5.1 with additional settings. As shown in figure 14, the shaded part represents the real bandwidth, including a highly variable link. Plume maintains its pre-pruning performance in these different environments.