Ali Forootani, Raffaele Iervolino, Massimo Tipaldi, and Joshua Neilson
Abstract—A stochastic resource allocation model, based on the principles of Markov decision processes (MDPs), is proposed in this paper. In particular, a general-purpose framework is developed that takes into account resource requests for both instant and future needs. The framework can handle two types of reservations (i.e., specified and unspecified time interval reservation requests) and implement an overbooking business strategy to further increase business revenues. The resulting dynamic pricing problems can be regarded as sequential decision-making problems under uncertainty, which are solved by means of stochastic dynamic programming (DP) based algorithms. In this regard, Bellman's backward principle of optimality is exploited to provide all the implementation mechanisms for the proposed reservation pricing algorithm. The curse of dimensionality, the inevitable issue of DP, arises both for instant resource requests and for future resource reservations. An approximate dynamic programming (ADP) technique based on linear function approximations is therefore applied to address such scalability issues. Several examples are provided to show the effectiveness of the proposed approach.
RESOURCE allocation is the class of problems in which one has to assign resources to customer requests over some finite time horizon. Many important real-world matters can be cast as resource allocation problems, including applications in air traffic flow management [1], energy [2], logistics, transportation, and fulfillment [3]. These problems are notoriously difficult to solve for two reasons. First, they typically exhibit stochasticity, i.e., the requests to be processed may arrive randomly according to some stochastic process which, itself, depends on where resources are allocated. Second, they exhibit extremely large state and action spaces, making solution by traditional methods infeasible [4], [5]. In the fields of operational research and artificial intelligence, primarily, the state space as well as the decisions (or actions) are discrete. In this regard, resource allocation problems with discrete states and decisions are studied at length under the umbrella of Markov decision processes (MDPs) [6]. Systems with uncertainty and nondeterminism can be naturally modelled as MDPs [7], [8]. For instance, emotion recognition in text [9], speaker detection [10], and fault-tolerant routing [11] have been considered as MDPs.
The optimal policy for MDPs can be computed by applying exact dynamic programming (DP) techniques, thanks to their strength in solving sequential decision-making problems [7]. However, it is well known that such techniques suffer from the curse of dimensionality, which is due to the state and action space explosion of real-world applications [8]. For this reason, efforts have been devoted to finding techniques able to solve this problem in an approximate way [12]. This field has evolved under a variety of names, including approximate dynamic programming (ADP), neuro-dynamic programming, and reinforcement learning [13]–[15].
In this paper, resource allocation problems are formulated and solved via a general-purpose MDP-based framework, which can be used for different real business contexts. We address both instant (i.e., the customer requires a resource to be allocated immediately) and advance (i.e., the customer books a resource for future use) resource requests. The same resource can be sold at different price values at different times to take advantage of the heterogeneous preferences of customers over time, e.g., a seat on an airplane or a room in a hotel. Both the formulation and the corresponding resolution of resource allocation problems with instant resource requests were first explored by the authors in [16], where only exact DP approaches were applied. In [17], the authors further extended the approach to incorporate the possibility that, besides an immediate allocation request, a customer can book a resource in advance for future utilisation. Two types of booking procedures were considered, that is to say, booking resources with specified and unspecified time interval options.
The main differences of this paper with respect to [17] are the following: i) new assumptions and procedures both in the modeling and in the resource reservation approach; ii) the usage of the unweighted sample based least squares (USLS)-ADP algorithm, instead of a temporal difference ADP based approach [18], to tackle the curse of dimensionality; iii) comprehensive and analytical algorithms suitable for computer based implementation. As for the first aspect, the proposed solution manages overbooking situations, which occur when the number of allocated resources at the current time slot cannot be confirmed because new resources have already been allocated for the next one (due to the advance resource reservation mechanism), and the overall needs cannot be satisfied by the system's overall capacity. Such overbooking situations occur because the system handles both instant and future resource requests. Managing them entails a significant update in the modelling of resource allocation problems, their dynamics, and the resulting allocation policy.
The USLS-ADP algorithm was presented for the first time by the authors in [19], where its convergence properties over an infinite time horizon are also discussed. The USLS-ADP algorithm inherits both the contraction mapping and the monotonicity properties of exact discounted DP algorithms [20]. Thanks to this, it is suited for both finite and infinite time horizons. As a consequence, it can be used for solving resource allocation problems with instant and advance reservation requests. The latter case actually involves time intervals over finite (possibly very large) time horizons.
As for the third aspect, we provide the implementation mechanisms of the reservation pricing algorithms, starting from the steps defined in [17]. For instance, we show how to exploit Bellman's principle of optimality [7] to assess the allocation of future resources at different prices and their impact on the current expected total revenue. More specifically, a stochastic prediction of the system evolution (up to the time when the new resource is requested) is performed. Then, the set of possible prices is applied and assessed based on how each price affects the current expected total revenue. When the most suitable price is chosen, the complete pricing policy is updated. This approach shows how to bridge the gap between model predictive control (MPC) and DP [13].
Various parts of the proposed modeling and optimization approach, consisting of DP, the reservation procedure, and ADP, have been implemented in the MATLAB environment for resource allocation problems in a general framework. Moreover, different examples are provided to support and evaluate the effectiveness of the method.
This paper is organized as follows. Section II shows how to model resource allocation problems via MDPs. Resource allocation problem modeling with specified time interval reservation requests and the related pricing algorithm are provided in Sections III–V. Resource allocation problems with unspecified reservation time intervals are outlined in Section VI. Section VII addresses the usage of the proposed reservation pricing algorithm for resource allocation problems with large state spaces. Simulation results are provided in Section VIII. Section IX outlines the scientific literature relevant to this work. Finally, Section X concludes the paper.
This section shows how to formulate resource allocation problems as a set of constrained parallel discrete-time birth-death processes (BDPs) [21], which are integrated into one Markov decision process (MDP). The configuration of the resulting MDP based framework can be controlled by the price manager, who assigns a price among m possible choices by applying a specific pricing policy.
Such an approach was first introduced by the authors in [16]. Hereafter, the main aspects of this framework are outlined along with the notation used in this manuscript. We also provide some preliminaries on how to solve the related decision-making problem via DP based techniques [7], [20]. For more details, the reader can refer to [16].
This paragraph addresses the problem of dynamically pricing N ∈ ℕ equivalent resources and allocating them to customers. A set of m hourly prices (or prices per unit of time) is given, and the price manager can select one of them in order to maximize the expected total revenue. They can also reject resource requests from customers, if deemed not convenient from a profit standpoint. The price manager can charge different prices for the same resource over time, depending on the resource availability and the expected profit. As shown in [16], it is possible to formulate such a price management system as a set of BDPs. In particular, there is a dedicated BDP for each feasible price c_i (with i = 1, ..., m), which allows modelling the unpredictable behavior of customers in requesting and releasing resources. As a result, the system evolves as a set of parallel BDPs. By assigning a specific price at each time slot, the price manager defines which BDP is active for one (possible) birth and one (possible) death, whereas all the others are active only for one (possible) death. In this regard, we assume that:
• At most one customer can request a resource at each time slot. Moreover, each customer can request it for either immediate or future reservation (the latter is addressed in the next sections).
• The time slot duration is chosen so that, for each price c_i, at most one customer associated with each BDP may leave at each time slot.
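The assumptions above can be illustrated with a minimal one-slot simulation of the m parallel BDPs. This is a sketch under our own naming conventions; the birth/death probabilities are placeholders, not the paper's calibrated values:

```python
import random

def step(n, active, lam, mu, N, rng):
    """Advance the m parallel birth-death processes by one time slot.

    n      : allocated resources per price class (one entry per BDP)
    active : index of the BDP activated by the posted price, or None
             when the request is rejected; only this class may see a birth
    lam, mu: per-class birth/death probabilities (placeholder values)
    N      : total capacity shared by all classes
    """
    n = list(n)
    # at most one death per class per slot, per the slot-duration assumption
    for i in range(len(n)):
        if n[i] > 0 and rng.random() < mu[i]:
            n[i] -= 1
    # at most one birth, only in the active class, and only if capacity allows
    if active is not None and sum(n) < N and rng.random() < lam[active]:
        n[active] += 1
    return n

rng = random.Random(0)
state = [2, 1, 0]   # illustrative starting allocation per price class
state = step(state, active=2, lam=[0.5, 0.4, 0.3],
             mu=[0.5, 0.6, 0.6], N=6, rng=rng)
```

The invariant enforced here is the one the model requires: the total allocation never exceeds the capacity N, and only the BDP selected by the price manager can gain a customer.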
This section introduces the DP framework used to compute a proper pricing policy for resource allocation problems with advance resource requests. Such a framework, along with its related definitions, is used in the next section, where the reservation pricing algorithm is described.
This section describes the algorithm for calculating the price profile to be proposed by the price manager in case of a reservation request for the future time interval [h1, h2].
The following assumptions are made:
• The initial reservation vectors x_j for all the time slots j ∈ [0, T−1] are provided. Note that, as time goes by, these vectors can be updated, since customers can request resources in advance to satisfy their future needs.
• Without any loss of generality, the underlying Ti-MDP is computationally tractable, that is to say, ADP and Monte Carlo simulations are not needed.
The algorithm adopts a backward iteration strategy. More specifically, the whole time interval is scanned backwards, and for each h ∈ [h1, h2] the best price is calculated. At the same time, the temporary set of reservations is updated, along with the associated optimal value function and policy. Finally, the resulting price profile is proposed to the customer. If the customer accepts it, the new set of reservations is confirmed, as well as the associated optimal value function and policy.
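A schematic rendering of this backward scan is given below, under the simplifying assumption that the effect of a tentative booking can be scored by an abstract expected-revenue evaluator, which stands in for the optimal value function recomputation described in the text; all names here are illustrative:

```python
def quote_reservation(h1, h2, prices, reservations, expected_revenue):
    """Backward scan of the requested interval [h1, h2] (sketch).

    reservations     : map time slot -> {price: number of reserved units}
    expected_revenue : callable scoring a full reservation map; a stand-in
                       for the DP value-function update (names are ours)
    """
    tentative = {h: dict(cnt) for h, cnt in reservations.items()}
    profile = {}
    for h in range(h2, h1 - 1, -1):          # scan the interval backwards
        best_c, best_v = None, float("-inf")
        for c in prices:                     # try each candidate price
            tentative[h][c] = tentative[h].get(c, 0) + 1
            v = expected_revenue(tentative)  # score the trial booking
            tentative[h][c] -= 1             # undo the trial
            if tentative[h][c] == 0:
                del tentative[h][c]
            if v > best_v:
                best_c, best_v = c, v
        tentative[h][best_c] = tentative[h].get(best_c, 0) + 1
        profile[h] = best_c                  # best price for slot h
    return profile, tentative
```

If the customer accepts the returned price profile, the tentative reservation map replaces the confirmed one; otherwise it is discarded, mirroring the confirmation step in the text.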
The reservation pricing algorithm consists of the following steps (at the beginning k = 0):
In the normal course of events, customers may request resources for future utilization without specifying their release time. As a result, unlike the allocation time h > k, such resources are released according to the system's stochastic dynamics, that is to say, the release event is linked to the death rate μ_i of the proposed price. This section outlines the modeling of the underlying Ti-MDP with specified and unspecified time interval reservation vectors, as well as the associated pricing algorithm.
Let s be the set of unspecified time interval reservations. In particular, we define
The proposed approach has been evaluated over numerical cases to show its effectiveness. Both exact DP and ADP techniques have been used, with the support of purpose-built MATLAB programs. The latter are configurable, meaning that they provide the capability of defining the values of the problem to be solved, e.g., the λ's and μ's for i ∈ {1, ..., m}, the number of resources (N) and prices (m), and the time horizon (T). For all the examples, we have used the basis functions given by (52), i.e., the number of resources associated with each price c_i multiplied by the parameter r_i.
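With these basis functions, the approximate value of a state reduces to a weighted sum with one parameter r_i per price class. A one-line sketch (the variable names are ours, not those of the MATLAB implementation):

```python
def approx_value(state, r):
    """Linear value approximation from the basis functions in (52):
    V~(x) = sum_i r_i * n_i, where n_i is the number of resources
    currently allocated at price c_i."""
    return sum(ri * ni for ri, ni in zip(r, state))
```

For instance, with state [2, 1, 0] and parameters r = [1.0, 2.0, 3.0], the approximate value is 2*1.0 + 1*2.0 + 0*3.0 = 4.0. Fitting the r_i from sampled trajectories is precisely what the USLS-ADP algorithm does.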
State space explosion can occur even with relatively small values of m and N. Indeed, increments in such parameters cause an exponential growth in the size of the state space [19]. Hence, exact DP becomes impractical even for relatively small problem instances. The simulation examples are divided into two main parts. The first part is dedicated to the DP reservation algorithm results, while the second part is dedicated to the USLS-ADP algorithm.
Policies computed by DP techniques can be represented by means of lookup tables. In other words, for a given state and time slot j, one can associate the corresponding action calculated by the proposed algorithm. However, such a representation can be impractical even for small state and action spaces. Therefore, more compact representations can be used [18]. In this regard, we apply a statistical index showing the frequency distribution of each action at each time slot over the entire state space. It is worth highlighting that such a statistical index for policies can also be impractical in case of large action spaces.
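Such a frequency-distribution index can be computed directly from a lookup-table policy. The following sketch assumes the policy is stored as a map from (state, time slot) pairs to actions; this storage layout is our assumption, not the paper's implementation:

```python
from collections import Counter

def action_frequencies(policy, states, T):
    """Frequency distribution of each action at each time slot over the
    entire state space; policy[(x, j)] is the lookup-table action."""
    freqs = []
    for j in range(T):
        counts = Counter(policy[(x, j)] for x in states)
        total = sum(counts.values())
        # normalise counts by the state space cardinality
        freqs.append({a: c / total for a, c in counts.items()})
    return freqs
```

The result is one small dictionary per time slot (one frequency per action), which is what the figures in this section plot, instead of the full lookup table.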
In all the proposed examples, we have used the following state-dependent birth and death probabilities:
We have assumed that the death (birth) probability increases (decreases) with the number of allocated resources.
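As an illustration only (the paper's exact expression is not reproduced here), one plausible parameterisation with the stated monotonicity is linear in the utilisation: the birth probability falls from λ_max to λ_min, and the death probability rises to μ_max, as the allocation approaches capacity:

```python
def birth_death_probs(n_alloc, N, lam_max, lam_min, mu_max):
    """Hypothetical monotone parameterisation of the state-dependent
    probabilities: birth falls linearly from lam_max to lam_min and
    death rises linearly to mu_max as n_alloc approaches N."""
    u = n_alloc / N                          # utilisation in [0, 1]
    lam = lam_max - (lam_max - lam_min) * u  # fewer births when busy
    mu = mu_max * u                          # more departures when busy
    return lam, mu
```

The parameter vectors λ_max, λ_min, and μ_max of Example 2 plug directly into this form, one price class at a time.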
Example 2: Number of prices m = 3, number of resources N = 6, price c = [0.9 1.0 1.1], λ_max = [0.55 0.5 0.3], λ_min = [0.3 0.2 0.1], μ_max = [0.5 0.6 0.6].
The cardinality of the state space is 84, and the exact DP algorithm can be applied. Here, for the sake of simplicity, we enclose the prices in brackets, e.g., c = [c_1 c_2 c_3].
The following operational scenario has been considered:
• Specified time interval reservation requests at the consecutive time slots j = 2, 3, 4, with the durations reported in Table I.
TABLE I Specified Time Interval Reservation Requests for Example 2
• Unspecified time interval reservation request at the time slot j = 1 for the time slot h = 20.
The frequency distribution of the different actions (normalised with respect to the state space cardinality and over the finite time horizon T = 40) is shown in Fig. 3. In particular, the above-defined operational scenario (with reservation requests) is compared with the case of no reservation requests, where the DP off-line solution calculated at the beginning can be applied for instant resource needs. To analyze the result, one has to start from the terminal stage and move backwards. Since there are neither unspecified nor specified time interval reservations for the time slots j > 21, based on the principle of optimality, the frequency distributions of the reservation and no-reservation cases are identical.
Fig. 3. Comparison of the action frequency distributions for the reservation and no-reservation cases (Example 2).
As shown in the same figure, the frequency distribution of the rejection control ν increases for the intervals with reservations. Note that, for the case of no reservation, the control ψ is never applied; therefore, it is not plotted.
Additionally, the specified time interval reservation vector set x and the unspecified time interval reservation vector set s are depicted in Fig. 4. As shown in the figure, the price c_3 is more likely to be chosen than the others. Moreover, in the case of specified time interval reservations, the algorithm never proposes the price c_2 to the customer requests; hence, the associated curve is not plotted. In the case of unspecified time interval reservations, the algorithm only proposes the price c_3; hence, the other prices are not plotted.
Fig. 4. Specified and unspecified time interval reservation vectors x and s (Example 2).
Finally, it is worth highlighting that the specified time interval reservation request [h1, h2] = [14, 21] is rejected by the algorithm.
Example 3: The same resource allocation problem setup of Example 2 is considered, but with a more complex operational scenario.
The following operational scenario has been considered:
• Specified time interval reservation requests handled at the time slots j = 1, 4, 5, 7, ..., 12, with the durations reported in Table II.
TABLE II Specified Time Interval Reservation Requests for Example 3
• Unspecified time interval reservation requests handled at the time slots j = 2, 3, 6 for the time slots h = 12, 22, 25, respectively.
Thanks to the principle of optimality, the frequency distribution curves of the reservation and no-reservation cases are identical for the time slots j > 25, see Fig. 5. However, for the time interval 14 ≤ j ≤ 16, there are differences between such curves. In this interval, the rejection policy ν is at its maximum, since the system is fully reserved for the associated time slots.
Fig. 5. Comparison of the action frequency distributions for the reservation and no-reservation cases (Example 3).
Furthermore, the vectors x and s are depicted in Fig. 6. Unlike the previous scenario, the price c_2 is proposed for some intervals. The total numbers of customers corresponding to the reservation vectors x and s (regardless of the associated prices) are depicted in Fig. 7, which shows that all the resources are allocated for the time interval 14 ≤ j ≤ 16.
In this section, an essential literature review of the most pertinent articles on MDP modeling and related solutions for resource allocation problems is provided. A homogeneous continuous-time Markov chain is proposed in [26] to model patient flows for hospital ward management. The optimization of the matching between resources and demands is performed by means of a local search heuristic algorithm. MDPs are employed in [27] for Business Process Management with the goal of making appropriate decisions to allocate the resources, by trying to minimize the long-term cost and to improve the performance of the business process execution. A heuristic based Reinforcement Learning approach is adopted as the optimization method. In [28], a resource allocation problem for vehicular cloud computing systems is discussed. Since the objective is the maximization of the total long-term expected reward, the optimization is formulated as an infinite horizon MDP with the state space, action space, reward model, and transition probability distribution of the particular case study.
An iteration algorithm is utilized to develop the optimal scheme that computes which action has to be taken in a certain state. MDP based modelling and a solution methodology for scheduling patients in a multiclass, multi-resource surgical system are employed in [29]. The proposed model provides a scheduling policy for all surgeries, and minimizes a combination of the lead time between patient request and surgery date, overtime in the operating room, and congestion in the wards. A least squares temporal difference ADP approach is used to deal with the curse of dimensionality. One of the most important operations in the production of growing-finishing pigs is the marketing of pigs for slaughter. In [30], sequential marketing decisions at the herd level are considered as a high dimensional infinite horizon MDP. A value iteration ADP algorithm is used to estimate the value function for this infinite time horizon problem. The stochastic behavior of the food bank inventory system has been modelled by using an MDP in [31], which has the advantage of indicating the best way to allocate supplies based on the inventory level of the food bank. Such a paper presents a novel transformation of the state space to account for the large distribution quantities observed in practice, and shows that the particular underlying stochastic behavior can be approximated by a normal distribution. Similarly to our approach, both stochastic and deterministic aspects are addressed. In [32], a sequential resource allocation problem is studied, with an objective function aimed at equitable and effective service for the problem of distributing a scarce resource to meet customer demands. In this work, through a DP framework, the structure of the optimal allocation policy is characterized for a given sequence of customer demands, modelled as continuous probability distributions. By using the identified optimal structure, a heuristic based allocation policy for instances with discrete demand distributions has been proposed.
In some other works, resource allocation problems are treated as multi-agent systems. In [33], for instance, the dynamics of the agents are described by second order differential equations, while the agents communicate over weight-balanced and strongly connected digraphs. The optimization problem is formulated with a constrained convex objective function. The effectiveness of the method, however, is evaluated only for a small number of agents. An alternative approach can be found in [34], where the distribution of a common resource between two sources of time varying demand is carried out to develop time-efficient methods for minimizing delays at severely congested airports. In this work, the problem is formulated as a DP optimization, and the objective is based on the second moments of the stochastic queue lengths. It is shown that, for sufficiently high volumes of demand, optimal values can be well approximated by quadratic functions of the system state. Again, a heuristic based approach is applied as the ADP method. A comparison between Monte Carlo tree search and rolling horizon optimisation approaches is carried out in [35] for two challenging dynamic resource allocation problems: wildfire management and queuing network control. Even though this work shows interesting results, the reported techniques are suitable only for the specific applications considered. Another example of resource allocation strategies can be found in [36], where the problem of budget allocation of non-profit organizations over geographically distinct areas is tackled. The proposed solution consists in formulating the overall resource allocation problem as a two-stage stochastic mixed integer programming problem. A heuristic-based approach is finally used to simplify the original formulation.
An interesting variant for the solution of (stochastic) resource allocation problems is represented by MPC, which is especially suited when the dynamics of the system is considered to be variable over time (time-variant processes). In [37], for instance, it is shown how the stochastic resource allocation problem can be addressed by suitably modifying the MDP and the optimal control problem, and by using MPC to allocate resources over time. In particular, a new class of algorithms is proposed for the stochastic unreliable resource allocation problem, in which there is a chance that a task will not be completed by the resource assigned to it. However, a well-defined and accurate prediction model is needed a priori for an effective strategic allocation control. Similarly, in [38], the solution of the stochastic resource allocation problem makes use of MPC integrated with machine learning and a Markov chain model. The theory is based on a three layer lambda architecture, and is particularly tailored to the case study of a dispatch decision problem from an energy distribution utility.
As a general remark, it is noted that our paper provides a sufficiently comprehensive modeling framework, which is not limited to a specific application. Moreover, the optimization algorithms for price policy calculation exploit advanced ADP techniques to address the scalability issues of real-world applications, instead of resorting to heuristic or example driven methods. To the best of the authors' knowledge, the current literature on this topic does not have these features.
Admittedly, one of the main difficulties of applying ADP based approaches is the choice of proper basis functions, which is a current area of research (see Section VII-A). Moreover, the system's stochastic mechanisms and the related probability distributions are assumed to be known. If this is not feasible, one can resort to computer simulators (generating samples according to such probability distributions) or to Reinforcement Learning based approaches [8], [12].
Resource reservations in resource allocation problems have been modelled via a general-purpose MDP based framework. Stochastic DP based approaches have been proposed to compute proper pricing policies, and it has been shown how Bellman's principle of optimality can play a role in the implementation of the resulting reservation pricing algorithms. However, the resulting framework, which also includes an overbooking business strategy, becomes computationally intractable for realistic resource allocation problems. As a consequence, ADP techniques based on linear function approximations have been employed to solve such scalability issues. In particular, the novel USLS-ADP algorithm has been applied. Examples addressing both specified and unspecified time interval reservation requests have been presented, solved, and analyzed to verify the soundness of the proposed approach.
As for future work, we plan to apply the proposed framework to relevant business applications, such as flight ticket booking, urban parking management, and smart energy management systems. This implies defining the probability distributions of the underlying stochastic processes. In case of their unavailability, it is possible to resort to computer simulators (generating samples according to such probability distributions) or to reinforcement learning based approaches.
IEEE/CAA Journal of Automatica Sinica, 2020, Issue 4