
    HOPE: a heterogeneity-oriented parallel execution engine for inference on mobiles

    High Technology Letters, 2022, No.4

    XIA Chunwei (夏春偉), ZHAO Jiacheng, CUI Huimin, FENG Xiaobing

    (*Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, P.R.China)

    (**School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, P.R.China)

    Abstract

    Key words: deep neural network (DNN), mobile, heterogeneous scheduler, parallel computing

    0 Introduction

    Deep neural networks (DNNs) are increasingly adopted in mobile applications and have become the core building blocks of top apps[1]. Typically, for these applications, inference latency is a significant concern for users. Meanwhile, mobile platforms are evolving into heterogeneous embedded systems[2], which integrate a variety of computing devices into a mobile system-on-chip (SoC). These mobile computing devices exhibit huge diversity in performance, power consumption, memory capacity and programming interface[3].

    To efficiently exploit the heterogeneity and support artificial intelligence (AI) applications on heterogeneous mobile platforms, several frameworks have been proposed. For example, TFLite[4] can run inference workloads on the graphics processing unit (GPU) through its GPU delegate, or on other accelerators through the Android neural networks API (NNAPI). MNN[5] supports running DNN models on mobile GPUs through OpenCL, OpenGL, Vulkan or Apple Metal. These mobile frameworks provide fundamental support for running a DNN model across different platforms. However, there is still a lack of heterogeneity-aware models that automatically distribute a DNN model across the on-chip processing units. To address this challenge, MOSAIC[6] used a heterogeneity-, communication-, and constraint-aware model slicing approach to distribute a DNN model across the computing devices of a heterogeneous platform. However, MOSAIC treats the DNN model as a linear model without exploring inter-operation parallelism. On traditional CPU-GPU hybrid servers, researchers have proposed a number of approaches for dynamically scheduling parallel tasks to heterogeneous hardware[7-11], and a representative work is StarPU[12]. However, these runtime heuristics are not applicable to mobile platforms. The reason is that DNN models have special directed acyclic graph (DAG) topology structures, which are built from layers or blocks with similar structures and sizes, especially for large models. StarPU is unaware of these DAG topology characteristics, thus missing the opportunity to globally determine the scheduling policy and causing performance loss. Ref.[13] proposed μlayer, which accelerates each NN layer by simultaneously utilizing the heterogeneous computing devices on mobile SoCs.

    To materialize the optimization, there are several major challenges. Challenge 1: trade-off between performance and scheduling time. Scheduling on a mobile platform requires an acceptable scheduling time to obtain the execution plan. StarPU is lightweight but its performance is degraded, leaving the optimization opportunity submerged. Challenge 2: communication overhead. Distributing DNN operators to heterogeneous computing devices introduces communication overhead, which requires careful consideration. For DNN frameworks whose CPU and GPU kernels use different data layouts, μlayer introduces significant data layout translation overhead.

    In this paper, HOPE, an end-to-end lightweight heterogeneous inference framework, is proposed to distribute a DNN model coordinately across the different computing devices of a platform. The key insight is that many DNN models exhibit inter-operation parallelism, which enables different operations to be executed in parallel on multiple computing devices. Meanwhile, the topology characteristics of DNN models can be used to search for an optimal scheduling solution. HOPE first profiles the computation latency of each DNN operator, together with the communication cost of each tensor between every two computing devices. Then, the problem is formalized as an integer linear program (ILP); for complex DNN models that would require an extremely long time in the ILP solver, HOPE partitions the graph into multiple subgraphs and solves the ILP for each subgraph individually.

    For challenge 1, two distinct algorithms are proposed. The ILP-based scheduling method is proposed for the best performance but longer scheduling time, while the heuristic method is designed to obtain the execution plan much faster with moderate performance. Typically, it takes several seconds for the ILP-based algorithm to get the execution plan, while the heuristic method takes less than one second. A heuristic algorithm is also proposed to partition the DNN model into multiple modules, with each module containing several operators. Finally, HOPE includes an execution engine for simultaneously launching modules to different computing devices according to the execution plan. Experimental results demonstrate that HOPE can reduce inference latency by up to 36.2% (22.0% on average) compared with MOSAIC, by up to 22.0% (10.2% on average) compared with StarPU, and by up to 41.8% (18.4% on average) compared with μlayer.

    1 Problem formalization

    The problem of scheduling a dataflow graph model across multiple heterogeneous computing devices can be formalized as follows.

    A dataflow graph can be described with G(V, E), where V represents the set of vertexes and E represents the set of edges. Typically, a vertex vi ∈ V represents an operator such as a convolution in DNN models, while an edge (vi, vj) represents that operator vj depends on the output of operator vi; therefore, vj cannot start executing until vi is finished. Given a dataflow graph model G(V, E) and a set D containing all computing devices of the target platform, for each vi ∈ V, its execution time on device dj is cli,j; for each (di, dj) pair (di, dj ∈ D), the communication cost is C(di,dj)(T) if a tensor T is required to be transferred from di to dj. The algorithm's objective is to find an execution plan R determining the execution device dj for each node vi, which minimizes the end-to-end latency τ(R, G). An execution plan R(b, ψ) contains two parts, i.e., a mapping matrix b indicating the target device for each vi, and an order matrix array ψ indicating the execution order of the operators mapped to each device. In R(b, ψ), b is a two-valued mapping matrix recording the selected device for each operator, with each element bi,j representing whether node vi is scheduled to device dj, i.e.,
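    The equation itself is not preserved in this copy of the paper; a plausible reconstruction that is consistent with the surrounding definitions is

    b_{i,j} = \begin{cases} 1, & \text{if node } v_i \text{ is scheduled to device } d_j, \\ 0, & \text{otherwise.} \end{cases}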

    ψ is a matrix array, and the j-th element is a matrix representing the execution order of the operators mapped to dj, i.e.,
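    The matrix itself is missing here; consistent with the definition of ψi,k,j in subsection 2.1.4, a plausible form of the j-th element is

    \psi^{(j)} = [\psi_{i,k,j}], \qquad \psi_{i,k,j} = \begin{cases} 1, & \text{if } v_i \text{ executes before } v_k \text{ on } d_j, \\ 0, & \text{otherwise.} \end{cases}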

    2 ILP-based theoretically optimal solution

    In this section, the parallel DAG execution problem is first formalized as an ILP, considering the computation time of each node and the communication cost between computing devices. Then a graph partitioning algorithm is proposed to reduce the graph complexity for the ILP solver.

    2.1 ILP formulation

    2.1.1 Node latency

    The overall processing latency of a node vi on a device dj comes from two parts. The first part is the kernel computation latency cli,j of node vi on computing device dj. If node vi is not supported by dj, cli,j is set to infinity (+∞). The second part is the communication latency of receiving tensors from vi's predecessors. In particular, for a predecessor vk, the communication latency from vk to vi is denoted as C(d(vk),d(vi))(T), where d(vk) and d(vi) represent the computing devices of vk and vi respectively, and T represents the tensor transferred from vk to vi. Furthermore, when vi has multiple predecessors, its total communication cost is computed by aggregating the tensors from all its predecessors, i.e.,
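    The aggregate form is not shown in this copy; a plausible reconstruction from the definitions above (with pred(vi) denoting vi's predecessors and Tk,i the tensor passed from vk to vi) is

    \mathrm{comm}_i = \sum_{v_k \in \mathrm{pred}(v_i)} C_{(d(v_k),\,d(v_i))}(T_{k,i})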

    Therefore, the processing latency ti,j of node vi on device dj can be formalized as the following expression.
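    The expression is missing in this copy; a reconstruction consistent with the term (bi,j − bk,j) × C(d(vk),d(vi))(T) discussed below is

    t_{i,j} = b_{i,j}\, cl_{i,j} + \sum_{v_k \in \mathrm{pred}(v_i)} (b_{i,j} - b_{k,j})\, C_{(d(v_k),\,d(v_i))}(T_{k,i})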

    The term (bi,j − bk,j) × C(d(vk),d(vi))(T) describes whether there is communication overhead between node vi and its predecessor vk. If node vi and node vk are distributed to different devices (for example, node vk is dispatched to the CPU and node vi to the GPU), then the output tensor produced by node vk needs to be transferred from the CPU memory space to the GPU memory space. If (bi,j − bk,j) is less than zero, then bi,j must be zero and bk,j must be one; the right-hand side of the inequality is negative and the constraint always holds. In other words, when vi is not mapped to dj, its predecessor vk imposes no communication constraint on vi on that device. Note that the communication from the predecessors can overlap with the execution of operators that have no precedence constraints with vi.

    2.1.2 Objective function

    The objective is to minimize the execution time of the computation graph, thus an auxiliary variable sti,j is introduced to describe that node vi starts its execution on device dj at time sti,j. The end-to-end computation latency of the DNN model is the interval from the starting time of the first node to the completion time of the last node. Therefore, the objective is to minimize the end-to-end latency.
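    The objective function is not shown in this copy; assuming the first node starts at time 0 and M is the large constant defined below, a plausible linear form is

    \min \ \tau \quad \text{s.t.} \quad \tau \ \ge\ st_{i,j} + t_{i,j} - M\,(1 - b_{i,j}), \qquad \forall v_i \in V,\ \forall d_j \in D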

    2.1.3 Graph topology constraints

    For node vi and node vk, if there is an edge (vi, vk) ∈ E, the graph topology constraint must be satisfied:
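    The constraint itself is missing in this copy; a plausible big-M form that matches the explanation below (the constraint becomes trivial when bi,j = 0) is

    st_{k,j'} \ \ge\ st_{i,j} + t_{i,j} - M\,(1 - b_{i,j}), \qquad \forall (v_i, v_k) \in E,\ \forall d_j, d_{j'} \in D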

    This constraint describes that any node vk on any device cannot start execution until all its predecessors vi are finished. If bi,j is 0, the constraint always holds. Note that M is a sufficiently large positive number.

    2.1.4 Device constraints

    In this paper, it is assumed that a computing device is not shared by multiple operators simultaneously, thus at any time there can be at most one operator running on each device; moreover, the execution of a task cannot be interrupted. If there is a path from node vi to node vk, the device constraint is naturally satisfied, since the graph topology constraint guarantees that vk starts after vi. Therefore, only pairs of nodes between which there is no path introduce the following constraint.

    This constraint illustrates that vi and vk can be executed in any order but cannot execute in parallel on device dj. An auxiliary variable ψi,k,j and a sufficiently large number M are introduced for the ILP formalization, as
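    The linearized constraints are missing in this copy; a plausible big-M formulation consistent with the description of ψi,k,j below is

    st_{i,j} \ \ge\ st_{k,j} + t_{k,j} - M\,\psi_{i,k,j} - M\,(2 - b_{i,j} - b_{k,j})
    st_{k,j} \ \ge\ st_{i,j} + t_{i,j} - M\,(1 - \psi_{i,k,j}) - M\,(2 - b_{i,j} - b_{k,j})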

    where ψi,k,j is a binary variable that describes the execution order of nodes vi and vk. If it equals 0, vi executes after vk, and vice versa.

    2.1.5 Node constraints

    Each node is required to be executed exactly once:
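    The constraint is not preserved in this copy; it presumably takes the standard form

    \sum_{d_j \in D} b_{i,j} = 1, \qquad \forall v_i \in V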

    2.1.6 Summary

    With constraints Eqs(1)-(6), the problem can be solved by using a standard ILP solver, e.g., GLPK[13]. Finally, the execution plan is expressed using two variables: bi,j describing the device placement of the nodes, and ψi,k,j describing the execution order of the node pair vi and vk.

    2.2 Graph partitioning problem

    The graph partitioning problem is formalized as follows. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic two-way partition P = {V0, V1} of V such that the balance constraint below is satisfied and the vertex cut is minimized.

    The set {V0, V1} represents the vertexes of the two subgraphs of G. w(v) is the weight of a vertex and is set to 1 for all vertices. w(Vi) is the sum of w(v) over all vertexes v in Vi. To find a proper partitioning point, the concept of upward rank is used, defined below.
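    Both formulas are missing from this copy. Consistent with the example in Fig.1 (where the bound evaluates to (13/2) × (1 + 0.2) = 7.8), the balance constraint presumably takes the form

    w(V_i) \ \le\ (1 + \varepsilon)\,\frac{w(V)}{2}, \qquad i \in \{0, 1\},

    and the upward rank is presumably defined recursively as (taking 1 as the base value for input vertices is an assumption here)

    \mathrm{uprank}(v_i) = \begin{cases} 1, & \mathrm{pred}(v_i) = \varnothing, \\ 1 + \max_{v_k \in \mathrm{pred}(v_i)} \mathrm{uprank}(v_k), & \text{otherwise.} \end{cases}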

    Intuitively, the upward rank of vi is the length of the longest path from the input vertex(es) of G to vi. The upward rank of all vertexes in a DAG can be computed in a single traversal of G in topological order, with time complexity O(|V| + |E|). The upward rank values are used to partition the DAG into subgraphs by recursively partitioning the subgraphs until every subgraph has fewer than NT vertices. NT is set to 12 in the evaluation and can be configured.

    Algorithm 1 shows the pseudo-code of the partitioning approach based on upward rank. The algorithm tries to find an upward rank value that partitions G into two parts with approximately equal numbers of vertexes while minimizing the vertex cut. It first computes the upward rank value of each vertex in the graph (line 1). Then the minimum and maximum upward rank values are computed (lines 2-3). After that, the algorithm counts the number of vertexes with each upward rank value by traversing G (lines 5-6). Then, in increasing order of upward rank, the accumulated number of vertexes is summed up (lines 7-8); the variable accUprank[rank] represents how many vertexes have an upward rank value less than or equal to rank. The algorithm then traverses all upward rank values and finds the one for which the balance constraint is satisfied and the vertex cut is minimized, yielding the partitioning result with the minimum vertex cut (lines 9-18). The computational complexity of this partitioning algorithm is O(|V|log(|V|)). The balance factor ε is set to 0.2 in this paper; if no balanced partition can be obtained, the algorithm increases ε by 0.1 until a valid partition scheme is found.
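    Algorithm 1 itself is not reproduced in this copy. The following Python sketch illustrates the described procedure; all function and variable names are illustrative rather than the authors', and the vertex-cut computation is a straightforward (unoptimized) version.

    from collections import defaultdict

    def partition_by_uprank(nodes, preds, succs, eps=0.2):
        """Split a DAG into two parts {V0, V1} by an upward-rank threshold."""
        # Compute upward rank in topological order, O(|V| + |E|).
        uprank, indeg = {}, {v: len(preds[v]) for v in nodes}
        ready = [v for v in nodes if indeg[v] == 0]
        while ready:
            v = ready.pop()
            uprank[v] = 1 + max((uprank[u] for u in preds[v]), default=0)
            for s in succs[v]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)

        total = len(nodes)
        while True:
            # Count vertexes per rank, then accumulate in increasing rank order.
            count = defaultdict(int)
            for v in nodes:
                count[uprank[v]] += 1
            acc_uprank, acc = {}, 0
            for r in sorted(count):
                acc += count[r]
                acc_uprank[r] = acc

            # Pick the threshold satisfying the balance constraint with the
            # smallest vertex cut (vertexes in V0 having a successor in V1).
            limit = (1 + eps) * total / 2
            best_rank, best_cut = None, None
            for r, n_left in acc_uprank.items():
                if n_left > limit or total - n_left > limit:
                    continue
                cut = len({u for v in nodes for u in preds[v]
                           if uprank[u] <= r < uprank[v]})
                if best_cut is None or cut < best_cut:
                    best_rank, best_cut = r, cut
            if best_rank is not None:
                break
            eps += 0.1  # relax the balance factor until a valid split exists

        V0 = [v for v in nodes if uprank[v] <= best_rank]
        V1 = [v for v in nodes if uprank[v] > best_rank]
        return V0, V1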

    Fig.1 shows an example of graph partitioning using upward rank. The number in each vertex is its upward rank. There are 6 vertexes whose upward rank value is less than or equal to 5 and 7 vertexes whose upward rank value is greater than 5. If the graph is partitioned at upward rank 5, the balance constraint is satisfied, as max(6, 7) ≤ (13/2) × (1 + 0.2) = 7.8, and the vertex cut is 1, which is also minimal. The coarse graph with V0 and V1 as its vertexes is also acyclic.

    3 Heuristic scheduling algorithm

    In this section, a greedy algorithm is proposed to rapidly obtain a near-optimal execution plan Re, including the device placement of each operator and the execution order of the nodes on each device. The key point is that the algorithm schedules a batch of operators at a time, for which the optimal execution plan is obtained. Furthermore, operators having the minimal starting execution time st are scheduled first, where the minimal starting execution time of an operator is defined as the maximal completion time of all its predecessor nodes.

    3.1 Greedy search algorithm

    First, the algorithm greedily selects the top-K nodes that have the minimal starting execution times, by sorting the starting execution times of all operators in increasing order; these top-K nodes become the candidate operators for scheduling. Second, the algorithm enumerates all possible operator-to-device mappings and selects the mapping leading to the minimal completion time of these K operators. If a DAG has y nodes and the target mobile platform has x devices, the heuristic algorithm's computation complexity is O((y/k) × x^k). The algorithm empirically sets k to 4 for two computing devices and 3 for three computing devices. Finally, the algorithm updates the starting execution times of the successors of these K nodes using the computed completion times of the K nodes. Algorithm 2 shows the pseudo-code.
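    Algorithm 2 is not reproduced in this copy. The Python sketch below illustrates the batch-greedy idea described above; the names, the cost-model callbacks and the tie-breaking are illustrative assumptions, not the authors' implementation.

    import itertools

    def greedy_schedule(nodes, preds, devices, comp, comm, K=4):
        """Batch-greedy scheduler sketch.

        comp[v][d]: computation latency of node v on device d.
        comm(u, du, v, dv): cost of moving u's output tensor from du to dv.
        Returns a {node: device} placement and per-node completion times.
        """
        placement, finish = {}, {}
        device_free = {d: 0.0 for d in devices}   # earliest free time per device
        est = {v: 0.0 for v in nodes}             # minimal starting execution time
        remaining = set(nodes)

        while remaining:
            # Take the K schedulable nodes with the smallest start times.
            ready = [v for v in remaining if all(p in finish for p in preds[v])]
            batch = sorted(ready, key=lambda v: est[v])[:K]

            # Enumerate every operator-to-device mapping for the batch and keep
            # the one that minimizes the completion time of the whole batch.
            best, best_makespan = None, float("inf")
            for mapping in itertools.product(devices, repeat=len(batch)):
                free, fin = dict(device_free), {}
                for v, d in zip(batch, mapping):
                    start = max([free[d]] +
                                [finish[p] + comm(p, placement[p], v, d)
                                 for p in preds[v]])
                    fin[v] = start + comp[v][d]
                    free[d] = fin[v]
                makespan = max(fin.values())
                if makespan < best_makespan:
                    best, best_makespan = (mapping, fin, free), makespan

            mapping, fin, free = best
            for v, d in zip(batch, mapping):
                placement[v], finish[v] = d, fin[v]
                remaining.discard(v)
            device_free = free
            # Update the successors' minimal starting execution times.
            for v in remaining:
                est[v] = max([0.0] + [finish[p] for p in preds[v] if p in finish])
        return placement, finish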

    3.2 Node merging for heuristic algorithm

    In DNN models, there exist several short-running operators that have very low computation latency compared with the computation-intensive operators. For example, on the Redmi, the latency of ReLU is less than 0.1 ms while that of Conv2D may be more than 3 ms. This observation motivates the introduction of a node merging algorithm: for a short-running operator that has only one predecessor, the algorithm prefers to dispatch it to the same computing device as its predecessor. The merged node is called a super-op. The algorithm first runs the node merging pass and then calls the heuristic scheduler to generate the execution plan. The nodes merged into a super-op are scheduled to the same device. For frameworks that support kernel fusion, HOPE's scheduler schedules the fused kernels as a whole.
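    A minimal Python sketch of such a merging pass is shown below; the 0.1 ms threshold and all names are illustrative assumptions rather than the authors' code.

    def merge_short_ops(nodes, preds, comp, threshold_ms=0.1):
        """Fold short-running, single-predecessor ops into their predecessor.

        comp[v][d] is the profiled latency of op v on device d. Returns a map
        op -> representative super-op; members of one super-op are later placed
        on the same device by the scheduler.
        """
        rep = {v: v for v in nodes}            # union-find style representative

        def find(v):
            while rep[v] != v:
                rep[v] = rep[rep[v]]           # path compression
                v = rep[v]
            return v

        for v in nodes:
            short = max(comp[v].values()) < threshold_ms   # cheap on every device
            if short and len(preds[v]) == 1:
                rep[find(v)] = find(preds[v][0])           # merge into predecessor
        return {v: find(v) for v in nodes}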

    4 Framework

    Fig.2 shows the HOPE framework. First, a DNN model is fed to the benchmarking engine, which collects the execution time of each node on each computing device of the target platform, together with the communication cost between each pair of devices for the tensors in the model. Second, graph pre-processing partitions the DAG into several smaller modules, with each module consisting of a number of super-ops, as discussed in subsection 4.2. Third, the heterogeneity-aware scheduler leverages the scheduling algorithm in subsection 2.1 (ILP scheduler) or subsection 3.1 (GS scheduler) to generate an execution plan that determines the execution order of all nodes and the target device for each node. Finally, the HOPE heterogeneous execution engine uses the generated execution plan and launches the parallel execution across different computing devices.

    Fig.2 System overview

    4.1 Benchmarking engine

    The benchmarking engine serves to collect all the profiles needed by the scheduler. First, it runs the DNN model, launching each node on each available computing device, and records the corresponding execution time. Second, it collects the information of all tensors passed between adjacent nodes, including their shape and layout information, synthesizes benchmarks that mandatorily transfer each tensor between every pair of computing devices, applies the corresponding layout transformations when necessary, and records the resulting execution times as the communication cost. In some cases, communication does not introduce extra cost, e.g., transferring a tensor between a big core and a little core of the same CPU; such communication cost is set to zero. Typically, different computing devices support different tensor layouts via vendor-provided or third-party libraries. For convolution operations in the MNN framework, the CPU implementation uses the NC4HW4 data layout[5], while the GPU OpenCL implementation uses an OpenCL image object with shape (C/4 × W, N × H, 4). The profiling overhead for the communication cost and layer-wise computation latency is limited, less than 10 min for one DNN model, and the performance variation is low (around 1%), thus this paper only reports average results. The number of distinct tensor shapes is much smaller than the number of operators; for example, NASNET-large has 1049 operators but only 39 distinct tensor shapes.

    4.2 Heterogeneous execution engine

    The heterogeneous execution engine reads an execution plan and executes the DNN model by launching each operator on its target computing device. Before execution, the engine first inserts the necessary data transfer and layout transformation statements according to the determined execution plan. Fig.3 shows an example of inserting a communication operator to transfer and transform tensor 1 from the CPU memory space to the GPU memory space.

    Fig.3 Example of inserting communication operator

    The execution engine creates an individual thread for each computing device, with the main thread serving the CPU big core. To avoid interference across the threads of the computing devices, HOPE uses the sched_set_affinity API to bind each thread to a dedicated CPU core. Users can configure the number of threads for operators running on the CPU. The execution engine introduces a queue for each computing device to keep the nodes that are placed on that device according to the execution plan. Each operator has a flag executed indicating whether it has been executed. After an operator vi is executed, the engine traverses all its successor operators vk in the order determined by the execution plan. For a given successor vk, if it is ready for execution and is mapped to the same device, it is launched; if it is ready for execution but mapped to another device, the engine notifies the thread of the target device and the operator is launched on that device. Otherwise, the thread keeps sleeping.
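    A condensed Python sketch of this per-device worker loop is shown below; the real engine is written in C++, and the field names (plan.order, plan.device_of, the executed flags) are illustrative assumptions.

    import threading
    from collections import deque

    class DeviceWorker(threading.Thread):
        """One thread per computing device that drains its queue in plan order."""

        def __init__(self, device, plan, graph, workers, run_kernel):
            super().__init__(daemon=True)
            self.device = device
            self.queue = deque(plan.order[device])   # nodes assigned to this device
            self.plan, self.graph, self.workers = plan, graph, workers
            self.run_kernel = run_kernel             # callable(node, device)
            self.wake = threading.Event()

        def run(self):
            g = self.graph
            while self.queue:
                node = self.queue[0]
                if all(g.executed[p] for p in g.preds[node]):   # ready?
                    self.queue.popleft()
                    self.run_kernel(node, self.device)   # incl. inserted comm ops
                    g.executed[node] = True
                    for s in g.succs[node]:              # notify owners of successors
                        self.workers[self.plan.device_of[s]].wake.set()
                else:
                    self.wake.wait()                     # sleep until a pred finishes
                    self.wake.clear()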

    4.3 Tensor caching

    Motivated by the observation that a tensor may be used multiple times on a device, HOPE introduces a tensor-oriented optimization named tensor caching. After a tensor has been transferred and transformed to a device, it is cached on that device, so that when the tensor is reused the corresponding communication cost is eliminated. As soon as no remaining operator will read it, the tensor is freed.

    4.4 Optimizing CPU and GPU communication

    For the mobile systems in which CPU and GPU share the same physical memory and support shared virtual memory, HOPE provides an optimization to eliminate the memory copy in the communication between CPU and GPU. Specifically, OpenCL provides a few mechanisms to avoid costly memory copies between CPU and GPU. HOPE utilizes the CL_MEM_ALLOC_HOST_PTR flag provided by OpenCL to create OpenCL buffers, and the clEnqueueMapBuffer API to map the OpenCL buffer to a host pointer. Then the CPU can update the mapped OpenCL buffers to transform the tensor data layout directly without additional memory copy.
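    A minimal pyopencl sketch of this zero-copy pattern is given below; the actual engine uses the OpenCL C API from C++, and the buffer size and the in-place update here are placeholders.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    nbytes = 4 * 1024 * 1024
    # CL_MEM_ALLOC_HOST_PTR asks the driver for host-visible memory, so mapping
    # the buffer needs no copy on SoCs where CPU and GPU share physical memory.
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR, nbytes)

    # clEnqueueMapBuffer: expose the buffer to the CPU as a NumPy array.
    mapped, _ = cl.enqueue_map_buffer(
        queue, buf, cl.map_flags.READ | cl.map_flags.WRITE,
        0, (nbytes // 4,), np.float32)

    # The CPU can rewrite the tensor in place (e.g. a layout transform)
    # without an extra host<->device memcpy.
    mapped[:] = np.arange(nbytes // 4, dtype=np.float32)

    mapped.base.release(queue)   # unmap; GPU kernels can then use `buf` directly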

    5 Evaluation

    5.1 Platform and benchmark

    Mobile systems. The experiments are conducted on 3 commercial mobile phones: Redmi Note 4X (low-end (LE), Snapdragon 625), Xiaomi 9 SE (medium-end (ME), Snapdragon 712), and HUAWEI P40 (high-end (HE), Kirin 990), ranging from low-end to high-end mobile platforms.

    DNNs and datasets. HOPE is evaluated on seven popular DNN models: six CNNs, namely Inception-v3 (IN-V3)[14], Inception-v4 (IN-V4)[15], PNASNET-mobile (PN-M)[16], PNASNET-large (PN-L)[16], NASNET-large (NAS-L)[17] and SqueezeNet (SQZ)[18], and one long short-term memory (LSTM)[19] model which follows the configuration in DeepSpeech[20]. All the CNN models are trained on the ImageNet dataset. The LSTM has one layer with 1024 hidden states and 10 time steps. The workloads include DNN models highly optimized for mobile systems (e.g., PN-M and SQZ) as well as models designed for the highest accuracy (e.g., IN-V4 and PN-L). Furthermore, the evaluated CNN models exhibit widely different numbers of operators, ranging from 40 to 1049, which indicates that efficient scheduling algorithms are required to place the numerous operators onto computing devices for heterogeneous parallel execution.

    Implementation. The latest version of the GNU linear programming kit (GLPK) is used to solve the ILP formulation. The heterogeneous execution engine is written in 5000 lines of C++; the graph partitioning and scheduling algorithms are written in 4000 lines of Python.

    5.2 Overall performance

    To investigate the effectiveness of HOPE in terms of inference latency under different computing device combinations and configurations, a series of comparative experiments has been conducted. MOSAIC, a state-of-the-art heterogeneous execution engine for DNN model inference, and StarPU's dequeue model data aware ready (DMDAR) policy, a suitable and classical heuristic used as a reference in many recent scheduling studies, are compared with the proposed approaches. On the LE mobile system, 4 CPU cores are used; on the ME and HE mobile systems, the 2 big cores are used; the GPU is used on all three systems. Note that StarPU does not use all the CPU cores on the big.LITTLE architecture mobile platforms. The reason is that the framework is unaware of the heterogeneous processing architecture and would distribute workloads evenly to big and little cores; without careful scheduling, the little cores would slow down the computation. HOPE is aware of the heterogeneity of CPU clusters, and the detailed experimental results are shown in subsection 5.5. Fig.4 shows the normalized inference latency of the seven DNN models on the three mobile systems. For each DNN model, six different versions are evaluated: CPU only (CPU), GPU only (GPU), CPU and GPU with MOSAIC as the scheduler (MOSAIC), CPU and GPU with StarPU's DMDAR policy (StarPU), CPU and GPU with HOPE's heuristic scheduler (HOPE:GS), and CPU and GPU with HOPE's ILP scheduler (HOPE:LP). The inference latency of each version is normalized to that of the GPU version.

    Fig.4 Normalized inference latency on the three mobile systems

    The tensor cache optimization is enabled by default for all versions. Fig.4 demonstrates the effectiveness of HOPE's ILP and heuristic schedulers. First, HOPE significantly outperforms MOSAIC. Specifically, HOPE with the ILP scheduler exhibits 23.4%, 24.1%, 2.9%, 29.4%, 31.1%, 21.3% and 23.3% lower latency than the MOSAIC version for the seven DNN models respectively, averaged over the three mobile systems. This is because HOPE can schedule the DNN operators to different computing devices and run them in parallel. Second, HOPE with the ILP scheduler exhibits 13.7%, 9.1%, 2.9%, 17.9%, 18.0%, 10.0% and 0% lower latency than the StarPU version. HOPE:LP significantly reduces inference latency by finding the optimal solution for each subgraph and globally determining the scheduling policy. Third, HOPE:GS also reduces the inference latency by 8.2%, 3.2%, 0.5%, 12.9%, 11.7%, 0.0% and 0.0% compared with the StarPU version. The performance gain comes from two aspects: first, HOPE:GS merges nodes before scheduling, which reduces the potential communication cost; second, HOPE:GS schedules nodes in a larger search window (with K nodes) while StarPU considers only one node. Note that for LSTM there is no latency reduction compared with StarPU.

    The reason is that the LSTM cell has only four gates and its structure is rather simple. The experimental results clearly demonstrate the effectiveness of HOPE in that it can effectively reduce the inference latency compared with MOSAIC and StarPU. HOPE exhibits an inference latency similar to that of the CPU-only version for PN-M on all three mobile systems. This is mainly because PN-M is specifically optimized for CPU inference; for instance, running PN-M takes 43 ms on the CPU but 183 ms on the GPU of the HE system.

    5.3 Scalability with CPU performance

    Next, considering that the peak performance ratio between CPUs and GPUs varies across mobile SoCs[3], the frequency of the CPU cores is varied to investigate the performance scalability of the HOPE scheduler. The LE mobile system is chosen, and its available CPU frequency ranges from 652 MHz to 2016 MHz. The maximum, medium and minimum CPU frequencies in the CPU's scaling_available_frequencies list, together with the GPU, are used for evaluation. Fig.5 shows the normalized inference latency of each version on the DNN models. HOPE:LP reduces inference latency by up to 26.4% (19.2% on average), 25.2% (17.5% on average) and 29.1% (22.1% on average) compared with the MOSAIC version, and by up to 17.2% (10.8% on average), 20.4% (9.3% on average) and 17.0% (10.3% on average) compared with the StarPU version, at the minimum, medium and maximum CPU frequencies respectively. Similar observations hold for the HOPE:GS version. The experimental results show that HOPE still outperforms the MOSAIC and StarPU versions when the peak performance ratio between the CPU and GPU varies.

    5.4 Comparison withμlayer

    This work targets full-precision DNN model inference, thus HOPE is compared against μlayer using FP32. The μlayer runtime is implemented on the ARM compute library (ACL) as described in Ref.[13]. One benefit of using ACL rather than MNN is that the NEON- and OpenCL-based kernels in ACL use the same data layout, so no data layout conversion overhead is introduced when the CPU and GPU need to synchronize.

    Fig.6 shows the normalized execution latency of μlayer and the two scheduling algorithms of HOPE. The results show that HOPE:GS can reduce the computation latency by up to 32.9% (HE), 32.8% (ME) and 25.4% (LE) over the single-layer acceleration of μlayer. HOPE:LP can reduce the computation latency by up to 40.6% (HE), 35.9% (ME) and 28.0% (LE). The reason is that the single-layer acceleration of μlayer needs fine-grained CPU-GPU synchronization for each layer while HOPE does not. Note that for LSTM, HOPE does not reduce the latency over μlayer, as HOPE and μlayer partition the LSTM cell with the same split ratio. HOPE reduces the computation latency by 25.8% and 21.3% on average for the HE and ME platforms respectively, but by only 8.0% for LE. The reason is that the performance of the CPU and the GPU on the LE platform is severely imbalanced.

    Fig.6 Comparison with μlayer

    5.5 CPU big.LITTLE

    Mobile SoCs adopt big.LITTLE technology to balance performance and power efficiency: the ‘LITTLE’ processors are designed for maximum power efficiency while the ‘big’ processors are designed to provide maximum compute performance. HOPE can schedule DNN models considering the heterogeneous processing architecture of the CPU. HOPE's big.LITTLE awareness is evaluated using 2 big cores + 4 little cores on the HE mobile system. The communication overhead is eliminated when using the big and little clusters, as they share the same memory space and tensor layout for computation. Fig.7 shows the result. Specifically, HOPE:LP reduces inference latency by 15.7%, 2.2%, 2.6%, 9.8%, 8.3%, 3.2% and 0% compared with StarPU. HOPE can efficiently reduce the inference latency and improve the quality of service (QoS) when using only the CPU. HOPE:GS exhibits performance similar to StarPU, since there is no communication cost and HOPE:GS cannot benefit from merging nodes.

    Fig.7 Normalized latency with big.LITTLE cores

    6 Conclusion

    In this paper, HOPE, a heterogeneity-oriented parallel execution engine for DNN model inference on mobile systems, is proposed. HOPE profiles the computation latency of each operator of the models, pre-processes the DAG into modules, and schedules the DAG with an ILP-based or a greedy-based algorithm to determine a near-optimal heterogeneous execution plan. The experimental results show that HOPE can significantly reduce the computation latency compared with the state-of-the-art works MOSAIC, StarPU and μlayer. In the future, DNN models with control flow will be supported, and policies will be proposed to dynamically schedule operators. Furthermore, the potential benefits of collaborative execution on accelerators such as NPUs and DSPs will also be explored.
