
    Exploring the Approaches to Data Flow Computing

    Computers, Materials & Continua, 2022, Issue 5

    Mohammad B. Khan, Abdul R. Khan and Hasan Alkahtani

    1 Department of Electrical and Computer Engineering, Technical University of Munich, Arcisstrasse 21, 80333 Munich, Germany

    2 Department of Computer Science, CCSIT, King Faisal University, 31982, Al Ahsa, KSA

    Abstract: Architectures based on the data flow computing model provide an alternative to the conventional Von-Neumann architecture that is widely used for general-purpose computing. Processors based on the data flow architecture employ fine-grain, data-driven parallelism. These architectures have the potential to exploit the inherent parallelism in compute-intensive applications like signal processing and image and video processing, and can thus achieve higher throughputs and better power efficiency. In this paper, several data flow computing architectures are explored, and their main architectural features are studied. Furthermore, a classification of the processors is presented based on whether they employ the data flow execution model exclusively or in combination with the control flow model, and they are accordingly grouped as exclusive data flow or hybrid architectures. The hybrid category is further subdivided into conjoint or accelerator-style architectures depending on how they deploy and separate the data flow and control flow execution models within their execution blocks. Lastly, a brief comparison and discussion of their advantages and drawbacks is presented. From this study we conclude that although data flow architectures have matured significantly, issues like data-structure handling and the lack of efficient placement and scheduling algorithms have prevented them from becoming commercially viable.

    Keywords: Processor architecture; data flow architectures; Von-Neumann model; control flow architectures

    1 Introduction

    With the increase in applications that demand high computational power while being power efficient, the need for efficient parallel computing resources in general-purpose as well as application-specific processors has increased manifold in recent years. The traditional Von-Neumann architecture is inherently sequential, as it employs a program counter and sequences through the program instructions. Although a number of parallel processing techniques are deployed in modern Von-Neumann processors, there are still applications, such as those in signal processing and network processing, which are inherently parallel in nature. To exploit this inherent parallelism, a contrasting processing paradigm called data flow processing was proposed, and the early proponents of this model of computation believed that it would provide extensive computing power and could replace the traditional control flow architecture.

    Data flow computing provides a few major advantages over control flow for parallel processing. Firstly, the data flow model is asynchronous: instruction execution is determined by the availability of operands, which provides an implicit synchronization of parallel instructions. Secondly, the data flow graph representation eliminates the need for explicit management of the parallel execution of a program by exposing the inherent parallelism of the application.

    The first proposals for data flow machines were made in the 1970s [1,2]. In the following decades, a number of other proposals for such processors were made; however, most of these were based on distributed sets of processing elements (PEs), and given the cost of implementing large numbers of PEs on silicon, none of them were economically viable. There was a renewed interest in data flow computing in the early 2000s, and some of the proposed designs included support for imperative languages [3,4]. Furthermore, to overcome the inefficiencies of pure data flow processors, there was a shift towards adopting a hybrid model which combines both the Von-Neumann and data flow models. Although data flow processing has been adopted in some accelerators, to the best of our knowledge there has been no commercially viable general-purpose data flow processor to date. This is because some key issues in this model of computation, such as the efficient allocation of data flow graphs, still require additional research [5].

    In this paper, we explore the most prominent data flow architectures proposed to date. The aim of the paper is to provide a comprehensive review of the data flow processor architectures that provide an alternative to the conventional Von-Neumann processor. The data flow architectures, even though promising, have failed to become commercially successful and have been limited to academic research. This paper analyses the drawbacks of the various previously proposed architectures, which have proved to be an impediment to the progress of the data flow architecture towards a mature technology. A few of the remaining challenges are highlighted, which may facilitate future research work in this particular area.

    The paper is organized as follows. First, we begin with a brief conceptual overview of the two programming models in the second section. In the third section, the data flow processors proposed in the literature are described in terms of their micro-architecture and execution model and are categorized accordingly. A comparative discussion of the most prominent features of these processors is presented in the fourth section. Finally, the last section concludes the work and identifies a few future directions in the design of data flow processors.

    2 Control Flow and Data Flow Architectures

    This section covers the preliminaries and basic concepts behind the two contrasting computing models.

    2.1 Control Flow Model

    The control flow model, commonly referred to as the Von Neumann architecture, is the most successful and commercially viable computing model prevalent to date. It mainly consists of a processing unit for computation, a control unit for loading the instructions, a memory system for storing data and instructions, and an input/output interface. One of the defining features of this architecture is the program counter (PC), a register used to transfer control between instructions. The PC enforces the sequence in which the program instructions are executed by holding the address of the instruction that is to be executed next. The PC gets the next address either by automatically incrementing the previous address or by being directed explicitly by means of a branch or jump instruction.
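
    To make the role of the program counter concrete, the following minimal sketch simulates a PC-driven fetch-execute loop. The three-operand instruction set, the register names and the example program are hypothetical, chosen only to illustrate strictly sequential control flow.

```python
# Minimal sketch of a program-counter-driven fetch-execute loop.
# Instruction format and register file are hypothetical.

def run(program, regs):
    pc = 0                          # program counter
    while pc < len(program):
        op, dst, a, b = program[pc]
        if op == "add":
            regs[dst] = regs[a] + regs[b]
            pc += 1                 # default: fall through to next address
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
            pc += 1
        elif op == "jmp":           # explicit transfer of control
            pc = dst
    return regs

# One instruction retires per step, even when instructions are independent.
regs = run([("add", "t1", "a", "b"),
            ("add", "t2", "c", "d"),
            ("mul", "r", "t1", "t2")],
           {"a": 1, "b": 2, "c": 3, "d": 4})
print(regs["r"])  # 21
```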

    The sequential execution of programs in this architecture leads to memory-ordering schemes that define the order in which memory operations occur, that is, the order in which they were fetched. All imperative programming languages are built with these memory semantics taken into consideration. As a consequence, all programs have to be written in a manner which enforces this sequential execution even if there is no inherent sequential-execution requirement in the program. This, along with other hazards, leads to underutilization of hardware resources and a throughput bottleneck, and is a major drawback of this architecture.

    Fig. 1 depicts the execution of a simple computation in a control flow manner. The program counter is incremented, and at each increment the corresponding computation is performed. Assuming all the initial operands are independent, this restricts parallelism, as the two addition operations could have been performed in parallel. To overcome this issue, several techniques are employed to exploit the parallelism in programs at different levels of granularity. These include Instruction-Level Parallelism (ILP), e.g., instruction pipelining and out-of-order execution; Thread-Level Parallelism (TLP), e.g., multi-threading; and Data-Level Parallelism (DLP).

    Figure 1: Execution of a simple computation in a control flow manner using a program counter

    2.2 Data Flow Model

    The data flow architecture provides a completely contrasting computing model to the conventional Von-Neumann architecture. It eliminates the need for a program counter, as instructions are no longer executed in a sequential manner. This architecture provides the means to exploit the inherent parallelism in a program. Execution in a data flow machine is driven by the availability of operands and execution resources. That is, an instruction 'fires' as soon as its operands arrive at a free execution node of the hardware. As such, if the operands corresponding to multiple instructions become available, those instructions can be executed in parallel. Programs are represented by means of a data flow graph (DFG) G(V, E), which consists of vertices and edges, where each vertex v ∈ V represents an instruction and each edge e ∈ E represents a data dependency between instructions. Data packets or values that propagate along the edges are commonly referred to as tokens.

    Taking the example defined in the previous section, the corresponding DFG can be represented as shown in Fig. 2. Each node of the DFG will fire as soon as its operands arrive. As such, the two addition operations would execute in parallel.

    Figure 2: Data flow graph representation of the computation
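
    To make the firing rule concrete, the following is a minimal sketch of token-driven execution for a DFG of this shape. Since the figures are not reproduced here, the operand values and the final multiply node are assumed for illustration.

```python
from collections import defaultdict

# Minimal sketch of the data flow firing rule: a node fires as soon as all
# of its input tokens have arrived; no program counter orders execution.
# The graph assumes the running example: two adds feeding a multiply.

OPS = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

def execute(nodes, edges, initial_tokens):
    # nodes: {name: (op, arity)}; edges: {producer: [(consumer, port)]}
    inputs = defaultdict(dict)          # consumer -> {port: value}
    ready = list(initial_tokens)        # (consumer, port, value) triples
    results = {}
    while ready:
        node, port, value = ready.pop()
        inputs[node][port] = value
        op, arity = nodes[node]
        if len(inputs[node]) == arity:  # all operands present: fire
            out = OPS[op](inputs[node][0], inputs[node][1])
            results[node] = out
            for consumer, p in edges.get(node, []):
                ready.append((consumer, p, out))   # propagate token
    return results

nodes = {"add1": ("add", 2), "add2": ("add", 2), "mul": ("mul", 2)}
edges = {"add1": [("mul", 0)], "add2": [("mul", 1)]}
tokens = [("add1", 0, 1), ("add1", 1, 2), ("add2", 0, 3), ("add2", 1, 4)]
print(execute(nodes, edges, tokens)["mul"])  # 21
```

    Note that both add nodes become ready independently of any instruction ordering; on a machine with two free execution nodes they would fire in parallel.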

    Data flow machines are traditionally classified as static and dynamic [6]. Static data flow allows only one token per arc; that is, only a single instance of an instruction can be executed at once [1,7–9]. The dynamic version, on the other hand, allows multiple tokens, belonging to different instances of an instruction, per arc, where each token is tagged with its instance number. Hence, it is also referred to as the tagged-token data flow machine [10–12].
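
    The difference can be illustrated with a small sketch of tag matching in a dynamic machine: every token carries an instance tag, and an instruction fires only when operands with the same tag have all arrived. The token format and matching-store layout below are illustrative, not any particular machine's actual implementation.

```python
# Sketch of tagged-token matching in a dynamic data flow machine: each
# token carries an instance tag, and an instruction fires only when all
# operands for the *same* tag have arrived.

waiting = {}   # matching store: (instruction, tag, port) -> value

def arrive(instr, tag, port, value, arity=2):
    waiting[(instr, tag, port)] = value
    operands = [waiting.get((instr, tag, p)) for p in range(arity)]
    if all(v is not None for v in operands):
        print(f"fire {instr} instance {tag} on {operands}")

# Two loop iterations in flight on the same arc, kept apart by their tags:
arrive("add", tag=0, port=0, value=1)
arrive("add", tag=1, port=0, value=10)   # different instance, no confusion
arrive("add", tag=1, port=1, value=20)   # instance 1 fires first
arrive("add", tag=0, port=1, value=2)    # then instance 0
```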

    3 Classification of Data Flow Architectures

    In this section, we aim to classify the recently proposed data flow processors. The classification is done in a hierarchical manner. First, we broadly classify the architectures into two groups based on the execution model, that is, whether they implement exclusive data flow execution or combine control flow and data flow techniques in a hybrid model. Next, within the hybrid group, we further classify the architectures based on whether the scheduling scheme implements data flow and control flow execution on a unified hardware substrate or statically offloads the data flow portion to a separate unit.

    3.1 Exclusively Data Flow Architectures

    The first category of data flow architectures consists of those which exclusively implement the data flow execution model. These are generally aimed at replacing the Von-Neumann architecture in general-purpose computing. The first such machine was proposed by Dennis and Misunas [1] in 1974, and in the following decades a number of similar machines were proposed [7,10,11,13]. However, none of these architectures could realize the true potential expected of a data flow machine, due to issues like memory programming difficulties, token-matching overhead, etc. More recently, a few new proposals have been made, and two of them are discussed in detail below.

    WAVESCALAR: WaveScalar is a dynamic data flow machine comprising a distributed set of PEs which are hierarchically organized into tiles. It was proposed by a team at the University of Washington in 2003 [14].

    Architecture Overview: WaveScalar programs are executed on a tile-based distributed network of processing elements called the WaveCache, as shown in Fig. 3. Each processing element implements a 5-stage execution pipeline consisting of the following stages: Input, Match, Dispatch, Execute, and Output. Two PEs are grouped together to form a pod, within which they communicate ALU results via a common bypass network. Four such pods form a domain, and four domains form a cluster. The cluster can be regarded as the basic building block of the WaveCache, as several clusters can be connected together in a 2-dimensional mesh network to form a scalable processing substrate. The PEs inside a domain communicate by means of a pipelined bus, while inter-domain communication occurs via a pseudo-PE called the NET pseudo-PE. A similar pseudo-PE, called the MEM pseudo-PE, is used as a gateway for memory operations in each domain. Inter-cluster communication is packet-based, and each cluster contains a network switch which routes messages from six ports: four for the North, East, South, and West directions, one for the store buffer and L1 data cache, and one shared among the NET PEs of the domains within the cluster.

    Figure 3: WaveCache architecture [14]. (a) Tile view, (b) Cluster view

    Each cluster has one store buffer, which is responsible for enforcing the memory-ordering scheme, called wave-ordered memory. It works by dividing the program into waves; within each wave, the memory access instructions are annotated to enforce the memory sequence. As such, the ordering works at two levels: coarse grain by means of wave numbers, and fine grain by means of instruction annotations [3]. From the programming perspective, each instruction has a dedicated PE; however, for practicality, a set of 64 instructions is dynamically assigned to each PE. As the working set of instructions changes, WaveScalar replaces the unused instructions with new ones.

    Execution Model: The key aspect of the WaveScalar ISA is that it supports conventional load/store semantics and hence can execute programs written in imperative languages. A program on WaveScalar is executed in the form of 'waves', which are acyclic, directed portions of the DFG. The compilation step includes conventional steps like parsing and optimization and, additionally, the transformations needed to make the graph suitable for execution on the WaveCache, including decomposing the graph into waves. Waves are similar to 'hyperblocks', and the data that traverses the waves is annotated with the corresponding wave number, which is incremented using 'WAVE-ADVANCE' instructions. This allows the machine to distinguish between different dynamic instances of the same instructions.

    To enforce memory ordering, WaveScalar annotates the memory instructions within a wave with sequence numbers, resulting in a chain of operations. For example, Fig. 4 shows a simple load/store sequence with the corresponding sequence numbers. The first element is the sequence number of the predecessor, the second is that of the instruction itself, and the last is that of the successor. The '.' symbol indicates that there is no predecessor or successor instruction, e.g., when the instruction is the first or last, respectively. The sequence numbers are assigned in increasing order, such that an instruction with a larger number executes after an instruction with a smaller sequence number. In cases where there are several predecessor or successor paths, a wildcard symbol '?' is employed.

    Figure 4: Simple wave-ordered annotations

    In order to ensure that there is a complete chain of memory operations along every path, a MEMORY-NOP instruction is inserted wherever a branch path contains no memory operations. Furthermore, independent memory operations are annotated with a ripple number to allow parallel execution. Along with wave-ordered memory, WaveScalar also provides support for an unordered memory access scheme to avoid any unnecessary ordering of memory instructions, and both can be used interchangeably within the same graph as well as within the same wave.
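
    The issue logic implied by these annotations can be sketched as follows: an operation may issue once its predecessor in the chain has completed, with '.' marking the absence of a predecessor. Wildcards and ripple numbers are omitted for brevity, so this is an illustrative simplification rather than the store buffer's actual algorithm.

```python
# Sketch of the issue logic implied by wave-ordered annotations
# <pred, seq, succ>: within a wave, an operation may issue once its
# predecessor has completed ('.' means no predecessor).

completed = set()
pending = []   # (pred, seq, succ, op) annotations awaiting issue

def try_issue():
    progress = True
    while progress:
        progress = False
        for ann in list(pending):
            pred, seq, succ, op = ann
            if pred == "." or pred in completed:
                print(f"issue {op} <{pred},{seq},{succ}>")
                completed.add(seq)
                pending.remove(ann)
                progress = True

# Operations may arrive at the store buffer out of order:
pending += [(1, 2, ".", "load C"), (".", 0, 1, "load A"), (0, 1, 2, "store B")]
try_issue()   # issues load A, store B, load C in sequence order
```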

    3.2 Hybrid Data Flow/Control-Flow Architectures

    As mentioned in the earlier sections, both Von-Neumann and exclusively data flow machines have their strengths and drawbacks. While the former is simple to implement and well suited to sequential applications, the latter is useful for exploiting maximum parallelism. As such, to combine the benefits of the two approaches, several proposals were made which employ a combination of both models [4,5,15–17]. Some of these employ control flow execution between 'execution blocks' and data flow execution within the blocks, e.g., the Tera-op, Reliable, Intelligently adaptive Processing System (TRIPS) [4], the Dynamically Specialized Execution Resource (DySER) [15] and Tartan [16]. Other architectures schedule the blocks in a data-driven manner, while the instructions within a block are scheduled in a control flow manner, e.g., MT-Monsoon [17] and Task Superscalar [5]. Based on the separation of the two execution models in hardware, we further divide these into two categories as follows.

    3.2.1 Conjoint Architectures

    In a conjoint architecture, a program is scheduled using both data flow and control flow scheduling on a single execution substrate, as both models are inherent to the architecture. This means that there is no scope for executing an application which could benefit from using either control flow or data flow exclusively.

    TRIPS: TRIPS is a dynamic, tile-based data flow architecture which implements the Explicit Data Graph Execution (EDGE) ISA. It was proposed at the University of Texas at Austin in 2003 [4].

    Architecture Overview: The TRIPS processor chip consists of three main components: the processor cores, the integrated L2 cache organized into a number of tiled banks (M tiles), and a lightweight routing network (N tiles), as shown in Fig. 5. Within each processor core is a 4×4 network of execution nodes (ETs), each of which is a single-issue ALU tile consisting of integer and floating-point execution units, operand and instruction buffers, and a router that enables communication with the neighboring ETs by means of a lightweight network referred to as 'micronets'. Further, each core contains four register files (RTs) at the top, along with four Data (DT) and four Instruction (IT) cache banks. The tiles communicate with the L2 cache by means of four ports. The Global Tile (GT) contains the I-cache tags, block header state and branch predictor, and is responsible for managing block execution. The compiler delineates blocks of 128 instructions, which are mapped in groups of 8 onto each of the execution nodes.

    Figure 5: TRIPS architecture [4]

    Execution Model: The TRIPS compiler partitions the program into 'hyperblocks', where each block consists of up to 128 instructions and behaves as a 'megainstruction'. These blocks are scheduled in a control flow manner, while the instructions within a block are executed in a data-driven manner with direct communication between instructions. Hence, TRIPS can support imperative languages without much modification.

    The blocks are statically mapped onto the computation substrate such that the dependencies between instructions are explicitly expressed. Within a block, each instruction sends its result to its consumer instructions, and as soon as all the operands of an instruction arrive, it fires. As such, each TRIPS instruction specifies only the target locations of its results, which are statically determined by the compiler.
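
    A minimal sketch of this target-based, EDGE-style communication is shown below: each instruction slot names the operand slots of its consumers rather than source registers, and a slot fires once its operand buffers are full. The block contents and the encoding are hypothetical simplifications.

```python
# Sketch of EDGE-style direct instruction communication in a block:
# instructions statically name their consumers' operand slots, so
# results move point-to-point instead of through a register file.

ops = {"mov": lambda x: x, "add": lambda x, y: x + y}
# slot -> (opcode, [(target_slot, target_operand), ...])
block = {0: ("mov", [(2, 0)]), 1: ("mov", [(2, 1)]), 2: ("add", [])}
buffers = {slot: {} for slot in block}

def deliver(slot, operand, value):
    buffers[slot][operand] = value
    opcode, targets = block[slot]
    needed = 1 if opcode == "mov" else 2
    if len(buffers[slot]) == needed:       # operands present: fire
        result = ops[opcode](*[buffers[slot][i] for i in range(needed)])
        print(f"slot {slot} ({opcode}) -> {result}")
        for t_slot, t_op in targets:
            deliver(t_slot, t_op, result)  # forward along static target

deliver(0, 0, 5)   # register reads inject values into the block
deliver(1, 0, 7)   # slot 2 fires only after both operands arrive -> 12
```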

    To exploit more ILP, TRIPS provides support for up to 8 blocks of instructions executing simultaneously. That is, the G-tile can predict the next block of instructions while it fetches and maps a block onto the execution array. The GT fetches blocks by using its branch predictor to obtain the predicted block address, and then checks the I-cache tags. On a hit, the block address is broadcast to the I-cache banks, each of which streams the block's instructions into the corresponding rows of the execution array.

    3.2.2 Accelerator Style Architectures

    In data flow accelerator-style architectures, only a portion of the application is executed in a data flow fashion. The decision on which portion to accelerate is mostly static, and is taken either by the programmer by means of application profiling or by the compiler. Furthermore, in some cases the entire application may be executed without the data flow accelerator. We discuss two examples of such architectures in the following sub-sections.

    DySER: The Dynamically Specialized Execution Resource (DySER) architecture aims to combine both parallelism and functional specialization. It was proposed in 2012 at the University of Wisconsin–Madison [15].

    Architecture Overview: DySER works in conjunction with a general-purpose processor and is designed to be integrated into the execution stage of the pipeline, as shown in Fig. 6. It consists of a 2-dimensional array of heterogeneous Functional Units (FUs). Each FU is connected to four simple switches (S), which form a circuit-switched network, delivering data and control into and out of the FUs through configurable data-paths. Each FU is configured by means of a configuration register to perform a specific function and to read its inputs from a specific switch. Pipelining is implemented by means of credit-based flow control: a forward 'valid' signal indicates the validity of the data at each FU, and a backward 'credit' signal asserts the ability to accept new data. For this purpose, the FUs also include data and status registers. The configuration of DySER takes 64 cycles, and once configured it can be used multiple times for a given application phase.
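
    The valid/credit handshake can be sketched as follows; the buffer depth and interface names are assumptions, since the exact signalling details are not reproduced here.

```python
# Minimal sketch of credit-based flow control: a producer may send only
# while it holds credits; each accepted word consumes a credit, and the
# consumer returns a credit when it frees a buffer slot.

class Link:
    def __init__(self, depth=2):
        self.credits = depth   # one credit per downstream buffer slot
        self.fifo = []

    def send(self, data):      # forward 'valid' path
        if self.credits == 0:
            return False       # backpressure: producer must stall
        self.credits -= 1
        self.fifo.append(data)
        return True

    def consume(self):         # downstream FU accepts a word
        data = self.fifo.pop(0)
        self.credits += 1      # backward 'credit' signal
        return data

link = Link(depth=2)
print(link.send(1), link.send(2), link.send(3))  # True True False (stall)
link.consume()                                   # buffer slot freed
print(link.send(3))                              # True
```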

    Communication with the host processor is enabled via a set of named I/O ports which correspond to FIFO buffers that deliver data into and out of the switches. The RISC ISA is extended with five instructions for DySER configuration and for the communication of register and memory data between the host and the accelerator.

    Execution Model: The DySER compiler profiles the application to extract the commonly reused 'accelerable' portions of the program and explicitly partitions the program into phases. The assumption is that, for a given program phase, only a few data flow blocks are active, and these are invoked multiple times. The DySER block is then configured to execute the DFG before it is encountered. Register values are then sent to the block, or data is loaded directly from memory, for each instance of the graph. The portion of the graph consisting of memory accesses is called the 'invocation slice' and the remaining portion, with the computation operations, is called the 'computation slice'. This separation enables the usual memory optimizations to be implemented without hurdles. As the data arrives at DySER, it is routed through the block according to the determined configuration in a data flow manner. DySER can also speculatively invoke the next instance of the configuration and pipeline their execution.

    Figure 6: DySER architecture and integration into the execution pipeline [15]

    PLUG: The Pipeline Look-Up Grid (PLUG) is an application-specific data flow accelerator designed for optimally performing data-structure lookups in a network processor. It was proposed at the University of Wisconsin-Madison in 2010 [16].

    Architecture Overview: As with other tile-based architectures, a PLUG tile consists of three kinds of resources: processing cores (μCores), SRAM blocks and routers. Fig. 7 shows a typical PLUG tile, consisting of 32 cores (red), 4 memory blocks (green) and 6 routers (blue). These resources can be configured to form virtual tiles consisting of a subset of all the available resources. As shown in Fig. 7, the tile can be abstracted into 3 virtual tiles. This enables mapping code-blocks with different computing, memory or routing requirements onto a single physical tile, which improves utilization and helps to reduce scheduling losses. The complexity in wiring, and the associated overhead of configuring N cores, M memories and R routers, is simplified by enforcing certain simple rules in the programming model, which result in a set of four buses driven by tri-state buffers.

    The on-chip network is very simple and requires no buffering or flow control. Again, contention-free routing is made possible by making use of certain code-generation rules. A restricted multicast is employed, whereby the message is delivered to all the nodes appearing on the path to the final destination, and the compiler makes sure that all the multicast targets fall on the route when a multicast is required.

    The μCores are simple 16-bit in-order processors which execute one thread each and share the SRAM and routers. Memory access conflicts are avoided as a result of static instruction scheduling, which guarantees that only one core will access a memory block at a time. Memory accesses can be done in variable word sizes of 2, 3, 4, 6, 8, 12, 14 and 16 bytes. The PLUG ISA also provides another specialization of the RISC ISA, whereby it includes additional formats for bit manipulation and for enabling on-chip network communication.

    Figure 7: PLUG architecture [16]. (a) Tile organization, (b) Single tile partitioned into three virtual tiles

    Execution Model: The PLUG architecture works on the concept of transforming data-structure lookups into a structured pipeline of memory and computation operations. It exploits the inherent structure of the lookup data-structure by mapping the data-structure onto on-chip tiled storage. The parts of the data-structure representing one logical level are grouped to form a large 'logical page', with portions of the code, called code-blocks, associated with each logical page. These code-blocks are agnostic to each other and perform memory accesses independently, within the scope of their own logical pages. Furthermore, each code-block determines the logical page that should be looked up next, which results in a network message that, in turn, triggers the execution of another code-block. These logical pages are mapped to physical resources after being explicitly partitioned to match the storage space of the given tile. As such, the code-blocks associated with the logical pages represent the nodes of the DFG, and program execution is data flow driven by messages sent from tile to tile.
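
    A minimal sketch of this message-driven execution style, using a hypothetical two-level lookup table, is given below; the page contents and the key format are invented for illustration.

```python
# Sketch of PLUG-style lookup execution: each logical page's code-block
# performs its own memory access and then sends a message naming the
# next logical page to visit. The lookup table here is hypothetical.

pages = {   # logical page -> its slice of the data structure
    "root":   {"10.0": "leaf_a", "10.1": "leaf_b"},
    "leaf_a": {"10.0.3": "port 7"},
    "leaf_b": {"10.1.9": "port 2"},
}

def code_block(page, key, depth):
    # Each code-block only touches its own logical page's memory.
    entry = pages[page].get(".".join(key.split(".")[: depth + 2]))
    if entry in pages:                     # interior node: forward lookup
        send_message(entry, key, depth + 1)
    else:
        print(f"{key} -> {entry}")         # reached a leaf: emit result

def send_message(page, key, depth):        # models the on-chip network hop
    code_block(page, key, depth)

send_message("root", "10.0.3", 0)          # 10.0.3 -> port 7
```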

    Besides generating the assembly code for the code-blocks, the compiler is also responsible for partitioning the logical pages into smaller physical pages, which can then be mapped onto specific tiles. The logical pages are generally lists of data-blocks, and the compiler simply divides these into equal-sized chunks small enough to fit in a single memory. To map the DFG onto the PLUG chip, a greedy scheduling algorithm is employed which performs a breadth-first traversal of the DFG and assigns the nodes to tiles ranked by their distance from a reference point (the input port). Moreover, the compiler also has the responsibility of assigning the DFG arcs to the on-chip network, for which a graph-coloring approach is employed.
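
    The greedy placement step can be sketched as follows, under the assumption that tiles are ranked purely by distance from the input port and that there are enough tiles for all code-blocks; tie-breaking and capacity handling are omitted.

```python
from collections import deque

# Sketch of the greedy mapping described above: code-blocks are visited
# in breadth-first order and each is placed on the nearest free tile.
# The distance metric and the example graph are illustrative.

def schedule(dfg, root, tiles):
    # dfg: {block: [successor blocks]}; tiles: [(distance, tile_id)]
    ranked = sorted(tiles)                    # nearest to input port first
    placement, seen, q = {}, {root}, deque([root])
    while q:
        block = q.popleft()
        placement[block] = ranked.pop(0)[1]   # closest remaining tile
        for succ in dfg.get(block, []):
            if succ not in seen:
                seen.add(succ)
                q.append(succ)
    return placement

dfg = {"L0": ["L1a", "L1b"], "L1a": ["L2"], "L1b": ["L2"], "L2": []}
tiles = [(0, "T0"), (1, "T1"), (1, "T2"), (2, "T3")]
print(schedule(dfg, "L0", tiles))
# {'L0': 'T0', 'L1a': 'T1', 'L1b': 'T2', 'L2': 'T3'}
```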

    4 Comparison and Discussion

    In this section we compare the above-mentioned architectures and broadly discuss their advantages and drawbacks. Tab. 1 summarizes the main features of these architectures. Features which present advantages are marked with (+) and those that are disadvantageous are marked with (-). The advantages and disadvantages are discussed in the text.

    Table 1: Comparison of the main features of the discussed architectures

    4.1 Exclusively Data Flow Architectures

    WaveScalar is the most prominent architecture belonging to this class. Another, more recent FPGA-based architecture, called the Data Flow Soft Core (DFSC), was proposed in 2016 [18]. While WaveScalar is aimed at being a general-purpose processor, the DFSC is designed for accelerating scientific computations. The first architectural difference between the two lies in the organization of the compute and routing resources. While WaveScalar groups the PEs in a hierarchical manner with different interconnection methods at different levels, the DFSC organizes them in a flat manner with a custom crossbar. Although the flat organization is simpler to implement and manage, the hierarchical organization provides advantages in terms of communication latency, as the PEs in a pod can snoop on each other's ALU results. With the right placement algorithm employed, frequently communicating instructions are placed as close to each other as possible, which avoids the long latency of a 'mesh' interconnect. Furthermore, the packet-based inter-cluster network is simplified by organizing the pods into domains, making it possible to work at a higher level of abstraction, without caring about individual PEs or pods.

    The tiled structure of WaveScalar enables building a large computation substrate with up to 2000 PEs by appending together tiles, which are easy to replicate. The DFSC, on the other hand, relies on the sufficient availability of compute and interconnection resources on the FPGA, which, however, makes it relatively easy to implement.

    Support for imperative languages is the major advantage of the WaveScalar architecture, as it eliminates the reliance on functional programming languages found in many earlier data flow processors. However, the WaveScalar literature does not cover the process of loading and terminating DFGs in much detail, which is possibly the main drawback of this architecture. Furthermore, the scheduling algorithm it employs has only been evaluated experimentally, and its optimality has not been proved formally [19].

    4.2 Conjoint Data Flow Architectures

    TRIPS effectively redefined the concept of data flow computing by combining data flow scheduling with control flow. By employing control flow for inter-block scheduling, it provides support for imperative languages, which is a major advantage. Along with exploiting ILP, it can also employ loop unrolling or multithreading to exploit DLP or TLP, respectively.

    The concept of 'megainstructions' enables amortization of the sequential semantics over blocks of 128 instructions. Moreover, the functional units in the substrate can be configured according to the desired application, providing much flexibility to the processor.

    A major drawback that has been identified with TRIPS is the placement of instructions: the scheduler is unable to map the instructions optimally so as to minimize communication latencies and contention [17].

    4.3 Accelerator Style Architectures

    The accelerator-style data flow processors currently seem to be the most promising solutions. That is because they are mostly aimed at applications which contain frequently recurring portions of computation with few data dependencies, for example signal processing algorithms, where the same computation is performed on an array of values. Von-Neumann architectures employ techniques like vectorization to implement such algorithms, but given the availability of multiple execution resources in typical tiled substrates, data flow computing could potentially exploit the parallelism in these applications better. For instance, DySER was shown to outperform SIMD (Single Instruction Multiple Data) execution for several benchmark applications such as convolution and volume rendering [15].

    The PLUG architecture, on the other hand, is application-specific and has been shown to perform competitively with conventional specialized designs, even outperforming some in terms of power efficiency, while providing more flexibility and programmability [16]. The power efficiency can be attributed to the lightweight design of the computation and communication resources, in addition to the efficient exploitation of parallelism using the data flow programming model.

    Most of the architectures in this category depend on the static determination of accelerable portions of the code by profiling, and hence the configuration of the accelerators is determined statically. The limited run-time adaptability and the requirement for profiling information are major drawbacks of the architectures in this class. Furthermore, there are parallel applications with a high communication-to-computation ratio, which limits the achievable speedups.

    5 Conclusions and Future Work

    In this paper, a survey of the recent data flow architectures was presented. These were classified on the basis of the execution model that the architectures adopt and, in the case of hybrid architectures, the separation of the two models. Furthermore, a few advantages and drawbacks of the architectures belonging to each of these classes were discussed. This led to a number of questions that could be of interest and addressed in future work. Firstly, the scheduling algorithms corresponding to each of the presented architectures could not be comprehensively understood, and as such several questions remain unanswered: (a) how the algorithms ensure maximum utilization of each execution unit, (b) the efficiency of the implemented pipelines, and (c) the process of termination and replacement of DFGs. The Integer Linear Programming based generic scheduling framework [20] addresses the first two questions by applying certain constraints; however, it generates around 20–25 constraints, and as such the feasibility of such a framework needs to be evaluated. Furthermore, the interplay of control cores and accelerators could not be considered in much detail, leaving the picture incomplete. Lastly, the support for imperative languages and memory-ordering schemes was covered in detail only for the WaveScalar architecture and needs more work for the remaining architectures.

    There have been considerable advancements in data flow computing architectures over the years, and some prototypes, particularly hybrid ones, have shown promising results. However, there are still a few challenges that have been identified which need further research before the data flow computing paradigm becomes a truly competitive alternative to Von-Neumann models in general-purpose processors. The first major issue is the handling of data structures, e.g., arrays. Since data flow processors work with 'tokens', which are scalar values, the handling of a collection of tokens, or a data structure, poses serious challenges. The other major challenge being tackled is that of optimal program allocation. The optimal selection of code-block granularity and the partitioning of programs into code-blocks, so as to maximize parallelism and minimize communication costs between the DFG nodes, is an essential aspect which will considerably affect the success of data flow machines and is being extensively researched. Lastly, optimally limiting the unrolling of loops to reduce resource requirements, and the handling of dynamic parallelism, are other issues that are being addressed [5–21].

    Funding Statement: The authors received no specific funding for this study.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
