
A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs

Computers, Materials & Continua, 2021, Issue 1

Yunping Zhao, Jianzhuang Lu and Xiaowen Chen

National University of Defence Technology, Changsha, China

Abstract: Convolutional Neural Networks (CNNs) are widely used in many fields. However, due to their high-throughput and computation-intensive characteristics, an increasing number of researchers are focusing on how to improve the computational efficiency, hardware utilization, or flexibility of CNN hardware accelerators. Accordingly, this paper proposes a dynamically reconfigurable accelerator architecture that implements a Sparse-Winograd F(2 × 2, 3 × 3)-based high-parallelism hardware architecture. This approach not only eliminates the pre-calculation complexity associated with the Winograd algorithm, thereby reducing the difficulty of hardware implementation, but also greatly improves the flexibility of the hardware; as a result, the accelerator can realize the calculation of Conventional Convolution, Grouped Convolution (GCONV) or Depthwise Separable Convolution (DSC) using the same hardware architecture. Our experimental results show that the accelerator achieves a 3x–4.14x speedup on VGG-16 and MobileNet V1 compared with designs that do not use the acceleration algorithm. Moreover, compared with previous designs using the traditional Winograd algorithm, the accelerator achieves a 1.4x–1.8x speedup. At the same time, the efficiency of the multiplier improves by up to 142%.

Keywords: High performance computing; accelerator architecture; hardware

    1 Introduction

At present, CNNs are widely used in various fields, impacting many areas of people's lives, such as image classification [1], target recognition [2,3], and semantic segmentation [4,5]. However, with the development of CNNs, networks have become deeper and deeper, with some even reaching tens to hundreds of layers [6,7], introducing a huge computing and storage workload to the system. At the same time, due to the emergence of new convolution methods such as DSC and GCONV, the flexibility of CNN accelerators has become more and more important.

Many solutions have been proposed to solve these problems, including the use of Graphics Processing Units (GPUs) [8], Application-Specific Integrated Circuits (ASICs) [9–12], Field-Programmable Gate Arrays (FPGAs) [13–15] and other hardware to accelerate CNNs. However, GPUs suffer from high power consumption, while ASICs are affected by high production costs and long development cycles, which greatly limits their scope of application.

As face recognition, motion tracking and other applications have become widely used in both mobile and embedded devices, the demand for CNNs in these devices is also increasing. Because embedded and mobile devices have more limited hardware resources and stricter power consumption requirements, lightweight CNNs such as MobileNet [16,17], ShuffleNet [18] and Xception [19] have been proposed one after another. However, due to the high computing and high throughput characteristics of CNNs, running even lightweight CNNs poses huge challenges in terms of on-chip computing performance, storage, bandwidth, and power consumption. At the same time, different CNNs or different convolutional layers usually have different convolution shapes, which also presents challenges in ensuring the flexibility of hardware accelerators. As a result, hardware architecture designers must improve data reusability and computing performance, reduce system operating power consumption, or reduce bandwidth requirements. Moreover, such hardware must be dynamically reconfigurable if it is to meet the system's flexibility requirements when adapting to different lightweight CNNs.

This paper proposes a fast decomposition method based on the Sparse-Winograd algorithm (FDWA); this approach extends the applicability of the Winograd algorithm to various convolution shapes and further reduces the computational complexity. Based on a unified hardware implementation, it provides a high level of configurability, realizing fast switching between different configurations for different convolution shapes at runtime. Generally speaking, the main advantages of this architecture are as follows:

Using the FDWA not only reduces the number of multiplications and shortens the loop iterations of the convolution, but also introduces dynamic sparseness into the network; this can reduce the hardware workload, improve the speed of convolution execution, expand the scope of use of the Sparse-Winograd algorithm, and simplify the hardware implementation. Moreover, compared with conventional accelerator designs, the proposed hardware architecture also enhances the adaptability and flexibility of the accelerator, meaning that the accelerator can accelerate both Conventional Convolution and the new types of convolution (such as DSC or GCONV) without the need to change the hardware architecture.

This paper proposes a high-throughput 2D computing architecture with reconfigurable I/O parallelism, which improves PE Array utilization. A line buffer storage structure based on double buffers is used to reduce the data movement overhead between all levels of storage and to hide the data movement time, such that the computing performance of the entire system runs at the peak computing speed of the PE Array.

The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 briefly introduces the Winograd algorithm and outlines the operation of the FDWA in detail. Section 4 details the architecture of the accelerator, while Section 5 provides the implementation results and comparison. Finally, Section 6 presents the conclusion.

    2 Related Work and Motivation

    2.1 Related Work

Previous studies have proposed a wide range of CNN accelerator architectures. The DianNao [20,21] series achieved good results by improving parallel computing performance, enhancing the computing arrays, optimizing the data flow patterns, and increasing data reuse rates. Eyeriss [22] reduces data movement by maximizing local data reuse, applying data compression, and implementing data gating, thereby achieving high energy efficiency. The Efficient Inference Engine (EIE) [23] introduces a sparse network to speed up the operation of the fully connected layer. Sparse CNN (SCNN) [24] accelerates convolution calculation under sparse network conditions by improving the coding and hardware structure. Zhang et al. [25] improve the convolution calculation by mapping the Fast Fourier Transform (FFT) algorithm; for their part, Shen et al. [26] improve the computing speed of both 2D and 3D convolution by mapping the Winograd algorithm, thereby achieving good results on FPGA. The work of Wang et al. [27], based on the Fast Finite Impulse Response Algorithm (FFA), reduces the number of multiplication operations in the convolution operation; however, the flexibility of this design is so limited that it can only accelerate a fixed type of convolution. Lu et al. [28] designed an accelerator based on the fast Winograd algorithm, but primarily used F(6,3) and F(7,3) as the basic calculation units. Due to the large number of parameters, the hardware accelerator in this case is required to use 16-bit fixed-point operands, along with more complex pre-calculations. Even so, very few accelerator designs target lightweight neural networks, although a handful of designs aim at lightweight CNN acceleration or at increasing accelerator flexibility.

In this article, in order to speed up operation, save on hardware resources, and improve calculation efficiency, we designed the FDWA as the basis of a hardware design built on the Sparse-Winograd algorithm. Moreover, we test the algorithm's acceleration effect and hardware resource consumption via SystemC [29].

    2.2 Motivation

Firstly, it is feasible in practical terms to accelerate CNNs at the algorithm level. Various advanced convolution algorithms have therefore been applied to accelerator design. Taking the Winograd, FFT, and FFA algorithms as examples, these approaches reduce the number of multiplication operations by reducing the computational complexity, thereby further saving resources. Compared with the FFT and FFA algorithms, the Winograd algorithm achieves better results when the convolution kernel is smaller [8]. In this paper, our approach is to decompose different types of convolutions and use a fixed Sparse-Winograd F(2×2, 3×3) for calculation; this not only simplifies the implementation, but also expands the application range of the Winograd algorithm.

Secondly, the pace of CNN evolution has accelerated of late, and there are usually great differences between the different convolutional layers in CNNs, meaning that higher requirements are placed on the flexibility of the accelerator. However, most studies based on algorithm acceleration typically support only a specific and fixed convolution type. At the same time, lightweight CNNs such as MobileNet and ShuffleNet have seen widespread use on mobile devices and embedded platforms, reducing both the number of parameters and the computational complexity; one example is MobileNet V1 [16], which achieves acceptable results with less than 50% of the MAC operations and parameters used by comparison methods. This was primarily made possible by the use of new convolution methods in the MobileNet V1 structure, including Grouped Convolution (GCONV) and Depthwise Separable Convolution (DSC). Accordingly, there is an urgent need to solve the acceleration problem of these new convolution methods when designing hardware accelerators. This paper therefore proposes a hardware architecture design for accelerating various types of CNNs based on the FDWA; this approach solves the algorithm flexibility problem, thereby facilitating adaptation to different convolution types, and also solves the problem of speeding up the new types of convolution.

    3 Algorithm Analysis,Mapping and Improvement

At present, there have been few studies on hardware accelerators for lightweight CNNs. While Bai et al. [30] and Zhao et al. [31] realized the acceleration of convolution calculations for lightweight CNNs on FPGA, the main drawback of these works is that the overall performance and flexibility remain insufficient. In light of this, the present paper proposes a block parallel computing method based on the Sparse-Winograd algorithm and a reconfigurable decomposition method for lightweight CNNs, which solves the problems associated with accelerating different convolution types.

    3.1 GCONV and DSC

CNNs are primarily composed of convolutional layers, Rectified Linear Units (ReLUs), pooling layers, fully connected layers, etc. Of these components, the convolutional layer makes the major contribution to CNN operation; however, it also takes up most of the calculation time.

Following the emergence of lightweight CNNs, new convolution methods have been widely used due to their outstanding advantages. DSC was first proposed in MobileNet V1, while GCONV first appeared in AlexNet. As shown in Tab. 1 [16], MobileNet achieves image recognition accuracy that is almost as high as, or even higher than, that of comparison methods, while still using fewer parameters and fewer calculations; this is achieved by means of DSC, which demonstrates its outstanding advantage on embedded platforms and mobile devices.

Table 1: Comparison of MobileNet with popular models

Fig. 1 presents the execution processes of Conventional Convolution, GCONV and DSC. In particular, Fig. 1b describes the specific process of GCONV. This process involves dividing the input feature map into g groups according to the number of channels, such that the size of each input feature map group becomes M×H×W/g, while the size of the convolution kernel becomes M×K²/g. During the execution process, each filter only performs convolution operations with input feature maps from the same group; in this way, the number of parameters required during the convolution operation is reduced. Fig. 1c further describes the specific process of DSC. This process involves dividing the convolution into two sub-processes: Depthwise Convolution and Pointwise Convolution. Firstly, each channel of the input feature map is independently convolved via channel-by-channel convolution with a convolution kernel of size M×(K×K×1), after which the channels are combined in depth by point-by-point convolution; this weighted combination generates the output feature map, where the pointwise convolution kernel size is N×(1×1×M).
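To make the two processes above concrete, here is a minimal NumPy sketch of GCONV and DSC on a single feature map; the function names and the simple nested-loop form are purely illustrative, not the accelerator's data path.

```python
import numpy as np

def grouped_conv(x, w, g):
    """GCONV: x is (M, H, W), w is (N, M // g, K, K); channels are split into
    g groups and each filter only convolves with the channels of its own group."""
    M, H, W = x.shape
    N, Mg, K, _ = w.shape
    out = np.zeros((N, H - K + 1, W - K + 1))
    for n in range(N):
        grp = n // (N // g)                      # group this filter belongs to
        xg = x[grp * Mg:(grp + 1) * Mg]          # the M/g channels of that group
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[n, i, j] = np.sum(xg[:, i:i + K, j:j + K] * w[n])
    return out

def depthwise_separable_conv(x, w_dw, w_pw):
    """DSC: depthwise (per-channel K x K) convolution followed by pointwise (1 x 1)
    convolution; x is (M, H, W), w_dw is (M, K, K), w_pw is (N, M)."""
    M, H, W = x.shape
    K = w_dw.shape[-1]
    dw = np.zeros((M, H - K + 1, W - K + 1))
    for m in range(M):                           # channel-by-channel convolution
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                dw[m, i, j] = np.sum(x[m, i:i + K, j:j + K] * w_dw[m])
    return np.einsum('nm,mhw->nhw', w_pw, dw)    # 1x1 weighted combination of channels
```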

Tab. 2 specifically illustrates the difference in the number of parameters and the total amount of calculation for the three different convolution methods. Compared with Conventional Convolution, the new convolution methods require fewer resources and less external memory bandwidth to achieve tolerable accuracy. Therefore, it is urgently necessary to develop a hardware architecture design that can handle both Conventional Convolution and the new types of convolution.

Figure 1: Conventional Convolution, GCONV and DSC

Table 2: Parameter and calculation amount of different convolution methods
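Assuming the standard weight-count formulas for the three methods (which Tab. 2 presumably tabulates), a quick sketch of the comparison:

```python
def param_counts(M, N, K, g):
    """Weight counts (biases ignored) for the three convolution types."""
    conventional = N * M * K * K        # every filter sees all M input channels
    gconv = N * (M // g) * K * K        # each filter sees only M/g channels
    dsc = M * K * K + N * M             # depthwise K x K plus pointwise 1 x 1
    return conventional, gconv, dsc

# e.g. M = 64 input channels, N = 128 filters, K = 3, g = 4 groups
print(param_counts(64, 128, 3, 4))      # -> (73728, 18432, 8768)
```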

    3.2 Sparse-Winograd Algorithm

The Winograd algorithm is a fast algorithm for CNNs, which reduces the number of multiplication operations by applying a series of transformations to the input feature map and the filter. Taking the one-dimensional Winograd algorithm as an example, the specific operation process can be expressed as follows:

In Eq. (1), ⊙ indicates the multiplication between corresponding elements. Moreover, A, B, and G are constant transformation coefficient matrices, which are determined by the size of the input data.

Similarly, the operation process of the 2D Winograd algorithm F(m×m, k×k) can be described by Eq. (2). In summary, the input feature matrix X and the filter matrix W are converted into the Winograd domain by the matrices B and G. After the dot product operation is performed, the matrix A is used to transform the result back. The transformation matrices B, G, and A are determined by the size of the input feature map and the size of the filter.
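The equation images are not reproduced here; the standard Winograd forms matching the description above (A, B, G constant transforms, ⊙ element-wise multiplication), which Eqs. (1) and (2) presumably denote, are:

```latex
% Presumed form of Eq. (1): 1D Winograd convolution of filter w with input tile d
Y = A^{T}\left[ (G\,w) \odot (B^{T} d) \right]

% Presumed form of Eq. (2): 2D Winograd convolution F(m \times m, k \times k)
Y = A^{T}\left[ (G\,W\,G^{T}) \odot (B^{T} X\,B) \right] A
```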

It can be concluded from Eq. (2) that the 2D Winograd algorithm only needs to execute (m+k-1)² multiplication operations, while m²×k² operations are required for Conventional Convolution. Fig. 2 [28] describes the calculation process and pseudocode implementation of convolution based on the 2D Winograd algorithm and of Conventional 2D convolution; in this case, the convolution stride of the Winograd convolution algorithm is 1. At the algorithm level, it can be seen that the Conventional Convolution method requires six layers of loops, while the Winograd algorithm-based convolution only requires four layers of loops and reduces the amount of necessary multiplication by more than 60% [28].

Figure 2: Comparison of conventional and Winograd convolution algorithms
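As a concrete check of the F(2×2, 3×3) case used throughout this paper, the following NumPy sketch applies the commonly used transformation matrices from Lavin and Gray (assumed here to be the ones the paper adopts) and verifies the result against direct 2D correlation on one tile:

```python
import numpy as np

# Standard F(2x2, 3x3) transforms (Lavin & Gray); all entries are 0, +/-1 or +/-1/2
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(x, w):
    """x: 4x4 input tile, w: 3x3 filter -> 2x2 output (stride 1, correlation)."""
    U = G @ w @ G.T                  # filter in the Winograd domain (4x4)
    V = B_T @ x @ B_T.T              # input tile in the Winograd domain (4x4)
    return A_T @ (U * V) @ A_T.T     # 16 element-wise multiplies instead of 36

def direct_f2x2_3x3(x, w):
    """Reference: direct correlation producing the same 2x2 output tile."""
    return np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(2)]
                     for i in range(2)])

x, w = np.random.rand(4, 4), np.random.rand(3, 3)
assert np.allclose(winograd_f2x2_3x3(x, w), direct_f2x2_3x3(x, w))
```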

Through further research into the Winograd algorithm, Liu et al. [32] proposed the Sparse-Winograd algorithm, based on the original algorithmic structure. As shown in Fig. 3 [32], there are different methods of applying the Winograd algorithm to a sparse network. The linear transformation of neurons and convolution kernels reduces the sparsity of the original network. Therefore, if the Winograd algorithm is used to directly accelerate the sparse network, the number of multiplication operands in the network will increase, such that the ensuing performance loss outweighs the benefit of using the Winograd acceleration algorithm.

Figure 3: The Winograd algorithm applied to different sparse network situations

By moving the ReLU and Prune operations into the Winograd domain, the number of multiplication operations can be reduced and the overall efficiency can be further improved when the Winograd algorithm is used for acceleration; in this way, the acceleration of a sparse network based on the Winograd algorithm can be realized.
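An illustrative sketch of this idea, reusing B_T, G and A_T from the sketch above: pruning is applied to the transformed weights and ReLU to the transformed activations, so the zeros appear exactly where the element-wise product is evaluated (the threshold value here is arbitrary):

```python
def sparse_winograd_f2x2_3x3(x, w, prune_threshold=0.1):
    """Illustrative Sparse-Winograd tile: pruning and ReLU act in the Winograd
    domain, directly on the two operands of the element-wise product."""
    U = G @ w @ G.T
    U[np.abs(U) < prune_threshold] = 0.0    # prune the transformed weights
    V = np.maximum(B_T @ x @ B_T.T, 0.0)    # ReLU on the transformed activations
    P = U * V                               # only non-zero positions need a multiplier
    return A_T @ P @ A_T.T, np.count_nonzero(P) / P.size
```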

    3.3 Sparse-Winograd Algorithm Mapping and Improvement

Although convolution operations based on the Sparse-Winograd algorithm can reduce the number of required multiplication operations and improve operational efficiency, for different types of convolutions the operation process, while essentially the same as that of the Winograd algorithm, requires more complex pre-operations, primarily regarding the calculation of the conversion matrices A, B and G. Moreover, the calculation of the conversion matrices is difficult to realize in hardware; in addition, as the size of the input feature map or the convolution kernel increases, the data range of the parameters in the conversion matrices becomes very large, which presents challenges for the design of hardware, storage, bandwidth, and power consumption. The above problems restrict the development of CNN accelerators based on the Sparse-Winograd algorithm.

In order to solve the above problems, we propose the FDWA method based on the Sparse-Winograd algorithm, which operates by decomposing the input feature map and filter on the basis of the Sparse-Winograd F(2×2, 3×3) algorithm. As shown in Eq. (4), the decomposition is based on F(2×2, 3×3) and chiefly exploits the fact that the parameters in the transformation matrices A, B and G take only the values (0, ±1, ±1/2). These are easy to realize with shift or add hardware; not only is this hardware implementation simple and easy to control, but it also reduces the hardware cost.

Fig. 4 describes the process of F(2×2, 5×5), showing the Conventional Convolution steps after the input feature map and the filter are expanded with zeros, along with the meaning of the parameter notation. Accordingly, Conv(K,W) can be expressed by Eq. (5); moreover, Eqs. (6) to (9) outline how the specific parameters are calculated.

Figure 4: Explanation of parameters in the conventional convolution process

Fig. 5 describes the FDWA process in detail, with F(2×2, 5×5) as an example. As shown in the figure, the input feature map tile and the filter are first zero-padded to 7×7 and 6×6 respectively, after which the four corner 4×4 blocks of the feature map tile (K1, K2, K3, K4) are respectively convolved with the four 3×3 blocks (W1, W2, W3, W4) of the convolution kernel; in this way, F(2×2, 5×5) can be calculated by a set of F(2×2, 3×3) operations. Eqs. (10) to (13) describe each F(2×2, 3×3) calculation result and the significance of the corresponding terms.

Figure 5: FDWA

As shown in Tab. 3, the parameters in Eqs. (6) to (9) are equivalent to those in Eqs. (10) to (13). From Tab. 3, we can also obtain the final process of the F(2×2, 5×5) calculation based on F(2×2, 3×3), shown in Fig. 6.


Table 3: Relationships between parameters for conventional and new convolutions

Figure 6: Decomposition calculation method of F(2×2,5×5)
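The decomposition of Figs. 5 and 6 can be checked numerically. The sketch below (reusing winograd_f2x2_3x3 from the earlier sketch) zero-pads the 5×5 filter to 6×6 and the 6×6 input tile to 7×7, takes the four corner blocks, and confirms that the sum of the four F(2×2,3×3) results equals direct F(2×2,5×5) correlation; the block indexing follows the quadrant split described above and is assumed to match Tab. 3.

```python
def fdwa_f2x2_5x5(x6, w5):
    """F(2x2, 5x5) as the sum of four F(2x2, 3x3) tiles (decomposition of Fig. 5).
    x6: 6x6 input tile, w5: 5x5 filter -> 2x2 output tile."""
    x7 = np.zeros((7, 7)); x7[:6, :6] = x6       # zero-pad the input tile to 7x7
    w6 = np.zeros((6, 6)); w6[:5, :5] = w5       # zero-pad the filter to 6x6
    K = [x7[0:4, 0:4], x7[0:4, 3:7], x7[3:7, 0:4], x7[3:7, 3:7]]   # K1..K4
    W = [w6[0:3, 0:3], w6[0:3, 3:6], w6[3:6, 0:3], w6[3:6, 3:6]]   # W1..W4
    return sum(winograd_f2x2_3x3(k, w) for k, w in zip(K, W))

x6, w5 = np.random.rand(6, 6), np.random.rand(5, 5)
direct = np.array([[np.sum(x6[i:i + 5, j:j + 5] * w5) for j in range(2)]
                   for i in range(2)])
assert np.allclose(fdwa_f2x2_5x5(x6, w5), direct)
```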

Fig. 7 describes three cases of the above FDWA, specifically F(2×2, 5×5), F(2×2, 6×6), and F(2×2, 7×7). As can be seen from Fig. 7b, the FDWA with a filter size of 6×6 is the same as that of F(2×2, 5×5); i.e., for filter sizes between 3 and 6 the partition method is the same. Fig. 7c describes the FDWA process in the case of F(2×2, 7×7), which is also applicable to filter sizes between 7 and 9. When utilizing the FDWA, it is easy to realize an accelerator design with F(2×2, 3×3) as the hardware core. This design is not only simple to control, but also requires less pre-operation, expanding applicability and reducing hardware consumption.

    4 Accelerator Architecture Design

In this section, the accelerator scheme is comprehensively introduced.

    4.1 Overview of Accelerator Design

Fig. 8 illustrates the system-level architecture of the accelerator. Its key components are the processor, DRAM, the Processing Units, and the system interconnection bus; the accelerator itself is organized into the three modules described below.

Figure 7: Decomposition calculation method of F(2×2,5×5), F(2×2,6×6), and F(2×2,7×7)

Figure 8: Structural overview of the accelerator

1) Control module: This component is responsible for receiving the task information sent by the CPU. When a task is loaded onto the accelerator and calculation begins, the module distributes the task to the calculation module and prompts the DMA module to either move data to the buffer or write to the external DDR through the interconnection configuration. After calculation is complete, the module moves the result to the external DRAM and notifies the CPU.

2) Data read (write) and conversion module: The input feature tile data is read from the buffer, after which it is transferred to the Winograd domain by the input conversion module, which implements the B^T(·)B transformation. The filter data is read from the external memory into the filter buffer; its conversion to the Winograd domain is then completed by performing the G(·)G^T transformation. Because these two data conversion processes only involve sign changes and shift operations, the conversion logic does not need to be complex. After the dot-product operation is completed, the A^T(·)A operation is executed to transform the resulting data into a 2×2 convolution output; after the configurable accumulator operation, the expired data in the input feature map buffer is finally overwritten and the result is written back to the external memory.

3) Data calculation module: After conversion to the Winograd domain, the data enters the calculation module, which completes the dot-product operation, after which the data flows to the accumulation module. During the accumulation operation, information from the controller is received, and GCONV and DSC are completed by separating the channels. In the Sub-Channel operation of DSC, the data from all input channels are converted directly and then saved to the on-chip buffer. For GCONV, moreover, the input channels are divided into several groups, after which the data in each group is aggregated and stored in the on-chip buffer following conversion.

    4.2 Structural Design of Computing Module

The calculation module processes the converted data of the input feature map and filter. Fig. 9 illustrates the overall architecture of the calculation module, which can be further divided into two main calculation sub-modules, namely the Multi-Channel Wise Summation and Multiplication Calculation Array modules. These respectively complete the accumulation and dot-product calculations by configuring different accumulator functions, and can thereby realize the calculation of DSC, GCONV, and Conventional Convolution.

Figure 9: The architecture of the computing module

In the Multiplication Calculation Array module, since the accelerator uses the FDWA based on F(2×2, 3×3), its core calculation is the dot product between two matrices of size 4×4. Since the Sparse-Winograd algorithm introduces a great deal of sparsity into the dot-product calculation, the dot product of the 4×4 matrices can be realized with fewer multipliers. Thus, in this design, a 3×3 array of multipliers is used to realize the 2D dot-product calculation; that is, 9 multipliers are designed into each PE. This is mainly because both the input feature block and the convolution kernel block pass through ReLU and Prune respectively following the Winograd domain conversion, so the overall density of the original parameters decreases by about 40% to 50%. Therefore, it is reasonable to choose 9 multipliers, which meets most of the multiplication requirements in one cycle. Of course, there are certain (albeit very few) cases in which more than 9 multiplication operations are needed; when this occurs, the single PE pipeline must be paused for two additional cycles. The dot-product calculation unit in the entire calculation module is composed of 8×8 such PEs; this means that 64 F(2×2, 3×3) calculations can be performed simultaneously. In order to realize the FDWA efficiently and accurately, after the input data is converted to the Winograd domain, a highly efficient parallel operation module is realized by the input feature map channel selector and the filter data distributor.
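A rough sanity check of the 9-multiplier choice, under the simplifying assumption that each of the 16 positions of a transformed 4×4 tile is non-zero independently with some effective density (the density values below are illustrative, not measured):

```python
from math import comb

def prob_exceeds(n_mults=9, positions=16, density=0.5):
    """P(more than n_mults of the 16 products are non-zero), assuming each position
    of the 4x4 tile is non-zero independently with the given density."""
    return sum(comb(positions, k) * density ** k * (1 - density) ** (positions - k)
               for k in range(n_mults + 1, positions + 1))

for d in (0.3, 0.4, 0.5):    # illustrative effective densities of the product U * V
    print(f"density {d:.1f}: P(> 9 non-zeros per tile) = {prob_exceeds(density=d):.4f}")
```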

The output of the dot-product calculation module is passed to a specific configurable accumulator for processing. By setting different multi-channel accumulator functions, the convolution acceleration calculation of lightweight CNNs can be realized; this also ensures the configurability of the accelerator.

As shown in Eq. (4), the parameters in the conversion matrices based on F(2×2, 3×3) take only the values (0, ±1, ±1/2); thus, the multiplication and division operations can be replaced by shift, addition and subtraction operations, which reduces the computational complexity and saves hardware resources. Fig. 10a presents the conversion template of the conversion matrix B, while Fig. 10b outlines the conversion template of its transpose; the combination of the two forms the conversion module of the input feature block. Similarly, the transformation matrices A and G, along with their transposes, can also be turned into transformation templates consisting of addition, subtraction and shift operations.

Figure 10: Conversion template of the conversion matrix B
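To illustrate why these templates need no multipliers, the following sketch writes out the 1D forms of the two input-side transforms using the standard F(2,3) matrices (assumed to match Fig. 10's 2D templates, which apply the same pattern along rows and then columns): B^T d needs only additions and subtractions, and G g needs additions plus one right shift for the ±1/2 entries.

```python
def btd_1d(d):
    """B^T d for F(2,3): only additions and subtractions (entries of B are 0, +/-1)."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

def gg_1d(g):
    """G g for F(2,3) on fixed-point integers: the +/-1/2 entries become a right shift."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) >> 1, (g0 - g1 + g2) >> 1, g2]
```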

    4.3 Line Buffer Module Design

Since the on-chip storage space is insufficient to store the entire input feature map and filter, a specific storage structure is required if we are to improve the degree of data reuse and to smooth the data transfer between storage structures at all levels. Fig. 11 illustrates the line buffer module used in this design. For an input feature map of size H×W, there is only low correlation between the data of the different channels. After tiling is complete, the horizontal line buffer contains M×W elements. In the buffer zone, after the input feature map has been divided into tiles according to the block algorithm, the Winograd domain conversion module completes the conversion and inputs the result into the PEs to perform the calculation. Moreover, in order to improve the calculation efficiency, the data-handling time is hidden within the data calculation. The input feature map buffer further adopts a 'double buffer' setting; that is, while the sliding window slides over m line buffers, the other m line buffers are being prepared for the next calculation. In the convolution kernel buffer, the convolution kernel is independently decomposed for each channel according to the decomposition method, after which the Sparse-Winograd algorithm is executed on the corresponding input feature block.

Figure 11: Storage architecture overview
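A toy model of the 'double buffer' (ping-pong) behaviour described above, just to make the role swap explicit; it is not a model of the RTL structure:

```python
class PingPongBuffer:
    """Toy model of the double-buffered line buffer: the PE array reads one bank
    while the DMA fills the other, so data movement is hidden behind computation."""
    def __init__(self):
        self.banks = [None, None]
        self.compute_bank = 0

    def load(self, tile):
        self.banks[1 - self.compute_bank] = tile   # DMA writes the idle bank

    def next_tile(self):
        self.compute_bank = 1 - self.compute_bank  # the two banks swap roles
        return self.banks[self.compute_bank]
```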

    5 Experimental Analysis and Results

    5.1 Experimental Analysis

In this section, an analytical model is established to support the theoretical analysis of the accelerator's performance. Tab. 4 presents the meaning of each parameter.

In theory, the total number of multiplication operations (Counter_multi) performed by a single-layer convolution in the accelerator design scheme can be calculated by Eq. (14), as follows:

The data throughput of the entire accelerator in a single cycle is (16×16×4×4) data points; thus, the required number of clock cycles can be expressed as follows:

Furthermore, the time required for the whole accelerator to run a single-layer convolution, T_total, can be expressed as follows:

Table 4: The meaning of each parameter
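Eqs. (14)–(16) themselves are not reproduced here, so the following is only a hypothetical reading of the analytical model implied by the surrounding text: the cycle count is the per-layer workload divided by the per-cycle throughput, and the runtime is the cycle count divided by the 200 MHz operating frequency. The parameter names are invented for illustration.

```python
def analytic_runtime(workload_points, freq_hz=200e6, points_per_cycle=16 * 16 * 4 * 4):
    """Hypothetical reading of Eqs. (15)-(16): cycles = workload / per-cycle throughput,
    total time = cycles / operating frequency. Parameter names are invented."""
    cycles = workload_points / points_per_cycle
    return cycles, cycles / freq_hz
```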

    5.2 Experimental Results

SystemC is a collaborative software/hardware design language, as well as a system-level modeling language. It contains a series of C++ classes and macros, and further provides an event-driven simulation core enabling system designers to use C++ syntax to simulate parallel processes. In this paper, SystemC is used to accurately simulate each module and thread so as to count the number of cycles under a specific load. In order to fully consider the accelerator's acceleration effect under various convolution conditions, this paper utilizes workloads that are typical representatives of Conventional Convolution and of the new convolution methods, namely VGG-16 and MobileNet V1 respectively. The architecture parameters of each network are listed in Tab. 5. Each DSC layer consists of a depthwise convolution and a pointwise convolution (SC), while the number of output channels is given by the SC convolution kernel size, i.e., 1×1×n.

Table 5: The network architecture parameters of VGG-16 (left) and MobileNet V1 (right)

The first layer of the network, being the original input, does not pass through either a ReLU layer or a Prune layer, so its parameter density is 100%. In the simulation, the accelerator operating frequency is set to 200 MHz, while the data is represented in 8-bit fixed point. Fig. 12 plots the running time after hardware implementation based on Conventional Convolution (in blue), the Winograd algorithm (in red), the Sparse-Winograd algorithm (in green), and the FDWA (in purple) under VGG-16. It can be seen from the figure that the dynamic sparseness introduced by the Sparse-Winograd algorithm both reduces the hardware workload and improves the execution speed. The Sparse-Winograd algorithm and the FDWA achieve the fastest running times; accordingly, it can be seen that the FDWA does not significantly reduce the running speed in exchange for improving hardware applicability and reducing hardware consumption. However, it should be noted that the FDWA is slow in the first convolutional layer; this is mainly due to the insufficient sparsity of the first layer's parameters, which results in pipeline interruptions and an increased number of idle cycles.

Figure 12: Running time of VGG-16 under different convolution methods

Moreover, Fig. 13 plots the running time after hardware implementation based on Depthwise Separable Convolution (in blue), the Winograd acceleration algorithm (in red), the Sparse-Winograd acceleration algorithm (in green), and the FDWA (in purple) under the MobileNet V1 workload. The figure shows that the Sparse-Winograd acceleration algorithm and the FDWA still obtain a very good acceleration effect. However, as MobileNet V1 subsequently performs DSC, although the input image size decreases, the large number of channels increases the number of convolutional-layer operations, and the accelerator control time and storage time also increase, resulting in an overall increase in total operation time.

Figure 13: Running time of MobileNet V1 under different convolution methods

From the comparative analysis of Figs. 12 and 13, it can be seen that MobileNet V1 requires a shorter running time than the traditional VGG-16, which demonstrates the advantages of the new convolution methods, as well as showing that the lightweight neural network is more suitable for embedded or mobile devices. It should, however, be noted that the actual running time of both is longer than the theoretically derived time. This is mainly because the time spent on system initialization, pipeline suspension, and data handling cannot be entirely hidden. Nevertheless, for both MobileNet V1 and VGG-16, it can be seen that the FDWA delivers an acceleration effect of 3x–4.15x, and further achieves an improvement of 1.4x–1.8x compared with the traditional Winograd algorithm. Moreover, an accelerator that utilizes the FDWA has unparalleled advantages in terms of hardware flexibility: it is not only compatible with different convolution methods, but is also capable of being adapted to different convolution sizes.

Tab. 6 presents a comparison between the design scheme of this paper and several mainstream accelerator design schemes, with a particular focus on the differences in hardware resource consumption and utilization rate. From the table, it can be concluded that the accelerator design scheme proposed in this paper not only resolves the issue of limited accelerator applicability, but also exhibits a great advantage in terms of its utilization of hardware resources; this is mainly due to the dynamic sparsity brought about by the Sparse-Winograd algorithm and the skip-zero module in the calculation module, which reduce the amount of multiplication required, save on hardware consumption and improve hardware (particularly multiplier) utilization.

Table 6: Hardware utilization of different design schemes

    6 Conclusion

Based on the Sparse-Winograd algorithm, a highly parallel design of a reconfigurable CNN accelerator built on Sparse-Winograd F(2×2, 3×3) is realized in this paper. Our proposed design not only improves the applicability of the algorithm, but also avoids the complex pre-operations generally associated with Winograd-series algorithms and reduces the difficulty of hardware implementation. Experimental verification shows that the Sparse-Winograd decomposition method not only improves the operational efficiency of the algorithm in hardware, but also realizes an acceleration of 3x–4.15x for VGG-16 and MobileNet V1, which is 1.4x–1.8x better than the Winograd algorithm.

Funding Statement: This paper and the research results were supported by the Hunan Provincial Science and Technology Plan Project under Grant No. 2018XK2102. Yunping Zhao, Jianzhuang Lu and Xiaowen Chen all received this grant.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
