
    FPGA Optimized Accelerator of DCNN with Fast Data Readout and Multiplier Sharing Strategy

    Computers, Materials & Continua, 2023, Issue 12

    Tuo Ma, Zhiwei Li, Qingjiang Li, Haijun Liu, Zhongjin Zhao and Yinan Wang

    College of Electronic Science and Technology, National University of Defense Technology, Changsha, 410073, China

    ABSTRACT With the continuous development of deep learning, the Deep Convolutional Neural Network (DCNN) has attracted wide attention in industry due to its high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the Field Programmable Gate Array (FPGA) has the advantages of being programmable, low power consumption, parallelism, and low cost. However, the enormous amount of computation in DCNN and the limited logic capacity of FPGA restrict the energy efficiency of the DCNN accelerator. The traditional sequential sliding window method can improve the throughput of the DCNN accelerator through data multiplexing, but its data multiplexing rate is low because it repeatedly reads the data between rows. This paper proposes a fast data readout strategy via a circular sliding window data reading method, which improves the multiplexing rate of data between rows by optimizing the memory access order of the input data. In addition, the multiplication bit width of the DCNN accelerator is much smaller than that of the Digital Signal Processing (DSP) blocks on the FPGA, which means that resources are wasted if one multiplication occupies a single DSP. A multiplier sharing strategy is therefore proposed: the multiplier of the accelerator is customized so that a single DSP block can complete multiple groups of 4, 6, and 8-bit signed multiplications in parallel. Finally, based on the two strategies above, an FPGA optimized accelerator is proposed. The accelerator is implemented in Verilog and deployed on a Xilinx VCU118. When the accelerator recognizes the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, which is 1.73× that of previous DCNN FPGA accelerators. When the accelerator recognizes the IMAGENET dataset, its energy efficiency is 41.12 GOPS/W, which is 1.28×–3.14× that of other designs.

    KEYWORDS FPGA; accelerator; DCNN; fast data readout strategy; multiplier sharing strategy; network quantization; energy efficiency

    1 Introduction

    In recent years, with the rapid development of artificial intelligence, various applications based on DCNN have attracted extensive attention in industry. DCNN has not only made meaningful progress in the field of computer vision, such as image recognition and detection [1], target tracking [2], and video surveillance [3], but also made breakthroughs in other areas, such as natural language processing [4], network security [5], big data classification [6], speech recognition [7], and intelligent control [8].

    Currently, the mainstream hardware deployment platforms of DCNN include CPU, ASIC, FPGA, etc. The advantages of the CPU are its mature micro-architecture and advanced semiconductor device technology, but its serial computing mechanism limits the operational speed of DCNN. An ASIC is an integrated circuit designed and manufactured according to user demand. It has the advantages of high computing speed and low energy consumption, but its disadvantages are high design cost and long design time. An FPGA is a field programmable gate array that allows repeated programming and has the advantages of low energy consumption, parallelism, and low cost, so FPGAs can be used for parallel computing, cloud computing [9], neural network inference [10,11], and other applications.

    However, the enormous network parameters and computing scale of DCNN, as well as the limited logic capacity of FPGA, make hardware deployment on FPGAs inefficient. In order to improve the operation speed, various methods including data multiplexing [12], network quantization [13–15], and network pruning [16,17] have been proposed. The traditional sequential sliding window (SSW) method [18,19] has two drawbacks. One is that the multiplexing rate of sliding window data between rows is low, and the other is that the storage order of the operation results is not convenient for the operation of the pooling layer. Besides, much previous research into reducing the energy consumption of FPGA mainly focused on improving the utilization of FPGA resources. DCNN has numerous multiplications, which occupy many DSP blocks on the FPGA. The DSP48E2 on Xilinx FPGAs supports a 27 bit × 18 bit multiplication. After network quantization, the data width of multiplication is often 8 bits or less, so multiple multiplications can be completed on a single DSP [20]. The SIMD architecture proposed in [21] supports the multiplication of two groups of 8-bit data by a single DSP; its disadvantage is that the input data of the two groups of multiplication are the same and the input data are unsigned. The SDMM architecture proposed in [22] supports six groups of 4-bit data multiplication, four groups of 6-bit data multiplication, or three groups of 8-bit data multiplication on a single DSP. However, its disadvantage is that the quantized weight needs further processing, which makes the weight deviate from its actual value; for example, the 5-bit weight data is 10011, but the actual weight data used for calculation is 10010, thus affecting the accuracy of the quantized network. In addition, the operation needs to be supplemented by shift registers and adders, which increases the overhead of other logic resources. In recent years, Xilinx's official white paper has proposed a structure [23] that enables a single DSP (DSP48E2) to complete four groups of 4-bit data multiplication. However, the structure requires that the input data be unsigned, that each pair of multiplications share the same input data, and that each pair of multiplications share the same weight data.

    In order to improve the energy efficiency of DCNN deployed on FPGA, this paper proposes an FPGA optimized accelerator, which focuses on improving the operation speed and reducing energy consumption. 1) A circular sliding window (CSW) data reading method is proposed. Compared with other methods, it adopts a ‘down-right-up-right’ circular sliding window mechanism to read the input data. Some input data between rows are multiplexed via this method, so the CSW data reading method takes less time to finish the data reading of a single channel. Moreover, the CSW data reading method places output data in adjacent locations, which saves the data rearrangement time before entering the pooling unit. 2) Two types of customized multipliers are introduced to implement multiple groups of signed multiplications in a single DSP. Compared with other multipliers, the customized multipliers in this paper use sign-magnitude representation to encode signed data, so neither bit expansion nor result correction is required. The customized multipliers in this paper can still support operations in which both the weight data and the input data are signed.

    2 Energy Efficiency Improvement Strategies

    This section introduces two strategies used by the FPGA optimized accelerator to improve energy efficiency. Firstly, the operation speed of the FPGA-based DCNN accelerator is improved through the fast data readout strategy. Secondly, the multiplier sharing strategy is adopted to reduce the energy consumption of the FPGA.

    2.1 Fast Data Readout Strategy

    In order to improve the operation speed of the FPGA-based DCNN accelerator, research can be carried out from two aspects: reducing the data bandwidth requirement during network operation and reducing the amount of network computation. This paper mainly studies the former, reducing the data bandwidth requirement during network operation.

    There are numerous multiplication and accumulation operations in the operation process of DCNN, but, limited by the logic capacity of the FPGA, the accelerator can often only perform part of the multiplications and additions of a single layer at a time, so it is necessary to store the intermediate data generated by the operation. Considering that the memory bandwidth of the FPGA is fixed, it is necessary to reduce the data bandwidth requirement of network operation. There are three main methods to reduce the data bandwidth requirement of network operation: data multiplexing, network quantization, and standardized data access patterns. This paper mainly uses data multiplexing and network quantization.

    Data multiplexing achieves acceleration by reducing the access frequency to memory and is generally divided into three types [12]: input multiplexing, output multiplexing, and weight multiplexing. Input multiplexing generally refers to multiplexing the data in the input buffer, thereby reducing the access frequency of the input buffer to the on-chip memory. Weight multiplexing generally refers to reducing the frequency of accesses to off-chip memory, where the weights are generally stored. Output multiplexing generally refers to multiplexing the data in the output register, thereby reducing the memory access frequency of the intermediate data. The FPGA optimized accelerator proposed in this paper uses all three data multiplexing methods. For input multiplexing, a fast data readout strategy is proposed, which further improves the multiplexing rate of input data via a CSW data reading method.

    As shown in Fig. 1, the traditional SSW data reading method uses the ‘always right’ sequential sliding window mechanism to read the input data. The CSW data reading method proposed in this paper uses the ‘down-right-up-right’ circular sliding window mechanism to read the input data. Both methods can be applied to process multiple channels in parallel. There are two advantages of the proposed CSW data reading method.

    Figure 1:The CSW data reading method and the traditional SSW data reading method

    The first advantage, as shown in Fig. 2, is that for each input channel, the time cost of the CSW data reading method is less than that of the SSW data reading method. When the kernel size is 3 and the stride length is 1, the SSW data reading method spends 18 clock cycles reading the first 4-group input data of each row, whereas the CSW data reading method only spends 16 clock cycles, because in the ‘r4’ state the SSW data reading method needs to read 3 new input data but the CSW data reading method reads only 1. In addition, the SSW data reading method spends 12 clock cycles reading the non-first 4-group input data of each row, whereas the CSW data reading method only spends 8 clock cycles, because in the ‘r2’ and ‘r4’ states the SSW data reading method needs to read 3 new input data but the CSW data reading method reads only 1.

    Figure 2:Time cost of the CSW and SSW data reading method

    The proposed method can be extended to neural networks with different kernel sizes, padding, and strides. If the size of the input channel is N, the padding is P, the stride is S, and the kernel size is K, then, to read each channel of one convolutional layer in parallel, the time cost of the SSW and CSW data reading methods is shown in Eqs. (1) and (2), respectively. The SSW data reading method uses K rows of input data to obtain 1 row of output data. Compared with the traditional SSW data reading method, the CSW data reading method fully utilizes S+K rows of input data to obtain 2 rows of output data.

    The saving ratio of time cost, defined as (T1−T2)/T1, is shown in Eq. (4), and Fig. 3 illustrates the saving ratio of time cost for frequently used strides (1, 2) and kernel sizes (3×3, 5×5, 7×7). Commonly, the kernel size is larger than the stride length in neural networks such as Resnet50, MobileNetV3, and YoloV3. When the stride is the same, the saving ratio has a positive correlation with the kernel size. When the kernel size is the same, the saving ratio has a negative correlation with the stride. It can also be found that the saving ratio is independent of the padding size P and the input channel size N.

    Figure 3:Saving ratio of clock cycles of various strides and kernel size
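    For illustration, the following Python sketch counts the input reads required by the ‘always right’ SSW order and the ‘down-right-up-right’ CSW order, assuming, as in Fig. 2, that each new input element costs one clock cycle and that data is only reused within one pass (one output row for SSW, two output rows for CSW). It is a simulation of the reading orders described above, not the paper's closed-form Eqs. (1)–(4); the 28×28 output size is an arbitrary example.

```python
def window(r, c, K):
    return {(r + i, c + j) for i in range(K) for j in range(K)}

def count_reads(passes, K):
    # Each pass starts with empty registers; every element a window needs that has
    # not yet been read in this pass costs one clock cycle (cf. Fig. 2).
    total = 0
    for order in passes:
        seen = set()
        for r, c in order:
            cells = window(r, c, K)
            total += len(cells - seen)
            seen |= cells
    return total

def ssw_passes(out_rows, out_cols, S):
    # 'always right': one output row per pass, no data reuse between rows.
    return [[(r * S, c * S) for c in range(out_cols)] for r in range(out_rows)]

def csw_passes(out_rows, out_cols, S):
    # 'down-right-up-right': two output rows per pass (out_rows assumed even).
    passes = []
    for r in range(0, out_rows, 2):
        top, bottom = r * S, (r + 1) * S
        order = []
        for c in range(out_cols):
            pair = [(top, c * S), (bottom, c * S)]
            order += pair if c % 2 == 0 else pair[::-1]
        passes.append(order)
    return passes

# Strides and kernel sizes used in Fig. 3, on a 28x28 output feature map.
for K, S in [(3, 1), (5, 1), (7, 1), (3, 2), (5, 2), (7, 2)]:
    t1 = count_reads(ssw_passes(28, 28, S), K)   # T1: SSW clock cycles
    t2 = count_reads(csw_passes(28, 28, S), K)   # T2: CSW clock cycles
    print(f"K={K}, S={S}: saving ratio (T1-T2)/T1 = {(t1 - t2) / t1:.3f}")
```

    Under these assumptions the sketch reproduces the trends stated above, e.g., a saving ratio of about 0.33 for a 3×3 kernel with stride 1.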

    The second advantage, as shown in Fig. 4, is that for the pooling layer, the cache space of the CSW data reading method is less than that of the SSW data reading method. When both the kernel size and the stride of the pooling layer are 2, the SSW data reading method cannot directly produce 4 output data in adjacent locations; the pooling operation cannot start until one row of output data has been obtained, which requires additional cache space to save the other output data. Compared with the traditional SSW data reading method, the CSW data reading method produces every 4 output data in adjacent locations, so the pooling operation can be completed directly and no cache space is needed to save other output data.

    Figure 4:The output data position of the CSW method(left)and the SSW method(right)

    The stride length and kernel size of the pooling layer affect the size of the cache space. Table 1 summarizes the number of output data rows that need to be cached. When the kernel size and stride length are both even or both odd, as in MobileNet and VGG, whose pooling layers have equal kernel size and stride, the CSW data reading method requires less cache space than the SSW data reading method.

    Table 1:Cached rows of various stride and kernel size

    The fast data readout strategy mainly improves the speed of input data reading, so it is applied to the INPUT CU unit of the FPGA optimized accelerator.

    2.2 Multiplier Sharing Strategy

    In order to reduce the energy consumption of the FPGA, this paper mainly studies from the perspective of reducing the resource consumption of the FPGA. Because the core operations of the convolutional layer and the fully connected layer are both multiplication and addition, this paper shares the same operation array between the fully connected layer and the convolutional layer to improve the resource utilization of the FPGA. In addition, since the DCNN is quantized, the data width of multiplication is 8 bits or less, while one DSP on the FPGA supports a 27 bit × 18 bit multiplication. Therefore, a multiplier sharing strategy can be applied. The multiplier in the operation array is customized so that each DSP can simultaneously complete two or three groups of multiplication, reducing the resource consumption of the FPGA.

    In this paper, sign-magnitude representation is used to encode signed data. For example, −4 is expressed as 1_00100, and 4 is expressed as 0_00100. The ordinary DSP block of an off-the-shelf FPGA device is modified into a Signed-Signed Three Multiplication (SSTM) unit or a Signed-Signed Double Multiplication (SSDM) unit. The SSTM architecture supports one DSP completing the multiplication of three groups of 6-bit signed data at the same time, but the three groups of multiplication need to share the same input data or weight data. The SSDM architecture supports one DSP completing two groups of 6-bit signed data multiplication at the same time, and the input data and weight data of the two groups can be different. The sign bits and magnitude bits in the proposed two structures are operated on separately: the sign bits are handled by XOR gates, and the magnitude bits are handled by the ordinary DSP. This paper mainly introduces the structure of SSTM and SSDM when the data width is 6, but SSTM and SSDM can also be applied to other data widths.

    2.2.1 SSTM

    As shown in Fig. 5, the PE3 unit, an operation unit of the operation array, consists of four SSTM units. A single SSTM unit adopts the structure of a multiplier in which three groups of multiplications are completed by one DSP at the same time.

    The sign bit of the input data (in[5]) and the sign bits of the three groups of weight data (w1[5], w2[5], w3[5]) are extracted and XORed, respectively, to obtain the sign bits of the three groups of multiplication output data.

    Fig. 6 shows the magnitude-bit operation mode of SSTM. Firstly, the magnitude bits of the weight data of the three groups of multiplication are extracted. Then, a new 25-bit weight data is formed by inserting 5-bit zero data between the magnitude bits of the three groups of weight data. Finally, the 25-bit weight data and the magnitude bits of the input data are multiplied to obtain 30-bit output data. The high 10 bits, middle 10 bits, and low 10 bits of the 30-bit output data are the magnitude bits of the output data of the first, second, and third group of multiplication, respectively.

    Figure 5:PE3 unit architecture

    Figure 6:Magnitude bits operation mode of SSTM
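    The packing just described can be summarized by the following Python sketch, a behavioural model of one SSTM unit rather than the Verilog implementation: three 5-bit weight magnitudes are packed into one 25-bit operand with 5-bit zero gaps (the first group in the high bits, as in Fig. 6), a single wide multiplication produces all three products, and each output sign comes from one XOR gate.

```python
def sstm(in_val, w1, w2, w3, width=6):
    # Sign-magnitude split: sign bit plus (width-1)-bit magnitude, e.g. -4 -> (1, 00100).
    mag = width - 1                                    # 5 magnitude bits for 6-bit data
    split = lambda x: (1 if x < 0 else 0, abs(x))
    s_in, m_in = split(in_val)
    signs, mags = zip(*(split(w) for w in (w1, w2, w3)))

    # 25-bit weight word: three 5-bit magnitudes separated by 5-bit zero gaps.
    packed_w = (mags[0] << 4 * mag) | (mags[1] << 2 * mag) | mags[2]

    product = packed_w * m_in                          # the single 25 x 5 multiplication done by the DSP (30 bits)

    out = []
    for i, s_w in enumerate(signs):                    # high / middle / low 10-bit fields = groups 1 / 2 / 3
        m_out = (product >> ((2 - i) * 2 * mag)) & ((1 << 2 * mag) - 1)
        out.append(-m_out if (s_w ^ s_in) else m_out)  # output sign from one XOR gate
    return out

assert sstm(-4, 7, -13, 15) == [-28, 52, -60]
```

    The 5-bit zero gaps work because each partial product of two 5-bit magnitudes fits in 10 bits, so the three products never overlap in the 30-bit result.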

    2.2.2 SSDM

    The number of convolutional kernels in the networks is always an integer multiple of 4. If the accelerator only used the SSTM structure, some DSPs would be idle during the calculation of each convolutional layer, which would reduce DSP efficiency, so the accelerator also adopts SSDM structures, which allow the input data of the two multiplications to be different.

    As shown in Fig. 7, the PE2 unit, an operation unit of the operation array, consists of two SSDM units. A single SSDM unit adopts the structure of a multiplier in which two groups of multiplications are completed by one DSP at the same time.

    Figure 7:PE2 unit architecture

    The sign bits of the two groups of input data (in1[5], in2[5]) and the sign bits of the two groups of weight data (w1[5], w2[5]) are extracted and XORed, respectively, to obtain the sign bits of the two groups of multiplication output data.

    Fig. 8 shows the magnitude-bit operation mode of SSDM. Firstly, the magnitude bits of the weight data of the two groups of multiplication are extracted, and the magnitude bits of the input data of the two groups of multiplication are extracted. Then, a new 25-bit weight data is formed by inserting 15-bit zero data between the magnitude bits of the two groups of weight data, and a new 15-bit input data is formed by inserting 5-bit zero data between the magnitude bits of the two groups of input data. Finally, the 25-bit weight data and the 15-bit input data are multiplied to obtain 40-bit output data. The high 10 bits and low 10 bits of the output data are the magnitude bits of the output data of the first and second group of multiplication, respectively.

    Figure 8:Magnitude bits operation mode of SSDM
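    A corresponding behavioural sketch of one SSDM unit is given below (again an illustration, not the Verilog source): the two weight magnitudes are separated by 15 zero bits and the two input magnitudes by 5 zero bits, so the two wanted products land in the top and bottom 10-bit fields of the 40-bit result, with the cross terms confined to the middle bits.

```python
def ssdm(in1, w1, in2, w2, width=6):
    mag = width - 1                                    # 5 magnitude bits for 6-bit data
    split = lambda x: (1 if x < 0 else 0, abs(x))
    s1, mi1 = split(in1); t1, mw1 = split(w1)
    s2, mi2 = split(in2); t2, mw2 = split(w2)

    packed_w  = (mw1 << 4 * mag) | mw2                 # 25-bit word: |w1|, 15 zero bits, |w2|
    packed_in = (mi1 << 2 * mag) | mi2                 # 15-bit word: |in1|, 5 zero bits, |in2|

    product = packed_w * packed_in                     # the single 25 x 15 multiplication done by the DSP (40 bits)
    hi = (product >> 6 * mag) & ((1 << 2 * mag) - 1)   # bits 39..30: |w1|*|in1| (first group)
    lo = product & ((1 << 2 * mag) - 1)                # bits  9..0 : |w2|*|in2| (second group)
    return [-hi if (t1 ^ s1) else hi, -lo if (t2 ^ s2) else lo]

assert ssdm(-4, 7, 9, -13) == [-28, -117]
```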

    Besides, the SSTM and SSDM structures can also be applied to data whose width is not 6. Table 2 gives the number of multiplications that can be implemented by the SSTM structure when the data width is 4, 6, or 8: the smaller the data width, the more multiplications the SSTM structure can implement. As for the SSDM structure, for the various data widths it can only implement 2 groups of multiplication, and the largest supported data width is 6, which is limited by the operand width of the DSP48E2. Table 3 gives the width of the packed input data and weight data when the width of the input data and weight data is 4, 5, or 6.

    Table 2:SSTM structure for various data width

    Table 3:SSDM structure for various data width

    Overall, the SSTM and SSDM structures can be used in other quantized networks. SSDM is limited by the quantization precision, but with the development of lightweight networks, more complicated networks can be quantized to lower data widths; for example, [24] quantizes the data width of Resnet18 to 4 bits. The multiplier sharing strategy mainly reduces the DSP resource waste of the multiplier, so it is applied to the calculation unit of the FPGA optimized accelerator.

    3 Accelerator

    This section introduces the hardware circuit structure of the FPGA optimized accelerator. As shown in Fig. 9, the accelerator is mainly divided into three parts: a memory cell, a control unit, and a calculation unit. The memory cell stores the data required for the operation, the intermediate data, and the operating results. The control unit controls the data flow direction of the entire architecture and completes simple operations such as addition, quantization, activation, and pooling. The calculation unit completes the multiplications of the whole network.

    The data path of the accelerator is shown in Fig. 9 and is mainly divided into four steps. Firstly, the top control unit (TOP CU) provides control signals to decide the data flow direction of the whole accelerator. Secondly, the input control unit (INPUT CU) reads the input data saved in the RAM group through the RAM selection unit (RAM SEL); at the same time, the weight control unit (WEIGHT CU) reads the weight data saved in the FIFO, and both the input data and the weight data are sent to the PE array. Thirdly, the PE array completes the multiplication of input data and weight data, and the results of the multiplication are sent to the output control unit (OUTPUT CU). Finally, the OUTPUT CU reads the multiplication results and the bias data saved in ROM to complete accumulation, activation, quantization, and max-pooling calculations; a FIFO is used for memory access of intermediate data, and the final results are sent to the RAM group through RAM SEL.

    Figure 9:Architecture of the FPGA optimized accelerator

    When the parameters of DCNN are quantized, the bit widths of the input data and weight data of each layer of DCNN are the same. Therefore, the proposed architecture only implements the circuit structure of one DCNN layer and then continuously multiplexes the entire circuit until all layers of DCNN are calculated.

    3.1 Memory Cell

    The memory cell consists of three parts: DDR4, the RAM group, and ROM. DDR4 mainly stores the weight data, because the amount of DCNN weight data is large and on-chip storage resources cannot meet its needs. In addition, considering that the DDR4 read/write clock is inconsistent with the accelerator master clock, it is necessary to add a FIFO unit between the DDR4 and the control unit for buffering. The RAM group mainly stores the transformable data of the DCNN operation process; it is used to store the input data and output data of a single DCNN layer. It is divided into two groups, with 64 RAMs in each group. Since the architecture proposed in this paper performs layer-by-layer calculation, when one RAM group is used as input to provide input data for the control unit, the other RAM group is used as output to save the output data. ROM mainly stores the invariant data during DCNN operation, namely the bias values and quantization parameters of DCNN.

    3.2 Control Unit

    The control unit consists of six parts: TOP CU, INPUT CU, RAM SEL, WEIGHT CU, the bias data and quantization parameter control unit (B_Q CU), and OUTPUT CU.

    3.2.1 TOP CU

    The TOP CU unit mainly controls the data flow direction of the entire accelerator and provides hyperparameters for the other control units. Its control logic is shown in Algorithm 1. Since the operation unit can only calculate the multiplication and accumulation of 64 groups of input channels and 4 groups of convolution kernels in parallel, the whole control logic is divided into three loops. The first loop completes the operation of all input channels corresponding to the four groups of convolution kernels in the convolutional layer, or the operation of 512 input neurons corresponding to 4 output neurons in the fully connected layer; this operation process includes multiplication, quantization, and activation. The second loop completes the operation of all convolution kernels in a single convolutional layer, or of all output neurons in a single fully connected layer. The third loop completes the operation of all layers of DCNN, and the corresponding pooling operation is completed according to the layer number.
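    To make the loop nesting concrete, the runnable sketch below counts how many times the PE array must be invoked per convolutional layer under this control scheme. Only the group sizes (64 input channels, 4 kernels) are taken from the text; the VGG16 channel counts are the standard ones, and the function and variable names are illustrative, not Algorithm 1 itself.

```python
import math

VGG16_CONV = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256),
              (256, 256), (256, 512), (512, 512), (512, 512), (512, 512),
              (512, 512), (512, 512)]                       # (in_channels, out_channels)

def pe_array_invocations(layers, ch_group=64, k_group=4):
    for idx, (cin, cout) in enumerate(layers, start=1):      # loop 3: all layers of the DCNN
        passes = 0
        for _ in range(math.ceil(cout / k_group)):           # loop 2: 4 convolution kernels at a time
            for _ in range(math.ceil(cin / ch_group)):       # loop 1: 64 input channels at a time
                passes += 1                                   # one multiply-accumulate pass on the PE array
        print(f"conv layer {idx}: {passes} PE-array passes")

pe_array_invocations(VGG16_CONV)
```

    These pass counts correspond to the calculation rounds ("epoch") discussed in Section 4.2.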

    3.2.2 INPUT CU

    The INPUT CU unit mainly reads the input data from the memory cell to the calculation unit. As shown in Fig. 10, the INPUT CU unit can read 64 channels of data in parallel and transmit 5 data of each channel to the calculation unit in a single clock cycle. The INPUT CU reads the input data using the CSW data reading method.

    Figure 10:Input data flow direction

    The CSW data reading method contains four states; each state updates the 9 input data stored in registers, and the 9 input data are transmitted to the calculation unit in 2 cycles. In the first state, the input data in the ‘1’, ‘4’, ‘5’, ‘6’, and ‘0’ positions are sent to the PE array in the first clock cycle, the input data in the ‘2’, ‘6’, ‘9’, ‘6’, and ‘0’ positions are sent to the PE array in the second clock cycle, and the second state is then entered through a ‘down’ sliding window. In the second state, the input data in the ‘5’, ‘8’, ‘9’, ‘12’, and ‘4’ positions are sent to the PE array in the first clock cycle, the input data in the ‘6’, ‘10’, ‘13’, ‘14’, and ‘4’ positions are sent to the PE array in the second clock cycle, and the third state is then entered through a ‘right’ sliding window. In the third state, the input data in the ‘6’, ‘9’, ‘10’, ‘13’, and ‘5’ positions are sent to the PE array in the first clock cycle, the input data in the ‘7’, ‘11’, ‘14’, ‘15’, and ‘5’ positions are sent to the PE array in the second clock cycle, and the fourth state is then entered through an ‘up’ sliding window. In the fourth state, the input data in the ‘2’, ‘5’, ‘6’, ‘9’, and ‘1’ positions are sent to the PE array in the first clock cycle, the input data in the ‘3’, ‘7’, ‘10’, ‘11’, and ‘1’ positions are sent to the PE array in the second clock cycle, and the first state is then entered through a ‘right’ sliding window.

    The CSW method makes the address control logic of the input data more complicated. However, the structure reads 64 input data at the same position in different channels, so once the address control logic of one input channel is implemented, the other channels simply use the same addresses. As a result, the more complicated control logic causes only a slight increase in logic resources and has little effect on the performance of the entire structure.

    3.2.3 OUTPUT CU

    The OUTPUT CU unit mainly processes the output data of the calculation unit and saves it to the storage unit. As shown in Fig. 11, the OUTPUT CU unit is mainly composed of four COV_FC_OUT units and one COV_ONLY_OUT unit. The COV_FC_OUT unit is composed of a 256-input addition tree, accumulators, Q/A units, pooling units, and several multi-channel selectors. It mainly completes the summation of the multiplication results of one row of the first four rows of the PE array and the accumulation of the multiplication and addition results with the first row, as well as the quantization, activation, and pooling of the results. In addition, if the PE array cannot calculate all channels at a time, the intermediate data can be temporarily stored in the FIFO unit. The COV_ONLY_OUT unit is composed of a 64-input addition tree, an accumulator, a SOFTMAX unit, and several multi-channel selectors. It mainly completes the summation of the multiplication results of the third row of the PE array, the accumulation of the bias value, and the accumulation of the calculation results of several channels. When the fully connected layer is being computed, the third row of the PE array is not working, so the 64-input addition tree is also not working. The SOFTMAX unit mainly obtains the label value of the final operation result.

    Figure 11:Output data control unit

    The Q/A unit completes the quantization and activation of the accumulated results. Considering that the quantization parameters are floating-point, in order to avoid floating-point multiplication, this paper fuses the quantization and activation functions. The fusion principle is as follows: the data width of the input data is x bits, the data width of the output data is 5 bits, and its value range is [0, 31]. As shown in Eq. (7), the quantization and activation operations can be regarded as a piecewise function, where d1–d31 are the piecewise parameters, a is the input, and b is the output.

    As shown in Fig. 12a, the actual hardware deployment uses a 32-channel selector to realize the fused quantization and activation operations. After evaluation, the method using a 32-channel selector costs 306 LUTs and 6 FFs, while the method using a floating-point multiplication directly costs 329 LUTs and 5 FFs; the method used in this work therefore consumes fewer logic resources. In addition, if M is smaller, more LUTs and FFs will be consumed to preserve the accuracy of the calculation.
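    A behavioural sketch of this fused step is given below: the floating-point rescaling b = clamp(round(M·a), 0, 31) is replaced by 31 precomputed integer thresholds d1–d31 and a 32-way selection, so no floating-point multiplier is needed at run time. The scale M and the threshold derivation here are illustrative assumptions; the paper's exact Eq. (7) parameters are not reproduced.

```python
import math

def make_thresholds(M, levels=32):
    # d_k = smallest accumulator value a with round(M*a) >= k
    return [math.ceil((k - 0.5) / M) for k in range(1, levels)]

def qa_selector(a, thresholds):
    # Fused ReLU + quantization: output the number of thresholds that a reaches,
    # which is what 31 comparators feeding a 32-channel selector produce.
    return sum(a >= d for d in thresholds)      # already clipped to [0, 31]

M = 0.013                                       # example floating-point quantization scale
d = make_thresholds(M)
for a in (-50, 0, 100, 1200, 5000):
    assert qa_selector(a, d) == min(max(round(M * a), 0), 31)
```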

    Figure 12:Q/A unit(a),pooling unit(b)and SOFTMAX unit(c)

    For the pooling unit, the accelerator uses the max-pooling operation. Considering that the four data of one pooling window arrive in different clock cycles, a single comparator is used to complete four comparisons in four clocks. At the start of every group of four comparisons, the non-input end of the comparator is reset to 0. The structure is shown in Fig. 12b.
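    A minimal model of this streaming comparator is sketched below, assuming (as after ReLU) that the streamed values are non-negative so that resetting the running maximum to 0 is safe; the 2×2 window values are adjacent in the stream because of the CSW readout.

```python
def streaming_max_pool(stream):
    pooled, running = [], 0
    for i, v in enumerate(stream):
        running = max(running, v)      # the single comparator
        if i % 4 == 3:                 # every fourth value: emit the window max and reset to 0
            pooled.append(running)
            running = 0
    return pooled

assert streaming_max_pool([3, 0, 7, 2, 1, 5, 4, 0]) == [7, 5]
```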

    The SOFTMAX unit, as shown in Fig. 12c, is composed of four comparators. SOFTMAX only works in the last layer and takes three clock cycles to obtain the maximum of four output results. The first and second clock cycles obtain the maximum value and label value of the current four groups of data. The third clock cycle obtains the maximum value and label over the current four groups of data and the previous groups of data. Since the last layer has more than four outputs, the maximum value and the label value of the current four groups of data must be saved.
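    The following sketch models this arg-max reduction over streamed groups of four outputs (a functional illustration, not the register-level design): two comparison stages reduce each group, and a third stage merges with the best value from earlier groups, so only one running (value, label) pair needs to be stored.

```python
def argmax_unit(outputs):
    best_val, best_label = None, None
    for g in range(0, len(outputs), 4):
        group = outputs[g:g + 4]
        # cycles 1-2: pairwise comparator tree over the current four outputs
        a = max(range(2), key=lambda i: group[i])
        b = max(range(2, 4), key=lambda i: group[i])
        cur = a if group[a] >= group[b] else b
        # cycle 3: compare with the maximum of all previous groups
        if best_val is None or group[cur] > best_val:
            best_val, best_label = group[cur], g + cur
    return best_label

assert argmax_unit([1, 9, 3, 2, 8, 4, 12, 0]) == 6
```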

    3.2.4 Other Control Units

    The RAM SEL unit selects the corresponding RAM, according to the layer number, to provide input data for the INPUT CU unit, and selects the corresponding RAM for the OUTPUT CU unit to save the single-layer operation results. The WEIGHT CU unit reads the weight data from DDR4 into the FIFO unit and then transmits the weights in the FIFO to the calculation unit. Since the read/write clock frequency of DDR4 is faster than the clock frequency of the accelerator, a ping-pong mechanism is used to read the weight data. The B_Q CU unit reads the bias data and quantization parameters required for the operation from the ROM to the OUTPUT CU unit.

    3.3 Calculation Unit

    The calculation unit is the core unit of the accelerator, completing the multiplications of the convolutional layer and the fully connected layer. As shown in Fig. 13, the calculation unit is a PE array, which is divided into three rows: 64 PE3 units in the first row, 64 PE2 units in the second row, and 32 PE2 units in the third row. The PE3 unit consists of 4 SSTM units, and the PE2 unit consists of 2 SSDM units.

    Figure 13:PE array

    The input data of the same column in the first row and the second row is the same, and the weight data of each operation unit is different. In addition, the PE array supports two working modes, which are used for the operation of the convolutional layer and the fully connected layer, respectively.

    In the working mode of the convolutional layer, the 64 PE3 units and 96 PE2 units are all working, and the PE array can complete the multiplication of 64 input channels and 4 convolutional kernels in 2 clock cycles. Because each convolutional kernel has 9 groups of weight data, a single clock cycle of the PE array can complete 1152 (64×4×9÷2) groups of multiplication in parallel. The data path is shown in Fig. 13. In the first clock cycle, the data corresponding to the ‘1’, ‘3’, ‘4’, and ‘6’ positions are processed by the 64 PE3 units and 64 PE2 units. In the second clock cycle, the data corresponding to the ‘2’, ‘5’, ‘7’, and ‘8’ positions are processed by the 64 PE3 units and 64 PE2 units. The 64 PE3 units are responsible for the multiplications of three groups of convolutional kernels, and the 64 PE2 units are responsible for the multiplications of one group of convolutional kernels. The remaining 32 PE2 units are responsible for the multiplications of the four groups of convolutional kernels at the ‘0’ position.

    In the working mode of the fully connected layer, the 64 PE3 units in the first row and the 64 PE2 units in the second row are working, while the 32 PE2 units in the third row are not. The PE array can complete the multiplication of 512 input neurons and 4 output neurons in 2 clock cycles, so a single clock cycle of the PE array can complete 1024 (512×4÷2) groups of multiplication in parallel.

    4 Experiments and Results

    This section compares and summarizes the performance of the FPGA optimized accelerator architecture against other published work. Firstly, the network structure used for evaluation and the network quantization procedure are introduced. Secondly, the speed improvement brought by the fast data readout strategy is analyzed. Thirdly, the energy saving of the multiplier sharing strategy is analyzed. Finally, the overall energy efficiency of the accelerator is evaluated.

    4.1 Network Structure and Quantization Result

    4.1.1 Network Structure

    The datasets should be publicly available to ensure that the experiments are reproducible, so the IMAGENET dataset and the CIFAR-10 dataset are selected to be recognized by the FPGA optimized accelerator. In addition, as can be seen from Eq. (3), the absolute inference delay saved by the fast data readout strategy (Tabsolute) is also affected by the image size, so datasets of two different sizes can be used to verify the relationship between Tabsolute and the image size; the corresponding results are shown in Section 4.2. VGG16 [25] is one of the commonly used DCNN network structures. It covers almost all the basic operations of DCNN, such as multiplication and accumulation, batch normalization, and pooling, which gives the experiment wider applicability. In addition, many previous FPGA accelerators select VGG16 as the DCNN model for testing, which is convenient for the performance evaluation of the accelerator proposed in this paper. Therefore, VGG16 is selected as the testing DCNN model of the FPGA optimized accelerator.

    VGG16 is composed of thirteen convolutional layers and three fully connected layers. Each convolutional layer is followed by a bn (batch normalization) layer and an activation function. The 13 convolutional layers are divided into five groups, and after each group of convolutional layers a pooling operation is performed to reduce the feature image size. For the CIFAR-10 dataset, the image size is 32×32×3. For the IMAGENET dataset [26], the image size is 224×224×3. The corresponding network structure is shown in Table 4.

    Table 4:Structure of VGG16

    4.1.2 Network Quantization

    According to whether the quantized network needs retraining, quantization methods are divided into PTQ [27,28] (Post-Training Quantization) and QAT [29,30] (Quantization-Aware Training). This paper mainly uses PTQ to quantize the network. The algorithm converts the parameters of the network from floating point to integer without retraining and has a low impact on accuracy.

    Fig. 14 shows the quantization process of a single convolutional layer, which consists of three steps. First, the parameters w (weight) and b (bias) of the convolutional layer are extracted and fused with parameters such as μy (mean) and σy² (variance) of the bn (batch normalization) layer to obtain the corrected parameters w' (corrected weight) and b' (corrected bias). The bn layer is thus integrated into the weights, so there is no need to deploy the bn layer on the FPGA. Second, the parameters x (input data), w', and b' are quantized, respectively; the quantization equations are shown in Eqs. (8)–(10). Third, the quantized input data (qx−Zx), bias data (qb'−Zb'), and weight data (qw'−Zw') are all integers and are used for the convolution operation. The result (P) multiplied by the quantization parameter (M, where M is floating-point data) and rounded is consistent with the value of the quantized output data.
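    For illustration, the sketch below folds the BN statistics into the convolution parameters and then applies a standard affine PTQ scheme q = round(x/scale) + Z. It follows the common asymmetric PTQ formulation rather than reproducing the paper's exact Eqs. (8)–(10); the example shapes and parameter values are arbitrary.

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fuse the bn layer into the convolution: w' and b' as described in step one.
    scale = gamma / np.sqrt(var + eps)
    w_corr = w * scale.reshape(-1, 1, 1, 1)        # corrected weight w'
    b_corr = (b - mean) * scale + beta             # corrected bias b'
    return w_corr, b_corr

def quantize(x, n_bits=5):
    # Asymmetric PTQ: map the observed range of x onto [0, 2^n - 1].
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point                    # integer code q, scale, zero point Z

w = np.random.randn(64, 3, 3, 3); b = np.zeros(64)
gamma, beta = np.ones(64), np.zeros(64)
mean, var = np.zeros(64), np.ones(64)
w_fold, b_fold = fold_bn(w, b, gamma, beta, mean, var)
qw, Sw, Zw = quantize(w_fold)                      # (qw - Zw) is the integer weight used on the FPGA
```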

    Using the PTQ algorithm, the accuracy of VGG16 in recognizing CIFAR-10 under different data widths is obtained, as shown in Fig. 15. It can be seen that when the data width is 5 bits or above, the accuracy is reduced by less than 1%. Therefore, this paper selects a network quantized to a 5-bit data width for deployment on the FPGA. However, PTQ has the problem of zero-point offsets. The value range of conventional 5-bit data is [−16, 15], but the range of 5-bit data obtained by the PTQ algorithm may be [−17, 14], [−13, 18], and so on. Therefore, when the network is deployed on the FPGA, a 6-bit width is used to store the 5-bit data.

    Figure 14:The operation flow of PTQ

    Figure 15:Accuracy under different data widths

    4.2 Performance Optimized by Fast Data Readout Strategy

    When the accelerator recognizes the CIFAR-10 dataset, the system clock frequency of the accelerator is 200 MHz and the latency for inferring one image is 2.02 ms, so the throughput of the accelerator is 310.62 GOPS. When the accelerator recognizes the IMAGENET dataset, the system clock frequency of the accelerator is 150 MHz and the latency for inferring one image is 98.77 ms, so the corresponding throughput is 313.35 GOPS.
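    As a sanity check, throughput follows directly from throughput = (operations per image) / latency. The short sketch below recovers the per-image operation counts implied by the reported numbers; they agree with the usual VGG16 operation counts (multiplications and additions counted separately) for 32×32 and 224×224 inputs, which is an observation of ours rather than a figure from the paper.

```python
# operations per image implied by the reported throughput and latency
ops_cifar10  = 310.62e9 * 2.02e-3    # ~0.63 GOP per 32x32 image
ops_imagenet = 313.35e9 * 98.77e-3   # ~30.95 GOP per 224x224 image
print(f"{ops_cifar10 / 1e9:.2f} GOP, {ops_imagenet / 1e9:.2f} GOP")
```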

    Fig. 16 illustrates the latency of all convolutional layers during network inference when using the CSW data reading method and the SSW data reading method, from which the inference delay saved in each layer by the CSW data reading method can be seen. For the CIFAR-10 dataset, as shown in Fig. 16a, the convolutional layer latency of the CSW data reading method is 0.667×–0.724× that of the SSW data reading method. Similarly, for the IMAGENET dataset, as shown in Fig. 16b, the convolutional layer latency of the CSW data reading method is 0.667×–0.669× that of the SSW data reading method. It can be observed that the relative time savings vary across different layers. This is because the accelerator cannot complete all the multiplication and accumulation operations of a layer at once; when switching convolution kernels or input channels, there is a switching time of 2–3 clock cycles. The impact of this switching is greater for smaller input image sizes. For example, in the CIFAR-10 dataset, the input image sizes of layers 11–13 are 2×2, resulting in relative time savings of approximately 0.276. In contrast, all input layers in the IMAGENET dataset have image sizes of 14×14 or larger, leading to relative time savings of approximately 0.33 for all layers.

    Figure 16: Inference delay of each layer for one image with the two methods: (a) IMAGENET (b) CIFAR-10

    Table 5 shows the inference time saved in each convolutional layer using the CSW data reading method, from which the relationship between image size and the saved delay can be found. The accelerator's calculation unit uses 4 convolution kernels and 64 channels for convolutional operations. Hence, based on the number of input channels and convolution kernels in Table 4, we can obtain the number of calculation rounds, represented by the “epoch” column in Table 5, for each convolutional layer. Then, using these calculation rounds, we can determine the delay savings of individual channels, represented by the “delay/image” column in Table 5. From the data in the table, it can be observed that the delay savings of individual channels are positively correlated with the image size. Although the time saved for inference in a single channel in layers 11 to 13 of the IMAGENET dataset is higher than that in layers 3 to 4 of the CIFAR-10 dataset, the input image size for layers 11 to 13 of the IMAGENET dataset is smaller than the input image size for layers 3 to 4 of the CIFAR-10 dataset. This difference is caused by the fact that the system clock of the accelerator is lower when processing the IMAGENET dataset than when processing the CIFAR-10 dataset. Therefore, the time saved for inference in a single channel is inversely correlated with the system clock of the accelerator.

    Table 5: Inference time saved in each convolutional layer

    4.3 Performance Optimized by Multiplier Sharing Strategy

    Table 6 shows the resource consumption of the accelerator when recognizing the CIFAR-10 dataset, while Table 7 shows the resource consumption when recognizing the IMAGENET dataset; from these, the difference in hardware resource consumption when the accelerator recognizes datasets of different sizes can be seen, and the DSP consumption can be used to evaluate the DSP efficiency of the accelerator. For the recognition tasks of these two datasets, there is no significant difference in logic resource consumption; the difference lies in the storage resource consumption. It is worth noting that, due to the larger intermediate image size of the IMAGENET dataset, more BRAM resources are required for storing the intermediate image data. This is a characteristic of the FPGA optimized accelerator: for deeper convolutional neural networks, there is no significant increase in logic resource usage, and the storage resources increase only when the intermediate image size grows larger.

    Table 6: Resource utilization (CIFAR-10)

    Table 7:Resource utilization(IMAGENET)

    The energy consumption of the various hardware resources when deploying the accelerator on the FPGA is shown in Fig. 17, from which the energy influence of the different hardware resources can be found; the energy consumption can be used to evaluate the energy efficiency of the accelerator. It is worth noting that the IMAGENET dataset has lower energy consumption for DSP and LOGIC resources compared to the CIFAR-10 dataset, although the logic resource overhead is similar for both datasets. This energy consumption difference is due to the different system clocks used for the two datasets, as a lower system clock reduces energy consumption. According to the experimental data in Tables 6 and 7, the BRAM resource overhead required for the IMAGENET dataset is 11.36× that of the CIFAR-10 dataset; however, the IMAGENET dataset only consumes an additional 1.227 W of BRAM power compared to the CIFAR-10 dataset. This indicates that the energy cost of BRAM is relatively low.

    Figure 17: Power consumption of the accelerator: (a) IMAGENET (b) CIFAR-10

    To reduce the energy consumption of logic resources and ultimately decrease the overall energy consumption of the FPGA optimized accelerator, the accelerator primarily improves the utilization of DSPs. The multiplier of the accelerator has been custom-designed with the SSTM and SSDM architectures, enabling each DSP to perform the multiplication of 2–3 groups of signed data in parallel. The calculation unit of the accelerator consists of 256 SSTM and 192 SSDM units, capable of calculating a total of 1152 groups of multiplication in parallel. According to the data in Table 8, compared to using a single DSP for each multiplication, the accelerator reduces the number of DSPs by 704, thereby reducing the power consumption by 1.03 W. Since the DDR4 MIG IP requires an additional 3 DSPs, the number of DSPs is 3 more than the number of multipliers.

    Table 8:Power consumption of DSP
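    The DSP saving can be checked with the following short calculation, using only the unit counts stated above.

```python
# 256 SSTM units (3 multiplications each) and 192 SSDM units (2 each)
n_sstm, n_ssdm = 256, 192
mults = n_sstm * 3 + n_ssdm * 2      # 1152 parallel multiplications
dsp_shared = n_sstm + n_ssdm         # 448 DSPs with the multiplier sharing strategy
dsp_baseline = mults                 # 1152 DSPs at one multiplication per DSP
print(mults, dsp_baseline - dsp_shared)   # 1152 multiplications, 704 DSPs saved
```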

    The SSTM and SSDM architectures proposed in this paper have some unique features compared to other single-DSP implementations of multiple multiplications. As for the SSTM structure, as can be seen from Table 9, compared with [21], the SSTM structure can complete the multiplication of signed input data and signed weight data. Compared with [22], SSTM consumes far fewer LUTs and keeps the quantized data identical to the data actually used in calculation. As for the SSDM structure, it can support 2 different signed input data; in addition, compared with the similar work in [23], SSDM supports the calculation of data with a higher data width, as shown in Table 10.

    Table 9:Comparison of SSTM structure and other structure

    Table 10:Comparison of SSDM structure and other structure

    4.4 Overall Performance Evaluation

    VGG16 is chosen as the DCNN model for recognizing the CIFAR-10 dataset and the IMAGENET dataset. This paper implements the various parts of the FPGA optimized accelerator in Verilog and deploys it on the Xilinx VCU118. The recognition of the CIFAR-10 dataset and the IMAGENET dataset is evaluated using energy efficiency (GOPS/W) and DSP efficiency (GOPS/DSP) as the evaluation metrics.

    For CIFAR-10 dataset recognition, as shown in Table 11, when the system clock is set to 100 MHz, the FPGA optimized accelerator achieves an energy efficiency of 27.15 GOPS/W, which is 1.17× that of [31].

    Table 11:Comparison with other structures(CIFAR-10)

    The accelerator of [31] uses a pruning algorithm, and its actual calculation quantity is 62% less than that of the FPGA optimized accelerator. However, the FPGA optimized accelerator can complete 2–3 groups of multiplication on a single DSP, so the accelerator of [31] and the accelerator of this paper have similar throughput. In addition, the accelerator of [31] uses the SSW data reading method to read the input data, whereas the FPGA optimized accelerator adopts the CSW data reading method; the CSW data reading method has a higher data multiplexing rate, so the FPGA optimized accelerator has a lower bandwidth requirement and therefore less BRAM resource consumption. As a result, the FPGA optimized accelerator has higher energy efficiency than the other accelerator. Moreover, the optimized accelerator can support a higher system clock: when the system clock is set to 200 MHz, the FPGA optimized accelerator achieves a DSP efficiency of 0.69 GOPS/DSP, which is 1.73× that of [31], and its energy efficiency reaches 39.98 GOPS/W, which is also 1.73× that of [31].

    For IMAGENET dataset recognition, as shown in Table 12, when the system clock is set to 150 MHz, the FPGA optimized accelerator achieves a DSP efficiency of 0.69 GOPS/DSP, which is 1.11×–3.45× that of [32–36]. Additionally, the accelerator has a high energy efficiency of 38.06 GOPS/W, which is 1.18×–2.91× that of [33–37]. When the system clock is set to 200 MHz, although the power consumption becomes higher, the increase in throughput is more significant: the FPGA optimized accelerator achieves a DSP efficiency of 0.92 GOPS/DSP, which is 1.48×–4.6× that of [32–36], and a high energy efficiency of 41.12 GOPS/W, which is 1.28×–3.14× that of [33–37]. A single DSP of the accelerator in [38] can only complete a single multiplication, while a single DSP of the FPGA optimized accelerator completes an average of 2.25 multiplications. Therefore, even though the DSP count of the accelerator in [38] is 2.5 times that of the FPGA optimized accelerator, the calculation scales of the two accelerators are similar. In addition, the accelerator in [38] uses sparse coding to make its weight sparsity reach 88.3%, whereas the FPGA optimized accelerator only uses the quantization algorithm to compress the weight bit width by 68.8%; this means that the accelerator in [38] has lower bandwidth requirements. However, the CSW data reading method reduces the bandwidth requirement of the FPGA optimized accelerator, so the FPGA optimized accelerator has higher throughput. In addition, the FPGA optimized accelerator consumes fewer DSPs, so it has higher energy efficiency.

    Table 12:Comparison with other structures(IMAGENET)

    With more BRAM and DSP resources, other accelerators may achieve higher throughput than the FPGA optimized accelerator. However, it is worth noting that each DSP of the FPGA optimized accelerator performs an average of 2.55 multiplications, and that, by adopting the CSW data reading method, the FPGA optimized accelerator saves approximately 0.33× of the data reading time. These features give the accelerator a high level of energy efficiency.

    5 Conclusion

    In this paper, the FPGA optimized accelerator architecture is proposed. To improve the energy efficiency of the accelerator, the fast data readout strategy is used; compared with the SSW data reading method, it reduces the inference delay by 33%, and the delay savings of individual channels are positively correlated with the image size. Moreover, the multiplier sharing strategy adopted by the accelerator saves 61% of the DSP resources of the accelerator, which leads to a decline in energy consumption. Finally, the DSP efficiency and energy efficiency of the accelerator are evaluated. When the system clock of the accelerator is set to 200 MHz, for the CIFAR-10 dataset, the DSP efficiency is 0.69 GOPS/DSP, which is 1.73× that of previous FPGA accelerators, and the energy efficiency is 39.98 GOPS/W, which is also 1.73× that of others. For the IMAGENET dataset, the DSP efficiency is 0.92 GOPS/DSP, which is 1.48×–4.6× that of previous FPGA accelerators, and the energy efficiency is 41.12 GOPS/W, which is 1.28×–3.14× that of others.

    VGG16 is selected as the DCNN model for testing in this paper, but the fast data readout strategy and the multiplier sharing strategy proposed in this paper are also suitable for other neural networks, such as Resnet and MobileNet; applying them to such networks is the focus of future work.

    Acknowledgement: The authors are very grateful to the National University of Defense Technology for providing the experimental platforms.

    Funding Statement: This work was supported in part by the Major Program of the Ministry of Science and Technology of China under Grant 2019YFB2205102, and in part by the National Natural Science Foundation of China under Grants 61974164, 62074166, 61804181, 62004219, 62004220, and 62104256.

    Author Contributions: T. Ma and Z. Li contributed equally to this work; the corresponding author is Q. Li. The authors confirm their contributions to the paper as follows: study conception and design: T. Ma, Z. Li and Q. Li; data collection: T. Ma and Z. Zhao; analysis and interpretation of results: Z. Li, H. Liu and Y. Wang; draft manuscript preparation: T. Ma and Z. Li. All authors reviewed the results and approved the final version of the manuscript.

    Availability of Data and Materials: The data will be made publicly available on GitHub after the first author, Tuo Ma, completes his degree.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
