
    A Survey of Accelerator Architectures for Deep Neural Networks

Engineering, 2020, Issue 3

Yiran Chen*, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang

    a Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

    b Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106-9560, USA

Keywords: Deep neural network; Domain-specific architecture; Accelerator

A B S T R A C T Recently, due to the availability of big data and the rapid growth of computing power, artificial intelligence (AI) has regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to solve many problems in academia and in industry. Although the explosion of big data applications is driving the development of ML, it also imposes severe challenges of data processing speed and scalability on conventional computer systems. Computing platforms that are dedicatedly designed for AI applications have been considered, ranging from a complement to von Neumann platforms to a "must-have" and stand-alone technical solution. These platforms, which belong to a larger category named "domain-specific computing," focus on specific customization for AI. In this article, we focus on summarizing the recent advances in accelerator designs for deep neural networks (DNNs)—that is, DNN accelerators. We discuss various architectures that support DNN executions in terms of computing units, dataflow optimization, targeted network topologies, architectures on emerging technologies, and accelerators for emerging applications. We also provide our visions on the future trend of AI chip designs.

    1. Introduction

Classical philosophy described the process of human thinking as the mechanical manipulation of symbols. For a very long time, humans have been attempting to create artificial beings endowed with the intelligence of consciousness, which was the initial seed of artificial intelligence (AI). In 1950, Alan Turing mathematically discussed the possibility of implementing an intelligent machine, and proposed "the imitation game," which was later known as the "Turing test" [1]. The Dartmouth summer research project on AI [2], which took place in 1956, is generally considered to be the official seminal event for AI as a new research field. In the subsequent decades, AI experienced several ups and downs. Very recently, due to the availability of big data and the rapid growth of computing power, AI has regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to solve many problems in academia [3,4] and in industry [5].

ML algorithms (including biologically plausible models) originally emulated the behavior of a biological brain explicitly [6]. The human brain is currently regarded as the most intelligent "machine," and has extremely high structural complexity and operational efficiency. Similar to biological nervous systems, the two basic functional units in an ML algorithm are synapses and neurons, which are responsible for information processing and feature extraction, respectively. Compared to synapses, there are more types of neuron models, such as the McCulloch-Pitts [6], sigmoid, ReLU, and integrate-and-fire models [7]. These neuron models all have certain nonlinear characteristics that are required for both feature extraction and neural network (NN) training. Later, "biologically inspired" models were invented as mathematical approaches to realize high-level functions [8]. In general, modern ML algorithms can be divided into two categories: artificial neural networks (ANNs), where data is represented as numerical values [9], and spiking neural networks (SNNs), where data is represented by spikes [10].

Although the explosion of big data applications is driving the development of ML, it also imposes severe challenges of data processing speed and scalability on conventional computer systems. More specifically, traditional von Neumann computers have separate processing and data storage components. The frequent data movement required between processors and off-chip memory limits the system performance and energy efficiency, and this problem has been further exacerbated by the skyrocketing volume of data in AI applications. Computing platforms that are dedicatedly designed for AI applications have been considered, ranging from a complement to von Neumann platforms to a "must-have" and stand-alone technical solution. These platforms, which belong to a larger category named "domain-specific computing," focus on specific customization for AI. Orders-of-magnitude power and performance efficiency improvements have been accomplished by overcoming the well-known "memory wall" [11] and "power wall" [12] challenges. Recent AI-specific computing systems—that is, AI accelerators—are often constructed with a large number of highly parallel computing and storage units. These units are organized in a two-dimensional (2D) way to support common matrix-vector multiplications in NNs. Network-on-chip (NoC) [13], high-bandwidth memory (HBM) [14], data reuse [15], and so forth are applied to further optimize the data traffic in these accelerators. Innovations at three levels—namely, biological theory foundation, hardware design, and algorithms (software)—serve as the three cornerstones of AI accelerators. This article summarizes recent advances in AI accelerators for both data centers [5,16,17] and edge devices [18-20].

Besides conventional complementary metal-oxide-semiconductor (CMOS) designs, emerging nonvolatile memories, such as metal-oxide resistive random access memory (ReRAM), have recently been explored in AI accelerator designs. These emerging memories feature high storage density and fast access time, as well as the potential to implement in-memory computing [21-23]. More specifically, ReRAM arrays can not only store NNs, but also perform in situ matrix-vector multiplications in an analog manner. Compared with state-of-the-art CMOS designs, ReRAM-based AI accelerators can achieve 3-4 orders of magnitude higher computation efficiency [24] due to the low-power nature of analog computation. The noisy analog operation, on the other hand, can be largely tolerated by ML algorithms, as they show great immunity to noise and faults. However, the conversion between analog signals in ReRAM crossbars and digital values in the other digital units of the accelerators requires digital-to-analog converters (DACs) and analog-to-digital converters (ADCs), which cost up to 66.4% of the power consumption and 73.2% of the area overhead in ReRAM-based NN accelerators [25].

In this article, we mainly focus on ANNs. In particular, we summarize the recent advances in accelerator designs for deep neural networks (DNNs)—that is, DNN accelerators. We discuss various architectures that support DNN executions in terms of computing units, dataflow optimization, targeted network topologies, and so forth. This article is organized as follows. Section 2 introduces the basics of ML and DNNs. Section 3 and Section 4 present several representative on-chip and stand-alone DNN accelerators, respectively. Section 5 describes various DNN accelerators implemented with emerging memories. Section 6 briefly summarizes DNN accelerators for emerging applications. Section 7 provides our visions on the future trend of AI chip designs.

    2. Background

    This section presents some background on DNNs and on several important concepts that form a basis for the contents discussed in this article. It also gives a brief introduction of the emerging ReRAM and its application in neural computation.

    2.1. Inference and training of DNNs

In general, a DNN is a parameterized function that takes a high-dimensional input and makes useful predictions—for example, a classification label. This prediction process is called inference. To obtain a meaningful set of parameters, training of the DNN is performed on a training dataset, and the parameters are optimized via approaches such as stochastic gradient descent (SGD) in order to minimize a certain loss function. In each training step, a forward pass is first performed to calculate the loss, followed by a backward pass to back-propagate the error. Finally, the gradient of each parameter is computed and accumulated. To fully optimize a large-scale DNN, the training process can take a million steps or more.

A DNN is typically a stack of NN layers. If we denote the lth layer as a function $f_l$, then the inference of an L-layer DNN can be expressed by the following:

$$y = f_L\left(f_{L-1}\left(\cdots f_2\left(f_1(x)\right)\cdots\right)\right)$$

in which x is the input. In this case, the output of each layer is only used by the next layer, and the whole computation has no back trace. The dataflow of a DNN inference is in the form of a chain and can be efficiently accelerated in hardware without extra demand on memory. This property is true for both feed-forward NNs and recurrent neural networks (RNNs). The "recurrent" structure can be treated as a variable-length feed-forward structure with temporal reuse of one layer's weights, and the dataflow still forms a chain.
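To make the chain structure concrete, the following minimal Python/NumPy sketch (our illustration, not code from any referenced design; the ReLU activation and the layer sizes are assumptions) applies an L-layer feed-forward network as a chain of layer functions, so each intermediate result can be discarded as soon as the next layer has consumed it:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def inference(x, weights):
    """Chain-structured inference: y = f_L(... f_2(f_1(x)) ...)."""
    a = x
    for W in weights:      # each layer's output is consumed only by the next layer,
        a = relu(W @ a)    # so no intermediate result has to be kept in memory
    return a

# Usage with three hypothetical layers of random weights.
layers = [np.random.randn(8, 16), np.random.randn(4, 8), np.random.randn(2, 4)]
y = inference(np.random.randn(16), layers)
```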

In DNN training, the data dependency is twice as deep as it is in inference. Although the dataflow of the forward pass is the same as in inference, the backward pass then executes the layers in a reversed order. Moreover, the outputs of each layer in the forward pass are reused in the backward pass to calculate the errors (because of the chain rule of back-propagation), resulting in many long data dependencies. Fig. 1 illustrates how the training dataflow differs from the inference. A DNN may include convolutional layers, fully connected layers (batched matrix multiplications), and point-wise operation layers such as ReLU, sigmoid, max pooling, and batch normalization. The backward pass may have point-wise operations whose forms are different from those of the forward pass. Matrix multiplications and convolutions retain their computation pattern unchanged in the backward pass; the main difference is that they operate on the transposed weight matrix and the rotated convolutional kernel, respectively.
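The sketch below (again only an illustration, assuming a ReLU network with a squared-error loss and plain SGD) makes this reuse explicit: the forward pass stores every layer output, and the backward pass walks the layers in reverse while consuming those stored activations, which is the doubled data dependency shown in Fig. 1:

```python
import numpy as np

def train_step(x, target, weights, lr=1e-2):
    # Forward pass: keep every layer output, because the backward pass will reuse it.
    acts = [x]
    for W in weights:
        acts.append(np.maximum(W @ acts[-1], 0.0))
    err = acts[-1] - target                       # gradient of a squared-error loss
    # Backward pass: layers are visited in reverse, consuming the stored activations.
    for l in reversed(range(len(weights))):
        err = err * (acts[l + 1] > 0)             # back-propagate through ReLU
        grad = np.outer(err, acts[l])             # weight gradient of layer l
        err = weights[l].T @ err                  # error passed to the previous layer
        weights[l] -= lr * grad                   # SGD update
    return weights
```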

    2.2. Computation patterns

Although a DNN may include many types of layers, matrix multiplications and convolutions dominate over 90% of the operations, and they are the main targets of DNN accelerator designs. For a matrix multiplication, if we use $I_c$, $O_c$, and $B$ to denote the number of input channels, the number of output channels, and the batch size, respectively, the computation can be written as follows:

$$\mathrm{Out}[b][o_c] = \sum_{i_c=0}^{I_c-1} W[o_c][i_c]\cdot \mathrm{In}[b][i_c]$$

    Fig. 1. DNN training dataflow in PipeLayer [22]. Each arrow represents a data dependency.T:logical cycle;L:the ground-truth label;d:the feature map;A:array;W: weight; δ: error; W′: the reformed W.

where $i_c$ is the index of the input channel, $o_c$ is the index of the output channel, and b is the index of samples in a batch, for $0 \le b < B$, $0 \le o_c < O_c$, and $0 \le i_c < I_c$.

Convolutions in a DNN can be viewed as an extended version of matrix multiplications, which adds the properties of local connectivity and translation invariance. Compared with matrix multiplications, in convolutions, each input element is replaced by a 2D feature map, and each weight element is replaced by a 2D convolutional kernel (or filter). The computation is then based on sliding windows: As shown in Fig. 2, starting from the top-left corner of the input feature map, the filter slides toward the right end. When it reaches the right end of the feature map, it moves back to the left end and shifts to the next row. The formal description is shown below:

$$\mathrm{Out}[b][o_c][x][y] = \sum_{i_c=0}^{I_c-1}\sum_{i=0}^{F_h-1}\sum_{j=0}^{F_w-1} W[o_c][i_c][i][j]\cdot \mathrm{In}[b][i_c][x+i][y+j]$$

where $F_h$ is the height of the filter, $F_w$ is the width of the filter, i is the index of the row in a 2D filter, j is the index of the column in a 2D filter, x is the index of the row in a 2D feature map, and y is the index of the column in a 2D feature map, for $0 \le b < B$, $0 \le o_c < O_c$, $0 \le x < O_h$, and $0 \le y < O_w$ ($O_h$: the height of the output feature map; $O_w$: the width of the output feature map).
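A direct loop-nest rendering of this equation in Python/NumPy is given below as an unoptimized reference (our illustration; stride 1 and no padding are assumed), following the index names used in the text:

```python
import numpy as np

def conv2d(inp, filt):
    B, Ic, Ih, Iw = inp.shape          # input feature maps
    Oc, _, Fh, Fw = filt.shape         # convolutional kernels
    Oh, Ow = Ih - Fh + 1, Iw - Fw + 1  # "valid" output size, stride 1 assumed
    out = np.zeros((B, Oc, Oh, Ow))
    for b in range(B):
        for oc in range(Oc):
            for x in range(Oh):        # the window slides down the rows...
                for y in range(Ow):    # ...and across the columns
                    for ic in range(Ic):
                        for i in range(Fh):
                            for j in range(Fw):
                                out[b, oc, x, y] += inp[b, ic, x + i, y + j] * filt[oc, ic, i, j]
    return out
```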

To provide translation invariance, the same convolutional filter is repeatedly applied to all parts of the input feature map, making the data reuse pattern in convolutions much more complex than in matrix multiplications. For easier hardware implementation, it is better to view the 2D sliding window as a two-level hierarchy: The first level is a window of several rows that slides downward to provide inter-row data reuse, and the second level is a window of several elements that slides rightward to provide intra-row data reuse.

Although the computation patterns of matrix multiplications and convolutions are very different, they can actually be converted into each other. Thus, an accelerator designed for one type of computation can still support the other type, although doing so might not be very efficient. Convolutions can be transformed into matrix multiplications through the Toeplitz matrix, as illustrated in Fig. 3, at the cost of introducing redundant data. On the other hand, a matrix multiplication is just a convolution with $O_h = O_w = F_h = F_w = 1$, in which the feature map and the filter are each reduced to a single element.
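The following sketch shows the Toeplitz-style conversion of Fig. 3 in its common "im2col" form (our illustration; unit stride and valid padding are assumed): the input is unrolled, with redundant copies of the overlapping windows, so that one ordinary matrix multiplication produces the convolution output.

```python
import numpy as np

def conv2d_as_matmul(inp, filt):
    Ic, Ih, Iw = inp.shape
    Oc, _, Fh, Fw = filt.shape
    Oh, Ow = Ih - Fh + 1, Iw - Fw + 1
    # Each output position contributes one column holding its receptive field
    # (this is where the redundant data comes from).
    cols = np.empty((Ic * Fh * Fw, Oh * Ow))
    for x in range(Oh):
        for y in range(Ow):
            cols[:, x * Ow + y] = inp[:, x:x + Fh, y:y + Fw].ravel()
    W = filt.reshape(Oc, -1)               # each filter becomes one matrix row
    return (W @ cols).reshape(Oc, Oh, Ow)  # one matrix multiply replaces the convolution
```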

    2.3. Resistive memory

The memristor, also known as ReRAM, is an emerging nonvolatile memory that stores information using cell resistances. In 2008, HP Labs reported its discovery of a nanoscale memristor based on TiO2 thin-film devices [27]. Since then, many resistive materials and structures have been found or rediscovered.

    Fig. 2. Two levels of sliding windows in 2D convolution [26]. *: convolution.

    Fig. 3. Converting convolution to Toeplitz matrix multiplication. *: convolution.

As demonstrated in Fig. 4(a), each ReRAM cell has a metal-oxide layer sandwiched between a top electrode (TE) and a bottom electrode (BE). The resistance of a memristor cell can be programmed by applying a current or voltage with an appropriate pulse width or magnitude. In particular, the data stored in a cell can be represented by the resistance level accordingly: A low resistance state (LRS) denotes bit "1," while a high resistance state (HRS) denotes bit "0." For a read operation, a small sensing voltage is applied across the device; the amplitude of the current is then determined by the resistance.

In 2012, HP Labs proposed a ReRAM crossbar structure that has demonstrated an attractive capability to efficiently accelerate matrix-vector multiplications in NNs. As shown in Fig. 4(b), the vector is represented by the input signals on the wordlines (WLs). Each element of the matrix is programmed into the conductance of a cell in the crossbar array. Thus, the current summed at the end of each bitline (BL) is viewed as the result of the matrix-vector multiplication. For a large matrix that cannot fit in a single array, the input and the output are partitioned and grouped into multiple arrays. The output of each array is a partial sum, which is collected horizontally and summed vertically to generate the actual results.
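As a purely numerical illustration of this crossbar scheme (ignoring device non-idealities, ADC/DAC quantization, and the separate handling of signed values), the sketch below maps matrix entries to conductances, tiles a large matrix over several arrays of an assumed size, and adds the per-array partial sums:

```python
import numpy as np

XBAR = 128  # assumed crossbar dimension

def crossbar_mvm(matrix, vector):
    """Idealized tiled matrix-vector multiply in the crossbar style."""
    rows, cols = matrix.shape
    result = np.zeros(cols)
    for r0 in range(0, rows, XBAR):          # partition the input across arrays
        for c0 in range(0, cols, XBAR):      # partition the output across arrays
            g = matrix[r0:r0 + XBAR, c0:c0 + XBAR]  # conductances of one array
            v = vector[r0:r0 + XBAR]                # wordline input voltages
            result[c0:c0 + XBAR] += v @ g           # bitline currents = partial sums
    return result
```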

    3. On-chip accelerators

In the early stage of DNN accelerator design, accelerators were designed for the acceleration of approximate programs in general-purpose processing [28], or for small NNs [13]. Although the functionality and performance of on-chip accelerators were very limited, they revealed the basic idea of AI-specialized chips. Because of the limitations of general-purpose processing chips, it is often necessary to design specialized chips for AI/DNN applications.

    3.1. The neural processing unit

The neural processing unit (NPU) [28] is designed to use hardware-implemented on-chip NNs to accelerate segments of a program instead of running them on a central processing unit (CPU).

The hardware design of the NPU is quite simple. An NPU consists of eight processing engines (PEs), as shown in Fig. 5. Each PE performs the computation of a neuron; that is, multiplication, accumulation, and sigmoid. Thus, what the NPU performs is the computation of a multilayer perceptron (MLP) NN.
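Functionally, each PE therefore evaluates one sigmoid neuron. The toy sketch below (our illustration only, with arbitrary dimensions) shows the multiply-accumulate-sigmoid operation and how eight such PEs would together evaluate one MLP layer:

```python
import numpy as np

def pe_neuron(inputs, weights, bias=0.0):
    acc = float(np.dot(weights, inputs)) + bias   # multiply and accumulate
    return 1.0 / (1.0 + np.exp(-acc))             # sigmoid activation

def mlp_layer(inputs, weight_matrix):
    # Eight PEs would evaluate eight such neurons in parallel; here we simply loop.
    return np.array([pe_neuron(inputs, w) for w in weight_matrix])

# Usage: an 8-neuron layer over a 16-element input.
out = mlp_layer(np.random.randn(16), np.random.randn(8, 16))
```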

The idea of using the hardware-implemented MLP—that is, the NPU—to accelerate some program segments was very inspiring. If a program segment is ① frequently executed and ② approximable, and if ③ its inputs and outputs are well defined, then that segment can be accelerated by the NPU. To execute a program on the NPU, programmers need to manually annotate a program segment that satisfies the above three conditions. Next, the compiler compiles the program segment into NPU instructions, and the computation tasks are offloaded from the CPU to the NPU at runtime. Sobel edge detection and the fast Fourier transform (FFT) are two examples of such program segments. An NPU can reduce up to 97% of the dynamic CPU instructions and achieve a speedup of up to 11.1 times.

Fig. 4. The ReRAM basics. (a) ReRAM cell; (b) 2D ReRAM crossbar. V: write voltage; Vr: read voltage; BL: bitline; WL: wordline; TE: top electrode; BE: bottom electrode; LRS: low resistance state; HRS: high resistance state.

    Fig. 5. The NPU [28]. (a) Eight-PE NPU; (b) single PE. FIFO: first in, first out.

    3.2. RENO: A reconfigurable NoC accelerator

Unlike the NPU, which is designed for the acceleration of general-purpose programs, RENO [13] is an accelerator for NNs. RENO uses a similar idea for the PE design, as shown in Fig. 6. However, the PE of RENO is based on ReRAM: RENO utilizes the ReRAM crossbar as the basic computation unit to perform matrix-vector multiplications. Each PE consists of four ReRAM crossbars, which respectively correspond to the processing of positive and negative inputs and positive and negative weights. In RENO, routers are deployed to coordinate data transfer between the PEs. Unlike conventional CMOS routers, the routers of RENO transfer the analog intermediate computation results from the previous neuron to the following neuron. In RENO, only the input and final output are digital; the intermediate results are all analog and are coordinated by the analog routers. Data converters (DACs and ADCs) are required only when transferring data between RENO and the CPU.

RENO supports the processing of the MLP and the auto-associative memory (AAM), and corresponding instructions are designed for the pipelining of RENO and a CPU. Because RENO is an on-chip design, the supported applications are limited. RENO supports the processing of small datasets, such as the UCI ML repository [29] and the tailored Modified National Institute of Standards and Technology (MNIST) database [30].

    4. Stand-alone DNN/convolutional neural network accelerator

For broadly used DNN and convolutional neural network (CNN) applications, stand-alone domain-specific accelerators have achieved great success in both cloud and edge scenarios. Compared with general-purpose CPUs and graphics processing units (GPUs), these custom architectures offer better performance and higher energy efficiency. Custom architectures usually require a deep understanding of the target workloads. The dataflow (or data reuse pattern) is carefully analyzed and utilized in the design to reduce off-chip memory access and improve the system efficiency.

In this section, we respectively use the DianNao series [31] and the tensor processing unit (TPU) [5] as academic and industrial examples to explain stand-alone accelerator designs and discuss the dataflow analysis.

    4.1. The DianNao series: An academic example

The DianNao series includes multiple accelerators, as listed in Table 1 [31]. DianNao is the first design of the series. It is composed of the following components, as shown in Fig. 7:

    (1) A computational block neural functional unit (NFU), which performs computations;

    (2) An input buffer for input neurons (NBin);

    (3) An output buffer for output neurons (NBout);

    (4) A synapse buffer for synaptic weights (SB);

    (5) A control processor (CP).

Fig. 6. RENO architecture [13]. MBC: memristor-based crossbars; Vi+: positive input voltage; Vi-: negative input voltage; M+: the MBC mapped with positive weights; M-: the MBC mapped with negative weights; Sum amp: summation amplifier.

    Table 1 DianNao series accelerators [31].

    Fig. 7. The DianNao architecture [18]. DMA: direct memory access; Inst.: instructions; Tn: the number of output neurons.

The NFU, which includes multipliers, adder trees, and nonlinear functional units, is designed as a pipeline. Rather than a normal cache, a scratchpad memory is used as on-chip storage, because it can be controlled by the compiler and can easily exploit data locality.

While efficient computing units are important for a DNN accelerator, inefficient memory transfers can also affect the system throughput and energy efficiency. The DianNao series introduces special designs to minimize memory transfer latency and enhance system efficiency. DaDianNao [16] targets the datacenter scenario and integrates a large on-chip embedded dynamic random access memory (eDRAM) to avoid long main-memory access times. The same principle applies to the embedded scenario. ShiDianNao [19] is a DNN accelerator dedicated to CNN applications. Because of weight sharing, a CNN's memory footprint is much smaller than that of other DNNs. It is possible to map all of the CNN parameters onto a small on-chip static random access memory (SRAM) when the CNN model is small. In this way, ShiDianNao avoids expensive off-chip dynamic random access memory (DRAM) accesses and achieves 60 times higher energy efficiency than DianNao.

PuDianNao [17] is designed for multiple ML applications. In addition to supporting DNNs, it supports other representative ML algorithms, such as k-means and classification trees. To deal with the different data-access patterns of these workloads, PuDianNao introduces a cold buffer and a hot buffer for data with different reuse distances in its architecture. Moreover, compilation techniques, including loop unrolling, loop tiling, and cache blocking, are introduced as a software-and-hardware co-design method to increase the on-chip data reuse and the PE utilization ratios.

On top of the stand-alone accelerators, a domain-specific instruction set architecture (ISA), called Cambricon [32], was proposed to support a broad range of NN applications. Cambricon is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions. The ISA design considers data parallelism, customized vector/matrix instructions, and the use of scratchpad memory.

The successors of the Cambricon series introduce support for sparse NNs. Other accelerators support more complicated NN workloads, such as the long short-term memory (LSTM) and the generative adversarial network (GAN). These works will be discussed in detail in Section 6.

    4.2. The TPU: An industrial example

Fig. 8. TPU block diagram [5]. DDR3: double-data-rate 3; PCIe: Peripheral Component Interconnect Express.

Highlighted with a systolic array, as shown in Fig. 8, Google published its first TPU paper (TPU1) in 2017 [5]. TPU1 focuses on inference tasks, and has been deployed in Google's datacenter since 2015. The structure of the systolic array can be regarded as a specialized weight-stationary dataflow, or a 2D single instruction multiple data (SIMD) architecture.
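The following toy Python sketch illustrates the weight-stationary idea behind a systolic array (a functional model only, not a cycle-accurate description of the TPU): each PE conceptually holds one weight, activations stream through the array, and partial sums accumulate as they pass from PE to PE:

```python
import numpy as np

def systolic_matmul(A, W):
    """Compute A @ W with a PE grid that keeps W stationary (functional sketch)."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    out = np.zeros((M, N))
    for m in range(M):                  # each input row streams through the array
        psum = np.zeros(N)              # partial sums flowing down the columns
        for k in range(K):              # PE row k holds the stationary weights W[k, :]
            psum += A[m, k] * W[k, :]   # every PE performs one multiply-accumulate
        out[m, :] = psum
    return out

# Check against NumPy's matrix multiply.
A, W = np.random.randn(4, 6), np.random.randn(6, 3)
assert np.allclose(systolic_matmul(A, W), A @ W)
```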

After that, in Google I/O'17 [33], Google announced its cloud TPU (also known as TPU2), which can handle both training and inference in the datacenter. TPU2 also adopts a systolic array and introduces vector-processing units. In Google I/O'18 [34], Google announced TPU3, which is highlighted as having liquid cooling. In Google Cloud Next'18 [35], Google announced its edge TPU, which targets the inference tasks of the Internet of Things (IoT).

    4.3. Dataflow analysis and architecture design

A DNN/CNN generally requires a large memory footprint. For large and complicated DNN/CNN models, it is unlikely that the whole model can be mapped onto the chip. Due to the limited off-chip bandwidth, it is of vital importance to increase on-chip data reuse and reduce the off-chip data transfer in order to improve the computing efficiency. During architecture design, dataflow analysis is performed and special consideration needs to be taken. As shown in Fig. 9 [15,36], Eyeriss explored different NN dataflows, including the input-stationary (IS), output-stationary (OS), weight-stationary (WS), and no-local-reuse (NLR) dataflows, in the context of a spatial architecture, and proposed the row-stationary (RS) dataflow to enhance data reuse.

    Fig. 9. Row-stationary dataflow [15,36].

The highly efficient dataflow designs have inspired many practical designs in the AI chip industry. For example, Wave Computing features a coarse-grained reconfigurable array (CGRA)-based dataflow processor [37]. GraphCore focuses on graph architecture and is claimed to achieve higher performance than traditional scalar processors and vector processors on AI workloads [38].

    5. Accelerators with emerging memories

ReRAM [27] and the hybrid memory cube (HMC) [39] are representative emerging memory technologies and memory structures that enable processing-in-memory (PIM). PIM can greatly reduce the data movements in computing platforms, as the data movement between CPUs and off-chip memory consumes two orders of magnitude more energy than a floating-point operation [40]. DNN accelerators can take these benefits from ReRAM and HMC and apply PIM to accelerate DNN executions.

    5.1. ReRAM-based DNN accelerators

The key idea of utilizing ReRAM for DNN acceleration is to use the ReRAM array as a computation engine for matrix-vector multiplications [41,42], as mentioned in Section 2.3. PRIME [21], ISAAC [25], and PipeLayer [22] are three representative ReRAM-based DNN accelerators.

The architecture of PRIME [21] is shown in Fig. 10. PRIME revises the ReRAM for both data storage and computation. In PRIME, the wordline decoders and drivers are configured with multilevel voltage sources, so the input feature map can be applied to the memory array in computation. The column multiplexer is configured with analog subtraction and sigmoid circuitry; thus, partial results from two arrays are combined and sent to the nonlinear activation (sigmoid). The sense amplifier is also reconfigurable in sensing resolution, and performs the functionality of an ADC.

Fig. 10. PRIME architecture [21]. GWL: global wordline; SA: sense amplifier; WDD: wordline decoder and driver; GDL: global data line; IO: input and output; Vol.: voltage source; Col mux.: column multiplexer.

ISAAC [25] proposes an intra-tile pipeline design for NN processing in ReRAM, as shown in Fig. 11. The pipeline design combines data encoding and computation. The IMA is the ReRAM-based in situ multiply accumulate unit. In the first cycle of the pipeline, data is fetched from the eDRAM to the computation tile. The data format in ISAAC is 16-bit fixed point. During computation, in each cycle, 1 bit is input to the IMA, and the computation result from the IMA is converted to digital format, shifted by 1 bit, and accumulated. Therefore, it takes another 16 cycles to process one input. The nonlinear activation is then applied, and the results are written back to the eDRAM.
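This bit-serial scheme can be checked numerically with the sketch below (an idealized model assuming unsigned 16-bit inputs and no ADC quantization): the input is fed 1 bit per cycle, and each cycle's result is shifted by the bit position and accumulated, reproducing the full dot product after 16 cycles.

```python
import numpy as np

def bit_serial_mvm(matrix, int_inputs, bits=16):
    """Bit-serial matrix-vector multiply: one 1-bit input slice per cycle."""
    acc = np.zeros(matrix.shape[1])
    for t in range(bits):
        bit_slice = (int_inputs >> t) & 1    # 1-bit slice of every input value
        partial = bit_slice @ matrix         # one crossbar operation (then digitized)
        acc += partial * (1 << t)            # shift by the bit position and accumulate
    return acc

# Check against the direct computation with small unsigned 16-bit inputs.
M = np.random.randint(0, 4, size=(8, 5))
x = np.random.randint(0, 2**16, size=8)
assert np.allclose(bit_serial_mvm(M, x), x @ M)
```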

Tiled computation architecture is a natural and widely used way to process NNs. It is necessary to explore coarse-grained parallel designs to improve the accelerators' throughput. PipeLayer [22] introduces intra-layer parallelism and an inter-layer pipeline for the tiled architecture, as illustrated earlier in Fig. 1. For intra-layer parallelism, PipeLayer uses a data-parallel scheme, which duplicates processing units with the same weights to process multiple data in parallel. For the inter-layer pipeline, buffers are carefully employed for data sharing between weighted layers. As a result, the processing of layers from multiple data can be parallelized. The inter-layer pipeline is a model-parallel scheme.

    5.2. HMC-based DNN accelerators

An HMC vertically integrates DRAM dies and the logic die. The high memory capacity, high memory bandwidth, and low latency offered by an HMC enable near-data processing. In an HMC-based accelerator design, computation and logic units are placed on the logic die, and the DRAM dies in a vault are used for data storage. Neurocube [43] and Tetris [44] are two DNN accelerators based on an HMC.

A whole accelerator in Neurocube has one HMC with 16 vaults [43]. As shown in Fig. 12, each vault can be viewed as a subsystem, which consists of a PE to perform multiply-accumulate (MAC) operations and a router for packet transfer between the logic die and the DRAM dies. Each vault can send data to a destination vault through its router, which enables out-of-order data arrival. For each PE, if the buffer (16 entries) is filled with data, the computation starts.

Tetris [44] also employs 16 PEs in one HMC, but it uses a spatial mesh to connect the PEs. Tetris proposes a bypassing ordering scheme, which is similar to the data-stationary schemes discussed in Refs. [15,36], to improve data reuse. To minimize remote data access, Tetris explores the partitioning of input and output feature maps.

    6. Accelerators for emerging applications

Fig. 11. Intra-tile pipeline in ISAAC [25]. Rd: read; IR: input register; Xbar: crossbar; S+A: shift and add; OR: output register; Wr: write; σ: sigmoid unit; IMA: the ReRAM-based in situ multiply accumulate unit.

Fig. 12. Neurocube architecture and PE design [43]. VC: vault controller; PNG: programmable neurosequence generator; TSV: through-silicon via; A, B, C: input operands; Y: output operand; R: register; μ: microcontroller.

The efficiency of DNN accelerators can also be improved by applying efficient NN structures. NN pruning, for example, makes the model small yet sparse, thus reducing off-chip memory accesses. NN quantization allows the model to operate in a low-precision mode, thus reducing the required storage capacity and computational cost. Emerging applications, such as the GAN and the RNN, raise special requirements for dedicated accelerator designs. This section discusses accelerator designs for the sparse NN, the low-precision NN, the GAN, and the RNN.

    6.1. Sparse neural network

Previous work on dense-sparse-dense (DSD) training [46] has shown that a large proportion of NN connections can be pruned to zero with minimal or no accuracy loss. Many corresponding computing architectures have also been proposed. For example, EIE [47] and Cnvlutin [48] respectively target accelerating the computations of NN models with sparse weight matrices and sparse feature maps. However, the special data format and the extra encoder/decoder adopted in these designs introduce additional hardware cost. Some algorithm works discuss how to design NN models in a hardware-friendly way, such as by using block sparsity, as shown in Fig. 13 [45]. Techniques that can handle irregular memory access and an unbalanced workload in sparse NNs have also been proposed. For example, Cambricon-X [49] and Cambricon-S [50] address the memory access irregularity in sparse NNs through a cooperative software/hardware approach. ReCom [51] proposes a ReRAM-based sparse NN accelerator based on structural weight/activation compression.

    6.2. Low precision neural network

Reducing data precision, also known as quantization, is another viable way to improve the computing efficiency of DNN accelerators. Recent TensorRT results [52] show that widely used NN models, including AlexNet, VGG, and ResNet, can be quantized to 8 bits without inference accuracy loss. However, it is difficult for such a unified quantization strategy to retain the network's accuracy when even lower precision is adopted. Many complex quantization schemes have therefore been proposed; however, they significantly increase the hardware overhead of the quantization encoder/decoder and the workload scheduler in the accelerator design. As shown in the following discussion, a "sweet point" exists between data precision and overall system efficiency under various optimizations.

    (1) The weight and feature map are quantized into different precisions to achieve lower inference accuracy loss. This may change the original dataflow and affect the accelerator architecture, especially the scratchpad memory.

(2) Different layers or different data may require different quantization strategies. In general, the first and the last layer of the NN require higher precision. This fact increases the design complexity of the quantization encoder/decoder and the workload scheduler.

(3) New quantization schemes have been proposed by observing the characteristics of the data distribution. For example, an outlier-aware accelerator [53] performs dense and low-precision computations for the majority of data (weights and activations) while efficiently handling a small number of sparse and high-precision outliers.

(4) New data formats have been proposed to better represent low-precision data. For example, the compensated DNN [54] introduces a new fixed-point representation: fixed point with error compensation (FPEC). This representation has two parts: ① computation bits, which are in the conventional fixed-point format; and ② compensation bits, which represent the quantization error. This work also proposes a low-overhead sparse compensation scheme to estimate the error in the MAC design.
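As a concrete reference for the basic case discussed at the start of this subsection, the sketch below performs simple symmetric, per-tensor 8-bit quantization (our illustration only; production schemes are typically per-channel, asymmetric, or mixed-precision, as the list above notes):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map the largest magnitude to 127."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Usage: quantize a weight vector and inspect the worst-case rounding error.
w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
print("max quantization error:", np.max(np.abs(dequantize(q, s) - w)))
```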

    6.3. Generative adversarial network

Compared with the original DNN/CNN, the GAN consists of two NNs: namely, the generator and the discriminator. The generator learns to produce fake data, which is supplied to the discriminator, and the discriminator learns to distinguish the generated fake data. The goal is to have the generator generate fake data that eventually cannot be differentiated by the discriminator. These two NNs are trained iteratively and compete with each other in a minimax game. The GAN's operations involve a new operator, called transposed convolution (also known as deconvolution or fractionally strided convolution). Compared with the original convolution, transposed convolution performs up-sampling, with many zeros inserted into the feature maps. Redundant computation will be introduced if the transposed convolution is mapped straightforwardly. Techniques are also needed to deal with nonstructural memory access and irregular data layout if the zeros are bypassed. In summary, compared with the stand-alone DNN/CNN inference accelerators discussed in Section 4, GAN accelerators must ① support training, ② accommodate transposed convolution, and ③ optimize the nonstructural data access.
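The source of the redundant computation can be seen in the following sketch of transposed convolution by explicit zero insertion (our illustration, assuming a single-channel input, stride 2, and full padding): after up-sampling, most multiplications in the subsequent ordinary convolution hit inserted zeros.

```python
import numpy as np

def transposed_conv2d(inp, filt, stride=2):
    H, W_ = inp.shape
    Fh, Fw = filt.shape
    # Insert (stride - 1) zeros between neighboring input elements ("up-sampling").
    up = np.zeros(((H - 1) * stride + 1, (W_ - 1) * stride + 1))
    up[::stride, ::stride] = inp
    # Pad and run an ordinary convolution; many multiplications hit inserted zeros.
    up = np.pad(up, ((Fh - 1, Fh - 1), (Fw - 1, Fw - 1)))
    Oh, Ow = up.shape[0] - Fh + 1, up.shape[1] - Fw + 1
    out = np.zeros((Oh, Ow))
    for x in range(Oh):
        for y in range(Ow):
            out[x, y] = np.sum(up[x:x + Fh, y:y + Fw] * filt[::-1, ::-1])
    return out
```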

Fig. 13. Structural sparsity: filter-wise, channel-wise, shape-wise, and depth-wise sparsity, as shown in Ref. [45]. W: DNN weights; l: the index of the weight tensor; cl: output channel length; nl: input channel length; ml: kernel height; kl: kernel width; ":" is a placeholder.

Fig. 14. Computation sharing pipeline in ReGAN [23]. T: logical cycle; δD: error map of the discriminator; δG: error map of the generator; ∂WD: partial derivative of the weights in the discriminator; ∂WG: partial derivative of the weights in the generator; ∂bD: partial derivative of the bias in the discriminator; ∂bG: partial derivative of the bias in the generator; IP: input project layer; FCNN: fractional-strided convolution layer.

ReGAN proposes a ReRAM-based PIM GAN architecture [23]. As shown in Fig. 14, a dedicated pipeline is designed for layer-wise computation in order to increase the system throughput. Two techniques, called spatial parallelism and computation sharing, are proposed to further improve training efficiency. LerGAN [55] proposes a zero-free data-reshaping scheme to remove the zero-related computation in a ReRAM-based PIM GAN architecture. A reconfigurable interconnection scheme is also proposed to reduce the data transfer overhead.

For a CMOS-based GAN accelerator, previous work [56] proposed efficient dataflow for different steps in a GAN; that is, zero-free output stationary (ZFOST) for feed-forward/backward propagation, and zero-free weight stationary (ZFWST) for the weight update. GANAX [57] proposed a unified SIMD-multiple instruction multiple data (MIMD) accelerator to maximize the efficiency of both the generator and the discriminator: The SIMD-MIMD mode is used in selective executions due to the zero insertion in the generator, while the pure SIMD mode is used to operate the conventional CNN in the discriminator.

    6.4. Recurrent neural network

The RNN has many variants, including the gated recurrent unit (GRU) and LSTM. The recurrent property of the RNN leads to complicated data dependencies, in comparison with the conventional DNN/CNN.

ESE [58] demonstrated an accelerator dedicated to sparse LSTM. Load-balance-aware pruning is proposed to ensure high hardware utilization. A scheduler is designed to encode and partition the compressed model across multiple PEs for parallelism and to schedule the LSTM dataflow. DNPU [59] presented an 8.1 TOPS/W reconfigurable CNN-RNN system-on-chip (SoC). DeltaRNN [60] leveraged the RNN delta network update approach to reduce memory accesses: The output of a neuron is updated only when the neuron's activation changes by more than a delta threshold.
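A minimal sketch of the delta-network idea is shown below (our simplification of the mechanism, not DeltaRNN's actual implementation): an input's contribution is recomputed only when it has changed by more than a threshold since it was last applied, so the corresponding weight columns need not be fetched for small changes.

```python
import numpy as np

def delta_step(x_t, x_prev, state, W, threshold=0.1):
    """One delta-network update: skip inputs whose change is below the threshold."""
    delta = x_t - x_prev
    active = np.abs(delta) > threshold        # only these inputs trigger weight fetches
    delta = np.where(active, delta, 0.0)
    state = state + W @ delta                 # update the pre-activation with sparse deltas
    x_kept = np.where(active, x_t, x_prev)    # remember the values actually applied
    return state, x_kept
```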

    7. The future of DNN accelerators

In this section, we share our perspectives about the future of DNN accelerators. We discuss three possible future trends: ① DNN training and accelerator arrays, ② ReRAM-based PIM accelerators, and ③ DNN accelerators on edge devices.

    7.1. DNN training and accelerator arrays

Currently, almost all DNN accelerator architectures focus on optimizing DNN inference within the accelerator itself, and very few consider support for training [22]. The presumption is that the DNN model has already been trained before deployment. Following the increase in the size of training datasets and NNs, however, a single accelerator is no longer capable of supporting the training of a large DNN. It is therefore inevitably necessary to deploy an array of accelerators, or multiple accelerators, for the training of DNNs.

A hybrid parallel structure for DNN training on an accelerator array is proposed in Ref. [61]. The communication between the accelerators dominates the time and energy consumption of DNN training on an accelerator array. Ref. [61] proposes a communication model to identify where the data communication is generated and how large the traffic is. Based on this communication model, layer-wise parallelism is optimized to minimize the total communication and improve the system performance and energy efficiency.

    7.2. ReRAM-based PIM accelerator for DNNs

Current ReRAM-based accelerators, such as those described in Refs. [21,22,25,62], assume ideal memristor cells. However, realistic challenges such as process variations [63,64], circuit noise [65,66], retention issues, and endurance issues [67-69] greatly hinder the realization of ReRAM-based accelerators. There are also very few silicon proofs of ReRAM-based accelerators and advanced architectures such as PIM, except for that provided in Ref. [70]. In practical ReRAM-based DNN accelerator designs, these non-ideal factors must be taken into consideration.

    7.3. DNN accelerators on edge devices

    In edge-cloud DNN applications, the computational and memory-intensive parts (e.g., training) are often offloaded onto the powerful GPUs in the cloud, and only certain light inference models are deployed on edge devices (e.g., the IoT or mobile devices).

Following the rapid increase in data acquisition scale, it has become desirable to have intelligent edge devices that are capable of adaptively learning or fine-tuning their DNN models for certain tasks. For example, in wearable applications that monitor the health of users, it will be helpful to adapt the CNN models locally rather than sending the sensed health data back to the cloud, due to the significant data communication overhead and privacy issues. In other applications, such as robots, drones, and autonomous vehicles, statically trained models cannot efficiently handle the time-varying environmental conditions.

However, the long data-transmission latency of sending huge amounts of environmental data to the cloud for incremental training is often unacceptable. More importantly, many real-life scenarios require the real-time execution of multiple tasks and dynamic adaptation capability [58]. Nevertheless, it is extremely challenging to perform learning on edge devices because of their stringent computing resources and tight power budgets. RedEye [71] is an accelerator for DNN processing on the edge, where the computation is integrated with sensing. Designing lightweight, real-time, and energy-efficient architectures for DNNs on the edge is an important research direction to pursue next.

    Acknowledgements

This work was supported in part by the US National Science Foundation (NSF) (1822085, 1725456, 1816833, 1500848, 1719160, and 1725447), the NSF Computing and Communication Foundations (1740352), the Nanoelectronics COmputing REsearch Program in the Semiconductor Research Corporation (NC-2766-A), and the Center for Research in Intelligent Storage and Processing-in-Memory, one of six centers in the Joint University Microelectronics Program, an SRC program sponsored by the Defense Advanced Research Projects Agency.

    Compliance with ethics guidelines

    Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, and Tianqi Tang declare that they have no conflicts of interest or financial conflicts to disclose.
