• <tr id="yyy80"></tr>
  • <sup id="yyy80"></sup>
  • <tfoot id="yyy80"><noscript id="yyy80"></noscript></tfoot>
  • 99热精品在线国产_美女午夜性视频免费_国产精品国产高清国产av_av欧美777_自拍偷自拍亚洲精品老妇_亚洲熟女精品中文字幕_www日本黄色视频网_国产精品野战在线观看 ?

    PRF: a process-RAM-feedback performance model to revealbottlenecks and propose optimizations①

    2020-10-09 07:38:18XieZhenTanGuangmingLiuWeifengSunNinghui
    High Technology Letters 2020年3期

    Xie Zhen(謝 震)②, Tan Guangming, Liu Weifeng , Sun Ninghui

    (*Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, P.R.China) (* *University of Chinese Academy of Sciences, Beijing 100190, P.R.China) (* * *College of Information Science and Engineering, China University of Petroleum, Beijing 102249, P.R.China)

    Abstract

    Key words: performance model, feedback optimization, convolution, sparse matrix-vector multiplication, sn-sweep

    0 Introduction

    In recent decades, modeling has been employed to optimize performance of computational kernel and verify the validity of proposed optimization methods[1-4]. Depending on whether the model uses hardware and software features as a reference, it can be roughly divided into 2 categories. One is ‘black box’ model that uses a fitting or machine learning method to predict the performance by extracting characteristics of the target machine and collecting data of application. This method collects a large amount of application data to build a model by statistics or mathematical model. The model is not universal and does not reflect the real implementation inside architecture. The other is ‘white box’ model, which uses a simplified machine model to describe the matching relationship between application and hardware. The simplest white-box model was the Roofline model proposed by Ilic et al.[5], which can be used to bound floating-point performance with a function of machine peak performance, peak bandwidth and arithmetic intensity. However, the Roofline model cannot describe detailed bottlenecks beyond memory bandwidth and peak performance. It only shows the upper limits of the performance. A more detailed machine model with memory hierarchy was the execution-cache-memory (ECM) model proposed by Datta et al.[6]. Based on ECM model, stencil[7]loop kernel[8]and microbenchmark[9,10]have been modeled and optimized. It divides the machine into in-core and out-core phase, and reflects the time of instruction execution on processor core and data transfer between cache and memory by an address-based cache simulator (pycachesim) or proportion analysis by Kerncraft[11]. But mapping real application to address sequence requires a medium and none of these tools have yet been provided. Also this model ignores the impact for possible occurrence of data dependence and the differences between regular and irregular memory access. So it cannot accurately predict performance when the application has pipeline stall and stride or random memory access. Therefore, it cannot give specific feedback based on the incomplete information.Moreover, for specific applications, optimization methods are very different with different instructions and data sizes, the existing models cannot give fine-grained optimization guidance. In addition, there have been many work based on hardware performance counters[12-15]. Although these methods can also obtain the load characteristics by the number of monitored events, they need to monitor at runtime with unacceptable overhead.

    This paper proposes a new performance model called process-RAM-feedback (PRF). It deeply considers the diversities of the pipeline and the cache layers as a ‘white box’ model. So that it can predict the instruction overhead and the cache misses for regular, irregular and data dependence applications. In this sense, the proposed model largely broadens the application domain of performance modeling. More importantly, the proposed model can give feedback guidance for the optimization, which is not available in existing performance models. In order to realize this functionality, the proposed model consists of 3 fundamental steps. Firstly, the proposed model abstracts the important hardware parameters, such as the calculation unit, access unit, instruction cost, the associativity and the size of each memory level (cache and DRAM). Secondly, based on these parameters, this paper constructs a directed acyclic graph (DAG) to predict instruction overhead and a cache simulator to predict the cache miss as the overhead of data transmission. Lastly, the proposed model can use the obtained information in the previous 2 steps, reveal the bottlenecks, compare the effects of different optimization methods, and then provide optimization guidances. In this way, the optimized kernels are obtained. This work’s main contribution is reflected in 3 aspects.

    1) In the proposed PRF model, this paper considers comprehensive factors which may affect the kernel performance, e.g., the probability of instruction pipeline stall and cache line transmission between different levels of RAM, and the execution time of instructions for both calculation and memory access. Based on the theoretical and model-based analysis for specific cases, the proposed model can produce a much more accurate performance prediction and extend the application domain to the context of irregular memory access and data dependency issues.

    2) In order to get the number of cache misses at all cache layers for irregular case, this paper develops a multi-level cache simulator which can be easily built on the target hardware and quickly get the needed information for the specific kernels. As an indicator of data transmission, improvement of prediction accuracy for cache miss is a crucial factor which affects the accuracy of performance prediction in the proposed model.

    3) According to the different performance bottlenecks, the proposed model feeds back developers to select the best optimization method and informs performance expectation with various data inputs, instructions support and data sizes.

    From regular memory access to irregular and data dependence, this paper selects 3 cases (convolution, sparse matrix-vector multiplication and sn-sweep) to verify the correctness of the model. The experiments show that, for modeling the convolution algorithm as the regular case, the PRF model could feed back the best optimization suggestion for various data sizes under current instruction support.For the irregular memory access, by running SpMV with 3 673 sparse matrices from the Florida collection[16], the average error rate: ABS(predicted_value-measured_value) / measured_value of PRF model is about quarter than the ECM model. Further, for sparse matrices, feedback optimization guides developer to select the parameter (i.e., block size) for the block compressed sparse row (CSR) sparse matrix format by using the output of the PRF model and it highly matches the best block size manually selected from exhaustive experiments. As for sn-sweep, PRF model constructs the DAG through its calculation process, and analyzes the data dependency relationship between instructions to reveal the problem of pipeline with out-of-order execution support, and finally guides to solve data dependence issue and its performance improvement is about 192% in single-core pipeline with linear scalability for multi-core.

    1 Related work

    Performance modeling is a very useful technique for optimization of parallel applications on high performance computing (HPC) platforms. All of the current architecture can be divided into the instruction part and the memory part. As shown in Fig.1, it results different abstract methods can build various performance models. Next this paper begins to introduce the differences between 4 performance models.

    The Roofline[5]model is a visual analytical model used to pinpoint performance bottlenecks. For the instruction part, the Roofline model abstracts it as a black box, this model describes the peak computing performance as the upper limit of the instruction part, which constraints the best performance of which program can achieve. Same as before, the Roofline model also abstracts memory part as the peak memory bandwidth. It can also be applied to any program. However, the model gives little detail information, and lacks accuracy of performance prediction for target application.

    Fig.1 PRF model and comparison with Roofline, cache-aware Roofline and ECM

    As a refined model on Roofline, cache-aware Roofline[17]joins the implications of the cache hierarchy, and the transmission bandwidths of cache are used as boundaries. But this model has the same disadvantages as Roofline by low accuracy.

    The ECM[6]model considers the time for executing the instructions with data coming from the L1 cache as well as the time for moving the required cache lines (CLs) through the cache hierarchy. It also calculates the time for executing instructions of loop kernel on the processor core (assuming no cache miss) and transferring data between its initial location and the L1 cache. The in-core execution time T_core is determined by the unit that takes the most cycles to execute the instructions. The time needed for all data transfers required to execute one work unit is expressed as T_trans. The in-core execution and transfer costs must be put together to arrive at a prediction of single-thread execution time. By simulating execution of CPU instruction, the model has a good prediction accuracy for regular memory access pattern. But ECM could predict performance in the memory part of a sort, it divides memory part into an undifferentiated cache layer and main memory, making it hard to model data locality.

    Therefore, for irregular application, these 3 models are difficult to give accurate prediction, and the error mainly occurs in the memory access phase, and there is a considerable gap between the size of data transfer and the data reuse of irregular application by these model predicts and real situation. Also the ECM model ignores the effect of pipeline stall. This is also the main focus of the proposed model. Here this paper realizes a pipeline DAG with more realistic memory hierarchy, and designs a cache simulator to output approximately real cache miss for irregular case. With the abstract method, PRF model can achieve better predictive accuracy, while also covering more irregular and data dependence cases. The comparison of all models is shown in Fig.1.

    Cache simulation has been used to evaluate memory systems for decades. Fei et al.[18]examined the topic for multiprocessor memory-systems. Liu et al.[19]investigated an x86 cache simulation framework in the specific context of multiprocessor systems. Llatser et al.[20]discussed many intersection properties for caches using different replacement policies, and Dey et al.[21]studied coherency protocols between different caches in a shared memory system. These simulators are constructed by configurations, such as cache size, sets, cache associativity, block size, replacement policy according to a specific application. However, for program optimization, these cache or memory simulators are not efficient enough since the cost for lots of hardware emulation functions. Thus applying them in real-world applications will lead to new performance bottlenecks. This paper designs a light weight cache simulator that can simulate all levels of cache miss with a very low overhead for all memory access patterns.

    As for regular memory access pattern, convolution operation in convolution neural networks is the most time-consuming function. Brechet et al.[22,23]used the vectorization instruction or designed the unique hardware structure to speed up the calculation, but this optimization is case by case, since the effective optimization method for different data sizes and machine structures is quite different, so the proposed model is to find a set of methods to optimize the regular memory access applications in different architectures and guides the developer to obtain higher performance.

    Sparse matrix-vector (SpMV,y=Ax) multiplication is an important computational kernel with an irregular memory access patterns. The SpMV operation does not contain any dependencies and has relatively high parallelism. Buttari et al.[24]used a modeling method to optimize the SpMV by blocking. But this model fits machine characteristics and predicts performance by fill-in the dense matrix to achieve the best block sizes. Li et al.[25]used register-level tiling opportunities to select parameters for BCSR by model. However, these parameters tuning are time-consuming and none of these methods can reflect real execution on the current architecture of SpMV. Then this paper quantifies instruction and memory transmission for various parameters, and selects the best block size and predicts optimized performance.

    Sn-sweep is a typical regular application with very low performance as data dependence. It is the largest time-consuming function for numerical simulation of radiation transport in high energy density plasma physics[26]. As sn-sweep can be seen sweep the radiation flux from the source across the grid in the downstream direction, so grid decomposition method makes task parallelization easier. Liu et al.[27]described message passing implementations of sn-sweep algorithms within a radiation transport package to increase expandability. The proposed model focuses on bottlenecks within a task and exposes the cause of the instruction pipeline stall to guide optimization.

    2 The PRF performance model

    The PRF divides model into 3 parts: process phase, RAM phase and feedback optimization phase. The process phase corresponds to the instruction part, which predicts the cost cycles within the CPU core by executing internal instruction with pipeline, while the RAM phase corresponds to the memory part, which describes the connected relationship and predicts the time between memory hierarchy. The feedback optimization phase collects the output of each phase, analyzes execution bottlenecks and potential opportunities, and provides developer optimization guidance. In Fig.2, this paper visualizes these 3 phases, and sets the time of process phase when all instructions and datas are in the L1 cache, the time of RAM phase is the transfer time between the memory hierarchies. The outputs of model are evaluated, analyzed and directed by feedback optimization phase.

    Fig.2 Three components and workflow of the PRF model

    For process phase, this paper first divides the instruction into computation instruction and memory access instruction, and builds a DAG to describe data dependencies. Through the implementation of the pipeline, it outputs the overhead for 2 kinds of instructions. Then this paper labels the cost time of calculation and memory access instruction as T1 and T2. For the RAM phase, due to the principle of locality, different inputs and applications will result in different cache miss, so memory access instruction does not definitely cause datas transfer (if datas are in cache). Also this paper founds that the current Intel processors have a prefetching mechanism for the regular memory access, resulting in almost no cache miss at some cache levels. So this paper separates the regular and irregular data, and the time of transmission for regular data can be calculated by format which will be described in detail later, and also builds a multi-level cache simulator for irregular data, then simulator can output the number of cache misses at all levels. Therefore this paper can obtain the time of all the data transfer and label the transmission time of 3-level cache as T3, T4, and T5. The feedback optimization phase abstracts 4 performance bottlenecks, uses the output of the model to point out the key factors which affects performance most, and guides developer to improve performance which optimization can bring.

    2.1 Process phase

    Instruction is mainly divided into 2 types: computation (addition and multiplication) and memory access (load and store) instruction. The 2 kinds of instructions are scheduled independently in core internal and can properly implement instruction-level parallelism without data dependence. Although current processor has out-of-order execution mechanism, when data dependency occurs, the instruction which depends on the previous results will also cause performance stall[28].

    In order to model the pipeline, this paper designs a DAG module, which is constructed as shown in the Fig.3. It uses a blank circle representing read without write, a solid line with an arrow representing the direction and load a data, and the grid circle representing read after write. By analyzing the code of an application to build the DAG, execution flow of the DAG shows pipeline analysis of data dependence. For example, on a 3-stage pipeline processor, R3 occurs data dependence for read after write, resulting in one stall for data dependence.

    Fig.3 DAG for modeling data dependence

    As an example, this paper builds the PRF model on an Intel Haswell architecture. This microarchitecture can execute one addition and one multiplication operation or 2 fused multiply add (FMA) operations per cycle, while supporting 2 load operations and one store operation per cycle. All details will be introduced in the Section 3. So if the numbers of addition, multiplication, FMA, load and store instruction are A, M, FMA, L and S respectively for an application, then the time spent on process phase can be calculated by Eq.(1).

    T(processphase)

    (1)

    2.2 RAM phase

    RAM phase converts cost of data transmission into overhead between each level of cache and main memory. As the size of each level of memory hierarchies are very different for various architectures, data may appear in any of them, resulting in complicated data transfer time. When a CPU requests memory access, it will seek a data block as cache line from L1, L2 and L3 caches to main memory in turn. If the former search is missing, this cache line will transfer from the lower level of memory hierarchy to the upper level. At the same time, the current cache has a perfect prefetching mechanism for the regular memory access, resulting in a significant reduction in transmission time.

    Therefore, as in Fig.2, RAM phase model partitions memory accesses into regular and irregular memory access. For regular memory access, it is to access continuous memory address or cache line, or visit fixed stride memory address. For the Haswell core this paper used to model, the L1 cache cannot implement prefetching operation as the first level of cache by lots of experiments. Then for other cache levels, when data is read from L2 cache to L1 cache, the adjacent data which will be accessed is transferred from L3 cache to L2 cache, and since the transmission bandwidth of L3 to L2 is half than L2 to L1, it leads to a half overlap. At the same time, other adjacent data which will be accessed is transferred from the main memory to L3 cache, and the bandwidth of main memory to L3 is lower than the cache, so there will be a delay by the differences.

    Assuming that the internal bandwidth of cache which data appear isC, the bandwidth between the main memory and the last level cache isM, and the regular data size to be accessed isAmount_regular, the time spent on regular memory access pattern is given in Eq.(2).

    T(RAM_phase_for_regular)

    (2)

    For irregular memory access, cache prefetcher cannot work well when accessing disordered address, the data which needs to be accessed may appear on any of the memory hierarchy, so loading these irregular data will produce unexpected cache misses. In order to get these cache misses, this paper designs a cache simulator, which could simulate cache groups and cache lines, and use index number (similar to the address) to distinguish each cache line. Then the simulator builds memory access sequence based on the user’s input data and memory access process, and finally outputs cache misses. It also simulates the replacement mechanism.

    Unfortunately, the Intel smart cache replacement policy is confidential, so this paper carries out a large number of experiments and obtains some conclusions: L1 cache uses least recently used (LRU) policy, L2 and L3 are based on the LRU policy with relatively perfect prefetching operation for regular memory access.

    Therefore, the process of building the cache simulator can be described as follows. (1) Detect the size and group associative of the cache at all levels, and assign corresponding tags which could mark block number and valid bit. (2) Build the mapping relationship at all levels. (3) Set its replacement strategy.

    The internal work-flow of the cache simulator is as follows. (1) Detect current physical machine and build the cache simulator. (2) Partition the input data which will be accessed to corresponding cache lines and mark it as regular or irregular. (3) The cache simulator reads cache lines by marking the cache block valid nor not, prefetchs regular cache lines and records the miss numbers. (4) Finally, through the known data transfer rate, the cache simulator will output cache miss and cost at any of the memory hierarchy. If the 3 cache misses is L1_miss, L2_miss and L3_miss respectively and the size of cache line is CL, the time spent is given in Eq.(3).

    T(RAM_phase_for_irregular)

    (3)

    The sum of process part and RAM part shows the overall completion time of the program. It can be calculated by Eq.(4). And giga floating-point operations per second (GFLOPS) can be calculated by Eq.(5).

    T(PRF)=T(process_phase+RAM_phase-overlap)

    +T(RAM_phase))

    +T(RAM_for_regular)

    +T(RAM_for_irregular))

    (4)

    GFLOPS(PRF)

    =(A+M)×CPU_Frequency/T(PRF)

    (5)

    2.3 Feedback optimization phase

    According to Fig.2 and Eq.(4), this paper splits the time for each stage from T1 to T5.As shown in Fig.4, the abscissa indicates the data size, and the ordinate indicates the GFLOPS. The line T1 represents floating-point performance without memory access time, and the line (T2+T3+T4+T5) represents floating-point performance without calculation time. So the final execution time is the minimum value of line T1 and line (T2+T3+T4+T5). This paper enumerates 4 different possibilities and shows how to discover the bottleneck and optimized code.

    Fig.4 Four different optimization opportunities

    As shown in Fig.4(a), it can be observed the line T1 cannot reach scalar peak performance without data transmission, so feedback optimization guide developer to choice intermediate variable to reduce the data dependence for addition and multiplication operation. Then the arrow represents the reduction of T1 and improvement of GFLOPS by decreasing the time of calculation instruction. Through the changes of feedback optimization, the line with cycle performance line can be upgraded to the line with triangle performance line with the data size increasing. As same for Fig.4(b), the line T1 cannot reach floating point peak performance without data transmission, so feedback optimization guide developer to choice the single instruction multiple data (SIMD) to increase calculation speed. If T2 has spent much more time and become a bottleneck, it presents the data access instruction that has become worthwhile optimization. Feedback optimization informs developer to redesign data structure or increase the redundant space to reduce the time of memory access instruction, and the optimization effect is shown in the Fig.4(c). If this paper finds that the memory access part is bottleneck, then feedback optimization will warn developer to increase cache utilization and reduce the transmission time by using new data format or cache-based block design in the Fig.4(d).

    Next, this paper will use 3 cases to implement the whole process of modeling.

    3 Experimental testbed

    This paper uses an Intel Xeon E5-2680 v3 processor[29]with the SSE, AVX and AVX2 support for validating the proposed model. Each core can execute one multiply and one addition in floating point or 2 FMA instructions per cycle without data dependency. The memory hierarchy consists of 3 on-chip SRAM data caches. For the scalar instruction[30], the memory read and write instruction (load and store) is only one kind, and the data is always aligned. For the vector instruction, memory access instruction is divided into load, loadu, store and storeu by respectively accessing aligned and unaligned data, and the time spend on unaligned data is twice as large as for aligned data. All the specifications are listed in Table 1.

    Table 1 Special machine parameters

    4 Validation

    4.1 Convolution

    4.1.1 Convolution operation

    Convolution operations have been widely used in denoising, extraction, structure smoothing, filtering, detection, image enhancement and many other image processing applications. The information of images is encoded in the spatial domain rather than the frequency domain, thus the image convolution operation is extremely essential and useful in image processing. For real-world applications, 32 bit floating point is usually selected for rapid training. For 1D convolution, the convolution filter is a 1-dimension structure, as horizontal filter with size of 1×N. 1D convolution operation simply rotates the convolution kernel 180 degrees before multiplying the input data. Fig.5 illustrates anm×ninput data convolved with a 1×16 kernel size and its DAG. Each pixel in the window is multiplied by their corresponding kernel coefficients and finally generate the whole output data.

    Fig.5 DAG of 1D convolution

    4.1.2 Performance prediction for naive code

    From the proposed example, it can be seen that the 1D convolution requires reading 16 kernel data and 16 input data to perform 16 addition and multiplication calculations. This paper can easily find out data is reuse access, along one dimension for 1D convolution. The next iteration operation will reuse 15 elements from the previous iterations.

    For this program, the kernel array has good locality, and it can always be stored in L1 cache. Reading a cache line of in and out data separately will update 16 values of out, and contains both 16×16 addition and multiplication instructions, and therefore it also produces 16×16 load and 16 store instructions. For the current architecture, each cycle can execute an add and mul instruction, so the cost cycle of all of those addition and multiplication instructions is 16×16=256, and a cycle can perform two load instructions and one store instruction. So the cost cycle of access instruction is 256/2=128, For L2->L1 and L3->L2, a cycle can transfer one cache line and half of a cache line respectively, namely data transmission requires 1 or 2 cycles for a cache line, For MEM->L3, a cycle can transmit 1/5 cache line, finally this paper can infer the GFLOPS performance in different data sizes which are shown in Table 2, for example, the GFLOSP=(16×16×2)/(max(256,128+2/4/10)/2.7) Gflops.

    Table 2 GFLOPS performance in different data sizes for naive code, and comparison

    From the performance analysis given above, it can be seen that, regardless of the data at any memory hierarchy, the largest cost of a 1D convolution is the computational instruction (T1). Next, this paper will introduce different optimization methods, and analyze the optimized performance.

    4.1.3 Optimization method and modeling analysis

    By comparing the overhead of each phase, this paper can see that the T1 becomes a bottleneck. From the Fig.4, the performance reaches the scalar peak which means there is no data dependence in the calculation process, so the optimal GFLOPS can be achieved when using SIMD by feedback optimization phase in Fig.4(b). For the current machine platform, it supports SSE for 128 bit and AVX2 for 256 bit. So this paper will implement optimized version for AVX2 instruction with unaligned data and give the performance prediction. The pseudo-code is given in Algorithm 1.

    Algorithm 1 AVX2 unroll unalignedInput: IN[], length, KERNEL[], kernellength;Output: OUT[]Input: IN[], length, KERNEL[], kernellength;Output: OUT[]1: m256 kernelreverse[kernellength](aligned);2: for i=0; i

    For 16 iterations of the program, it will produce 16×2 addition and 16×2 multiplication instructions. It also produces 16×2 loadu and 2 storeu instructions and 7 additional addition operations. Therefore, the cost cycle of all addition and multiplication instructions is 16×2+7=39. By using AVX2 instruction and data is unaligned, the cost cycles of access instruction is 16×2×2/2=32. The GFLOPS performance is given in Table 3.

    Table 3 GFLOPS performance in different data sizes for naive code, and comparison

    4.1.4 Optimization guidance

    After the previous experiments, this paper finds that the best optimization scheme depends on the supported instruction set and size of the image data set. From Fig.6, this paper extends to predict performance from SSE to AVX2 with aligned memory access or not for different data sizes. Then this paper can clearly observe that the SSE instruction set can achieve good scalability. When using the AVX2 instruction set, data transmission begins to affect performance with the increase of data size, then there is no way to achieve the peak of floating-point calculation, therefore, using AVX2 instruction set cannot linearly improve performance. As aligned memory access needs addition data storage, the method is only efficient with a small amount of data size. So the feedback guideline recommend developer to use vector instruction with unaligned access and increase the memory transmission optimization with AVX2 support.

    Fig.6 Modeling and prediction of various instruction

    4.2 SpMV

    In this section, this paper will use 3 627 matrices in the Florida sparse matrix library[16]to verify the PRF model and compare with ECM model (as ECM releases Kerncraft and Pycachesim, so this paper could only use these 2 tools and the idea of the ECM paper as much as possible to model SpMV), find bottlenecks and propose optimization scheme. Finally, this paper chooses the optimal implementation with some randomly selected matrices.

    4.2.1 Test matrices

    The sparse matrices this paper used to model cover all the Florida sparse matrix library. And these matrices come from a wide variety of applications with different sparse distribution characteristics. Similar to earlier work on SpMV, the CSR kernel also use row decomposition, loop unrolling and software prefetching. Some randomly selected matrices are used in previous papers[31], and the bar matrix is extracted from a real case of the actual problem. In real-world applications, 64 bit floating point is usually selected for better precision.

    4.2.2 Performance prediction and bottlenecks

    The pseudo-code for CSR-based SpMV is given in the left of Fig.7. For process phase, this paper builds data dependence DAG on pseudo-code, and by using multiple registers, compiler can generate independence code as shown in the right of Fig.7. Then almost all instructions can execute pipeline.

    Fig.7 Pseudo-code and DAG for CSR-based SpMV

    For a sparse matrix, its row, column and nnz is R, C and NNZ, then the size of integer array A.row_ptr is (R+1)×4 Byte, the size of integer array A.col_index is NNZ×4 Byte, the size of double array A.value is NNZ×8 Byte, the size of the double array for vectorXis C×8 Byte. The red place in pseudo-code is multiplication and addition, respectively. There are NNZ ADD and NNZ MUL instructions, while the upper place indicates memory access, A.row_ptr needs (R+1) LOAD instructions, A.col_index needs NNZ LOAD instructions, A.value needs NNZ LOAD instructions, vectorXneeds NNZ LOAD instructions, the output vectorBneeds R STORE instructions, it has a total of NNZ ADD, NNZ MUL, 3×NNZ LOAD and (R+1)STORE instructions. Meanwhile, all instructions can execute pipeline. By Eq.(1), the time spent on calculation units is NNZ cycles and the time spent on memory access units is 1.5×NNZ cycles. The floating-point operations is NNZ(ADD)+NNZ(MUL)=2×NNZ.

    For RAM phase, the main uncertain time cost needs to model is varieties data transmission of vectorXfor kinds of sparse matrices. Then, this paper divides the data of matrix and vectors by the size of cache line and mark, as regular and irregular respectively. According to the RAM phase mentioned in Section 2, this paper builds a cache simulator by reading hardware parameters of the target machine, such as cache sizes and group associations. The simulator reads the marked data blocks in turn and records the cache misses. Therefore it simulates regular memory access of the CSR matrix with prefetching and out-of-order access of the vector.

    The GFLOPS of a special matrix is finally obtained by Eq.(5). As shown in Fig.8, the abscissa of all sub-graphs represents the degree of sparsity of the matrix, and it is sorted from small to large in order, and the ordinate of Fig.8(a) represents the error rate of simulated values to measured data for GFLOPS. When the real value is the same as the predicted value, the error rate is 0%, so the neighborhood of 0% represents that this model can produce fairly accuracy prediction. The ordinate of Fig.8(b), (c) and (d) represents the ratio of simulation to measured data for L1 miss, L2 miss and L3 miss and the rate of 1 represents perfect prediction. The measured L1, L2, and L3 cache misses used PAPI’s[32]statistics.

    Fig.8 Single core PRF model for SpMV on 3 627 sparse matrices

    Through the observations, for Fig.8(a), when the sparsity is less than 0.0002 or greater than 0.00045, about half of the points appears between 0.5 and 1.5 which is slightly better than ECM. When the sparsity is between 0.0002 and 0.00045, the predictions are significantly better than the ECM model, and the error rate of prediction is more than twice as small as the ECM. For sub-graph Fig.8(b), (c) and (d), it is found that most of the prediction data are concentrated near the rate of 1, which is obviously contrasted with ECM, and it is also the main reason why GFLOPS prediction accuracy is better than ECM. But there are several reasons for the predictions not to exactly agree with measurement. (1) The real cache replacement policy is unknown. (2) The actual throughput is not a constant. However, the proposed method can also achieve high prediction accuracy with Eq.(5) and cache simulator.

    4.2.3 Feedback performance optimization

    Analyzing the performance of SpMV, this paper found that the max time and the bottleneck is (T3+T4+T5). As the CSR format matrix results in a large number of repeated data transmission of slice vectorX, by changing the format of sparse matrix, this paper can increase locality of vectorXand reduce matrix index transmission, thus it can reduce the time of data transmission and increase the efficiency of floating-point operation.

    So this paper randomly selects 12 matrices and one engineering matrices-bar. Through modeling, this paper finds that the CSR format causes a high L1 and L2 cache miss, then T3 is about 1.8 times than T4 and T4 is about 3.9 times than T5, resulting in a large time of data transfer. So this paper thinks of ways to optimize (T3+T4).

    For many solutions for cache optimization, register blocking is a typical technique for improving data reuse. The sparse matrix is logically divided into blocks and those blocks usually contain at least one non-zero. SpMV computation proceeds block-by-block. For each block, this paper can reuse the corresponding elements of the vectorXby keeping them in registers to increase temporal locality. Register blocking uses the blocked variant of compressed sparse row storage format and it is also called BCSR for short. Blocks within the same block row are stored consecutively, and the elements of each block are stored consecutively in row-major order. BCSR potentially stores fewer column indices than CSR (one per block instead of one per non-zero). The effect is to reduce memory traffic by reducing index storage overhead and reusing the vector slice. Then the T3 and T4 and even T5 can be reduced. However, a uniform block size may require filling in explicit zero values, resulting in extra computations and data traffic. Based on the above principle, the feedback optimization is implemented based on BCSR format.

    Now, this paper applies the PRF model to the BCSR format. By reading the matrix to cache simulator, this paper can get the cache misses and increase zero elements calculation, and finial put forward the optimal block shapes, then the partitioning scheme is given. In Fig.9, this paper models BCSR format for selected 18

    Fig.9 Overall speedup of test matrices by optimization suggestion

    matrices, gives recommendation with an average speed-up of 173.79% and compares with hand tunning for BCSR. For the rail4284 matrix, modeling result find that the matrix does not have blocking characteristic, so traditional CSR can achieve better performance. Compared to the direct select optimal parameter by 64 SpMV time, the method greatly saves the overhead of selecting the optimal parameter by 12 SpMV time, so the model greatly improves the efficiency of feedback optimization.

    4.3 Sn-sweep

    4.3.1 Sn-sweep operation

    In particle transport simulations, radiation effects are often described by the discrete ordinates (sn) form of Boltzmann equation. In each ordinate direction, the solution is computed by sweeping the radiation flux across the grid. Sn-sweep operations have been widely used in radiation transport, radiation effect and many other high energy density plasma physics applications. For scanning algorithm, sn-sweep is an essential example of calculating the relationship between the influences of adjacent elements. The main process flow of sn-sweep is to sequentially calculate the influence of neighboring elements and update the surrounding elements. The left side of Fig.10 shows a core function of sn-sweep, it needs to multiply the left and upper elements by 2 weights and add the sum to the current element, then one iteration is to complete corresponding calculation of all the mesh.

    Fig.10 Pseudo-code and DAG for sn-sweep

    4.3.2 Performance prediction and bottleneck analysis

    In the process phase, the pseudo-code is transformed into a DAG, and it is calculated by the line order in each iteration. This paper can see that when calculating the next element, the result of the previous elements must be calculated. This will lead to the pipeline stall. Fortunately, the out-of-order engine component of the current processor can only mitigate the restrict of data dependency to some extent. For this case, the calculation of an element requires 4 additions, 3 multiplications, 5 loads and 1 store instruction. Since the pipeline needs to wait for the result of previous elements, it still takes 3 cycles to get an addition or multiplication result in the worst case. In RAM phase, all data access is regular and the data is generally small, it does not cause any data transmission, so RAM phase time is 0.

    By Eq.(5), the worst GFLOPS is (3+4)×2.7 / max((4×3),2.5) = 1.575 Gflops and actual measurement of 1.79 Gflops with out-of-order enable. Then this paper uses ECM to model sn-sweep by Kerncraft or Intel architecture code analyzer (IACA), it predict the GFLOPS is (3+4)×2.7 / max((4×1),2.5)=4.725 Gflops with the ideal instruction throughput. The reason is the data dependencies between instructions are stored in an index_i_up array which ECM fails to find and model it. Then from Table 4, this paper extends to predict performance with different data size from L1 to MEM by PRF and ECM model. The calculation of one point requires data transfer of 5th-8th of the cache line. This paper can clearly observe that the prediction accuracy of the PRF is better than the ECM when the amount of data is less than the L3 cache size. So this paper uses the modeled time to apply the feedback optimization phase, and it can be seen that the performance cannot reach the scalar peak as the T1 takes extra time, so the model feedback pipeline is the primary bottleneck which needs to be optimized by Fig.4(a).

    Table 4 GFLOPS performance in different data sizes for naive code,

    4.3.3 Optimization method and feedback

    By comparing the overhead of each phase, this paper can see that the computation instruction becomes a bottleneck, and the optimized GFLOPS can be achieved when the instruction pipeline is optimized: (3+4)×2.7 / max((4×1),2.5) = 4.725 Gflops. This paper found that there is no data dependence between the diagonal elements and that the calculation order does not affect the final result by analyzing the DAG. As shown in Fig.11, this paper improves the algorithm to optimize the instruction pipeline by using the diagonal calculation order, and rearranges partial instruction with adding some calculation of subscripts, then the optimized GFLOPS is 4.43 Gflops. Based on this optimization, this paper can continue model sn-sweep, analyze bottlenecks by feedback optimization and increase SIMD operation to boost performance. Then this paper accelerates the addition and multiplication instructions by AVX2, and achieves a certain degree of vectorization and reaches the optimal performance of 10.53 Gflops. Through the experiment, this paper finally converts a data dependence problem into memory access instruction bottleneck. At the same time, the inaccurate model prediction will affect the developer to select the corresponding optimization method, resulting in the selected method does not have any optimization effect. This discovery also shows that the bottleneck will change under different optimization methods, so this paper needs to model optimized kernel and explore more in-depth optimization. That is the meaning of feedback optimization.

    Fig.11 The computational order of eliminating data dependencies

    5 Conclusion

    The PRF model is described in this paper which provides an insightful perspective to predict compute performance and to generate targeted optimization guidance. This work introduces the PRF performance model, and described in detail about the process phase, RAM phase and feedback optimization phase. Then this paper applies the model to convolution, SpMV and sn-sweep which are typical representatives of regular to irregular memory access and data dependence. It can continue to cover more applications. Table 5 compares with Roofline model and ECM model, the proposed PRF model greatly improves predict accuracy for data dependence and irregular memory access by newly designed DAG and cache simulator, and achieves comprehensive feedback ability.

    Table 5 Comparison with 3 performance models from different perspectives

    The hook indicates the model has this function, the fork represents the model does not have this function and the bar represents

    incomplete function

    国产精品三级大全| 国产 一区精品| 汤姆久久久久久久影院中文字幕 | 久久久久久久久久黄片| 特级一级黄色大片| 欧美xxxx黑人xx丫x性爽| 91午夜精品亚洲一区二区三区| 国产午夜精品一二区理论片| 日韩在线高清观看一区二区三区| 神马国产精品三级电影在线观看| 亚洲精品国产成人久久av| 亚洲四区av| 精品久久久久久成人av| 亚洲性久久影院| 久久久久久九九精品二区国产| 国产成人免费观看mmmm| 插阴视频在线观看视频| 亚洲国产精品久久男人天堂| 少妇熟女aⅴ在线视频| 久久久亚洲精品成人影院| 亚洲精品色激情综合| videossex国产| 一级av片app| 又黄又爽又刺激的免费视频.| 国产一级毛片在线| 欧美色视频一区免费| 天天躁日日操中文字幕| 国产老妇女一区| 国产日韩欧美在线精品| 日韩 亚洲 欧美在线| 听说在线观看完整版免费高清| 国产成年人精品一区二区| 高清av免费在线| 高清av免费在线| 亚洲怡红院男人天堂| 欧美激情国产日韩精品一区| 欧美成人精品欧美一级黄| 欧美xxxx黑人xx丫x性爽| 99国产精品一区二区蜜桃av| 亚洲自偷自拍三级| 小蜜桃在线观看免费完整版高清| 亚洲五月天丁香| 国产三级中文精品| 久久久久久久久久成人| 日本三级黄在线观看| 国产日韩欧美在线精品| 伦精品一区二区三区| 亚洲真实伦在线观看| 亚洲最大成人av| 在线免费观看不下载黄p国产| 国产成人a区在线观看| 在线天堂最新版资源| 少妇被粗大猛烈的视频| 美女国产视频在线观看| av播播在线观看一区| 国产精品1区2区在线观看.| 女的被弄到高潮叫床怎么办| 国产亚洲一区二区精品| 精品一区二区三区视频在线| 嫩草影院入口| 色播亚洲综合网| 在线免费十八禁| 欧美一级a爱片免费观看看| 你懂的网址亚洲精品在线观看 | 少妇熟女欧美另类| 久久久久免费精品人妻一区二区| 深夜a级毛片| a级毛片免费高清观看在线播放| 丝袜美腿在线中文| 国产精品久久久久久av不卡| 六月丁香七月| 毛片女人毛片| 国产精品熟女久久久久浪| av在线蜜桃| 天堂中文最新版在线下载 | 三级经典国产精品| 白带黄色成豆腐渣| 欧美激情国产日韩精品一区| 人人妻人人澡人人爽人人夜夜 | 色噜噜av男人的天堂激情| 观看美女的网站| 亚洲精品456在线播放app| 熟妇人妻久久中文字幕3abv| 亚洲精品成人久久久久久| 69人妻影院| 三级国产精品片| 亚洲精品一区蜜桃| 久久久a久久爽久久v久久| 精品无人区乱码1区二区| 国产成人a∨麻豆精品| 99久久无色码亚洲精品果冻| 亚洲aⅴ乱码一区二区在线播放| 一边摸一边抽搐一进一小说| 乱码一卡2卡4卡精品| 亚洲经典国产精华液单| 色吧在线观看| 听说在线观看完整版免费高清| 我要看日韩黄色一级片| 99热这里只有是精品在线观看| 麻豆国产97在线/欧美| 日日撸夜夜添| 久久久久久久久久黄片| 亚洲欧洲日产国产| 天天躁日日操中文字幕| 日韩视频在线欧美| 在线观看av片永久免费下载| 嫩草影院新地址| 成年免费大片在线观看| 亚洲av二区三区四区| 免费av毛片视频| 日本av手机在线免费观看| 嫩草影院入口| 免费观看的影片在线观看| 黑人高潮一二区| 国产免费一级a男人的天堂| 九草在线视频观看| 搞女人的毛片| av在线蜜桃| 久久久精品94久久精品| 老司机福利观看| .国产精品久久| 日本wwww免费看| 久久久久九九精品影院| 少妇的逼水好多| 91aial.com中文字幕在线观看| 最近中文字幕高清免费大全6| 一卡2卡三卡四卡精品乱码亚洲| 白带黄色成豆腐渣| 床上黄色一级片| 丰满少妇做爰视频| 99热6这里只有精品| 免费观看的影片在线观看| 深爱激情五月婷婷| 亚洲熟妇中文字幕五十中出| 亚洲国产欧洲综合997久久,| 国产高清国产精品国产三级 | 日本黄色视频三级网站网址| 久久久精品欧美日韩精品| 极品教师在线视频| 熟女电影av网| 精品一区二区免费观看| 激情 狠狠 欧美| 欧美成人一区二区免费高清观看| 九九热线精品视视频播放| 男人舔奶头视频| 日日啪夜夜撸| 久久亚洲精品不卡| 一本—道久久a久久精品蜜桃钙片 精品乱码久久久久久99久播 | 真实男女啪啪啪动态图| 天天一区二区日本电影三级| 精品国产一区二区三区久久久樱花 | 成人综合一区亚洲| 久久人妻av系列| 少妇的逼水好多| 插逼视频在线观看| 国产精品女同一区二区软件| 一卡2卡三卡四卡精品乱码亚洲| 国产在线男女| 欧美区成人在线视频| 日韩,欧美,国产一区二区三区 | 午夜视频国产福利| 国产一级毛片在线| 观看免费一级毛片| 中文天堂在线官网| 99久久无色码亚洲精品果冻| 亚洲av不卡在线观看| 中国美白少妇内射xxxbb| 人妻制服诱惑在线中文字幕| 亚洲色图av天堂| 九草在线视频观看| 赤兔流量卡办理| 久久久久性生活片| 欧美另类亚洲清纯唯美| 久久久精品大字幕| 免费av观看视频| 午夜福利高清视频| 九九热线精品视视频播放| 免费看日本二区| 你懂的网址亚洲精品在线观看 | a级毛色黄片| 亚洲国产精品久久男人天堂| 桃色一区二区三区在线观看| 在线免费十八禁| 国产精品一二三区在线看| 国产免费又黄又爽又色| 看片在线看免费视频| 黄色日韩在线| 国产av不卡久久| 国产精品一二三区在线看| 国产成人freesex在线| 蜜臀久久99精品久久宅男| eeuss影院久久| 男女啪啪激烈高潮av片| 久久欧美精品欧美久久欧美| 成人毛片a级毛片在线播放| 午夜福利在线在线| av国产免费在线观看| 国产成人福利小说| 嘟嘟电影网在线观看| 三级经典国产精品| 国产精品一及| videossex国产| 欧美成人精品欧美一级黄| 男人舔奶头视频| .国产精品久久| 女人久久www免费人成看片 | 国产精品精品国产色婷婷| 日本免费a在线| 毛片一级片免费看久久久久| 成年av动漫网址| 亚洲第一区二区三区不卡| 国产男人的电影天堂91| 性色avwww在线观看| 国产精品国产三级国产专区5o | 中文字幕久久专区| 男女那种视频在线观看| 国产精品久久久久久久电影| 精华霜和精华液先用哪个| 欧美一区二区国产精品久久精品| 久久6这里有精品| 国产一区有黄有色的免费视频 | av福利片在线观看| 波多野结衣高清无吗| 菩萨蛮人人尽说江南好唐韦庄 | 草草在线视频免费看| 国产一区二区亚洲精品在线观看| 午夜亚洲福利在线播放| 国产单亲对白刺激| 日韩人妻高清精品专区| 午夜福利视频1000在线观看| 久久精品熟女亚洲av麻豆精品 | 午夜激情欧美在线| 毛片女人毛片| 最近的中文字幕免费完整| 欧美潮喷喷水| 黄色配什么色好看| 久久99蜜桃精品久久| 99国产精品一区二区蜜桃av| 午夜免费男女啪啪视频观看| 亚洲精品色激情综合| 韩国高清视频一区二区三区| 大香蕉久久网| 午夜a级毛片| 欧美性猛交黑人性爽| 日本av手机在线免费观看| 国产淫片久久久久久久久| 蜜桃久久精品国产亚洲av| 亚洲在久久综合| 日产精品乱码卡一卡2卡三| 日韩av不卡免费在线播放| 午夜亚洲福利在线播放| 国产探花在线观看一区二区| 亚洲一级一片aⅴ在线观看| 久久精品夜色国产| 国产欧美日韩精品一区二区| 国产老妇女一区| 国产爱豆传媒在线观看| 2021天堂中文幕一二区在线观| av免费在线看不卡| 日日摸夜夜添夜夜爱| 国产乱来视频区| 亚洲欧美成人精品一区二区| 久久久成人免费电影| a级毛色黄片| 国产精品一区二区性色av| 日本免费a在线| 亚洲av电影不卡..在线观看| 日本熟妇午夜| 在线a可以看的网站| 国产乱人视频| 美女大奶头视频| 亚洲18禁久久av| 国产精品久久视频播放| 如何舔出高潮| 国语自产精品视频在线第100页| 听说在线观看完整版免费高清| 午夜精品一区二区三区免费看| 91午夜精品亚洲一区二区三区| 青春草亚洲视频在线观看| 人妻系列 视频| 最近最新中文字幕大全电影3| 国产日韩欧美在线精品| 特级一级黄色大片| 久久久午夜欧美精品| 91久久精品国产一区二区三区| 亚洲国产色片| 色尼玛亚洲综合影院| 国语自产精品视频在线第100页| 亚州av有码| 啦啦啦啦在线视频资源| 最近中文字幕2019免费版| 亚洲伊人久久精品综合 | 亚洲av一区综合| 日韩亚洲欧美综合| 国产v大片淫在线免费观看| 三级经典国产精品| 我的女老师完整版在线观看| 日日摸夜夜添夜夜爱| 精品久久久久久成人av| 日韩精品青青久久久久久| 日本欧美国产在线视频| 插逼视频在线观看| 乱人视频在线观看| 嘟嘟电影网在线观看| 欧美变态另类bdsm刘玥| 欧美区成人在线视频| 午夜福利在线在线| av国产免费在线观看| 日韩av在线大香蕉| 国产又黄又爽又无遮挡在线| 久久久久国产网址| 亚洲美女视频黄频| 久久久久久久国产电影| 久久精品国产自在天天线| 国产精品国产三级国产专区5o | 免费观看性生交大片5| 晚上一个人看的免费电影| 日韩成人伦理影院| 亚洲成人av在线免费| 午夜日本视频在线| 床上黄色一级片| 久久精品久久久久久久性| 亚洲性久久影院| www日本黄色视频网| 日韩高清综合在线| 午夜福利在线在线| 天天躁日日操中文字幕| 久久久久国产网址| 欧美zozozo另类| 亚洲欧美成人精品一区二区| 精品一区二区免费观看| 一本久久精品| 青春草视频在线免费观看| 欧美日韩综合久久久久久| 91av网一区二区| 日本猛色少妇xxxxx猛交久久| 18禁裸乳无遮挡免费网站照片| 久久久久久久国产电影| 黄色日韩在线| 亚洲av成人精品一二三区| av卡一久久| 91精品一卡2卡3卡4卡| 久久久精品94久久精品| 小蜜桃在线观看免费完整版高清| 久久久成人免费电影| 久99久视频精品免费| 99热精品在线国产| 插逼视频在线观看| 国产伦精品一区二区三区视频9| 欧美性猛交黑人性爽| 成年免费大片在线观看| 天天一区二区日本电影三级| 日韩欧美精品v在线| 女人十人毛片免费观看3o分钟| 亚洲天堂国产精品一区在线| 99久久中文字幕三级久久日本| 99久久成人亚洲精品观看| 亚洲精品,欧美精品| 国产精品久久久久久av不卡| 久久精品国产亚洲av天美| 内射极品少妇av片p| 久久久亚洲精品成人影院| 亚洲精品色激情综合| 亚洲人与动物交配视频| 伊人久久精品亚洲午夜| 哪个播放器可以免费观看大片| 一级二级三级毛片免费看| 中文字幕亚洲精品专区| 色噜噜av男人的天堂激情| 18禁动态无遮挡网站| 国产极品天堂在线| 欧美成人一区二区免费高清观看| 午夜精品一区二区三区免费看| 日本黄色视频三级网站网址| 亚州av有码| 国产精品久久久久久精品电影小说 | 人妻夜夜爽99麻豆av| 午夜老司机福利剧场| 老司机影院毛片| 国模一区二区三区四区视频| 国产午夜精品一二区理论片| 久久99热这里只有精品18| 欧美xxxx黑人xx丫x性爽| 国产在视频线在精品| 日韩在线高清观看一区二区三区| ponron亚洲| 国产精品三级大全| 秋霞在线观看毛片| 日本熟妇午夜| 免费不卡的大黄色大毛片视频在线观看 | 免费大片18禁| 国产色婷婷99| 久久99精品国语久久久| 天美传媒精品一区二区| 亚洲美女搞黄在线观看| 三级经典国产精品| 国产一区亚洲一区在线观看| 亚洲最大成人手机在线| 亚洲欧美成人精品一区二区| 亚洲综合色惰| 成人av在线播放网站| 亚洲无线观看免费| www日本黄色视频网| 麻豆av噜噜一区二区三区| 亚洲国产精品sss在线观看| 中文字幕亚洲精品专区| 2022亚洲国产成人精品| 99在线视频只有这里精品首页| 搞女人的毛片| 亚洲av.av天堂| 一区二区三区四区激情视频| 99热全是精品| 精品少妇黑人巨大在线播放 | 欧美成人精品欧美一级黄| av福利片在线观看| 国产精品三级大全| 成年av动漫网址| 少妇被粗大猛烈的视频| 国产av一区在线观看免费| 在线免费观看的www视频| 乱码一卡2卡4卡精品| 色噜噜av男人的天堂激情| 久久精品夜色国产| av天堂中文字幕网| 亚洲av福利一区| 国产精品永久免费网站| 亚洲精品自拍成人| 校园人妻丝袜中文字幕| 国产亚洲av片在线观看秒播厂 | 亚洲av不卡在线观看| 日本黄色视频三级网站网址| 日本免费a在线| 赤兔流量卡办理| 亚洲一级一片aⅴ在线观看| 成人鲁丝片一二三区免费| 日本-黄色视频高清免费观看| 亚洲婷婷狠狠爱综合网| 国产免费福利视频在线观看| 少妇熟女欧美另类| 最新中文字幕久久久久| 晚上一个人看的免费电影| 村上凉子中文字幕在线| 精品一区二区三区视频在线| 看十八女毛片水多多多| 中文字幕熟女人妻在线| 晚上一个人看的免费电影| 22中文网久久字幕| 国产精品久久视频播放| 在线播放无遮挡| 国产欧美另类精品又又久久亚洲欧美| 中文字幕免费在线视频6| 亚洲精品亚洲一区二区| 插阴视频在线观看视频| 大香蕉久久网| 国产亚洲av嫩草精品影院| 天天躁夜夜躁狠狠久久av| 国产69精品久久久久777片| 亚洲成人精品中文字幕电影| 久久久久性生活片| 久久久久久九九精品二区国产| 最后的刺客免费高清国语| 亚洲欧美成人精品一区二区| 97人妻精品一区二区三区麻豆| 亚洲四区av| 欧美不卡视频在线免费观看| 成人特级av手机在线观看| 男女那种视频在线观看| 婷婷六月久久综合丁香| 日韩一区二区三区影片| 两个人的视频大全免费| 久久精品夜色国产| 欧美日韩精品成人综合77777| 中文在线观看免费www的网站| 麻豆国产97在线/欧美| 国产久久久一区二区三区| 亚洲熟妇中文字幕五十中出| 欧美另类亚洲清纯唯美| 汤姆久久久久久久影院中文字幕 | 亚洲欧美精品自产自拍| 偷拍熟女少妇极品色| 免费人成在线观看视频色| 亚洲综合色惰| 99久久人妻综合| 精品久久久久久电影网 | 久久久精品94久久精品| 可以在线观看毛片的网站| 亚洲av男天堂| 亚洲一区高清亚洲精品| 午夜福利高清视频| 精品久久久久久久末码| 在线播放无遮挡| 在线免费观看的www视频| 欧美高清性xxxxhd video| 少妇人妻一区二区三区视频| 午夜激情福利司机影院| 国产爱豆传媒在线观看| 免费黄色在线免费观看| 久久久久精品久久久久真实原创| 大又大粗又爽又黄少妇毛片口| 免费观看人在逋| 九九在线视频观看精品| 国内精品宾馆在线| 寂寞人妻少妇视频99o| 99久国产av精品国产电影| 天天一区二区日本电影三级| 中国国产av一级| 免费观看精品视频网站| 亚洲欧美成人综合另类久久久 | 夫妻性生交免费视频一级片| 一二三四中文在线观看免费高清| 日韩国内少妇激情av| 国产精品精品国产色婷婷| 成人一区二区视频在线观看| 最近最新中文字幕免费大全7| 国产欧美日韩精品一区二区| 欧美激情在线99| 亚洲中文字幕一区二区三区有码在线看| 99热这里只有是精品在线观看| 国产不卡一卡二| 99久久人妻综合| 亚洲怡红院男人天堂| 一卡2卡三卡四卡精品乱码亚洲| 亚洲乱码一区二区免费版| 男插女下体视频免费在线播放| 联通29元200g的流量卡| 日本色播在线视频| 99热全是精品| 欧美日本亚洲视频在线播放| 人妻系列 视频| 日日啪夜夜撸| 日本欧美国产在线视频| 国产午夜福利久久久久久| 国产精品嫩草影院av在线观看| 91aial.com中文字幕在线观看| 真实男女啪啪啪动态图| 综合色丁香网| 天堂中文最新版在线下载 | 亚洲性久久影院| 中国国产av一级| 国产高清三级在线| 少妇丰满av| 日本黄色片子视频| 国产av在哪里看| 国产精品国产三级国产专区5o | 久久国内精品自在自线图片| 人人妻人人澡人人爽人人夜夜 | 亚洲欧美清纯卡通| 久久99精品国语久久久| 久久久a久久爽久久v久久| 欧美性猛交黑人性爽| 国产伦一二天堂av在线观看| 嫩草影院新地址| 亚洲av二区三区四区| 欧美成人a在线观看| 特级一级黄色大片| 国产亚洲午夜精品一区二区久久 | 高清日韩中文字幕在线| 精品人妻一区二区三区麻豆| 日韩精品有码人妻一区| 欧美成人精品欧美一级黄| 亚洲真实伦在线观看| 久久久久久久国产电影| 成人特级av手机在线观看| 一级黄色大片毛片| 啦啦啦啦在线视频资源| 亚洲av成人av| 中国美白少妇内射xxxbb| 国产黄片视频在线免费观看| 欧美最新免费一区二区三区| 毛片一级片免费看久久久久| 精品国产一区二区三区久久久樱花 | 久久精品国产亚洲网站| 日韩欧美 国产精品| 午夜爱爱视频在线播放| av播播在线观看一区| 久久久久久久国产电影| 国产精品久久久久久精品电影| 国产精品永久免费网站| 色综合亚洲欧美另类图片| 最近最新中文字幕大全电影3| 一级毛片电影观看 | 成人漫画全彩无遮挡| 免费看光身美女| 一级毛片我不卡| 久久久久久九九精品二区国产| 三级毛片av免费| 中文欧美无线码| 亚洲精品456在线播放app| 黄片wwwwww| 欧美另类亚洲清纯唯美| 欧美日韩在线观看h| 一级av片app| 国产午夜精品久久久久久一区二区三区| 成人鲁丝片一二三区免费| 26uuu在线亚洲综合色| 高清在线视频一区二区三区 | 中文乱码字字幕精品一区二区三区 | 国内少妇人妻偷人精品xxx网站| 久久久久精品久久久久真实原创| 天堂影院成人在线观看| 国产一区亚洲一区在线观看| av专区在线播放| 91久久精品国产一区二区三区| 国产在线男女| 偷拍熟女少妇极品色| 岛国毛片在线播放| 日韩欧美精品免费久久| 国产午夜福利久久久久久| 欧美xxxx性猛交bbbb| 中文欧美无线码| 蜜桃久久精品国产亚洲av| 国产成年人精品一区二区| 国产成人午夜福利电影在线观看| 成人无遮挡网站| 亚洲国产精品sss在线观看|