
    A GPU accelerated finite volume coastal ocean model*

    Xu-dong Zhao (趙旭東)1, Shu-xiu Liang (梁書秀)1, Zhao-chen Sun (孫昭晨)1, Xi-zeng Zhao (趙西增)2, Jia-wen Sun (孫家文)3, Zhong-bo Liu (劉忠波)4

    1.State Key Laboratory of Coastal and Offshore Engineering, Dalian University of Technology, Dalian 116023, China, E-mail: zhaoxvdong@dlut.edu.cn

    2.Ocean College, Zhejiang University, Zhoushan 316021, China

    3.National Marine Environmental Monitoring Center, State Oceanic Administration, Dalian 116023, China

    4.Transportation Management College, Dalian Maritime University, Dalian 116026, China

    With the unstructured grid, the Finite Volume Coastal Ocean Model (FVCOM) is converted from its original FORTRAN code to a Compute Unified Device Architecture (CUDA) C code and optimized on the Graphic Processor Unit (GPU). The proposed GPU-FVCOM is tested against analytical solutions for two standard cases in a rectangular basin, a tide induced flow and a wind induced circulation. It is then applied to Ningbo's coastal waters to simulate the tidal motion and analyze the flow field and the vertical tidal velocity structure. The simulation results agree with the measured data quite well. The accelerated performance of the proposed 3-D model reaches 30 times that of a single thread program, and the GPU-FVCOM implemented on a Tesla K20 device is faster than on a workstation with 20 CPU cores, which shows that the GPU-FVCOM is efficient for large scale sea areas and high resolution engineering problems.

    Graphic Processor Unit (GPU), 3-D ocean model, unstructured grid, finite volume coastal ocean model (FVCOM)

    Introduction

    Parallel computational schemes and large scale, high resolution climate and ocean simulation models have seen rapid development. Traditional CPU parallel algorithms for large scale ocean simulations are generally based on domain decomposition, where the domain is divided into many sub-domains and each sub-domain is handled by a different processor using distributed or shared memory computing. These parallel ocean models are mostly based on the Message Passing Interface (MPI) library and require High Performance Computing (HPC) clusters to realize a high computational capacity. Therefore, high performance is difficult to achieve without access to HPC computers.

    In recent years, microprocessors based on a single Central Processing Unit (CPU) have greatly enhanced the performance and reduced the cost of computer applications[1,2]. However, the increase of the CPU computing capacity is limited by heat dissipation and transistor density. The CPU is not the only processing unit in the computer system: the Graphic Processor Unit (GPU), initially designed for rendering images, is also a highly parallel device[3]. The GPU's main task remains rendering video games, which is achieved through fine-grained parallelism, with pixels rendered by a large number of microprocessors. Multi-threaded processors, and particularly GPUs, have rapidly enhanced the floating point performance since 2003[4]. NVIDIA provided a GPU computing Software Development Kit (SDK) in 2006 that extended the C programming language to use GPUs for general purpose computing. In 2009, the ratio between multi-core GPUs and CPUs in floating point throughput was approximately 10:1, i.e., multi-core GPUs could reach 1 teraflops (1 000 gigaflops), while CPUs could only reach 100 gigaflops. Figure 1 shows that with the GPU computing accelerator NVIDIA Tesla K40M, the double precision floating point performance has now reached 1.68 Tflops.

    Fig.1(a) Floating point operation performance

    Fig.1(b) Memory bandwidth for the CPU and GPU

    A number of programs and applications have been ported to the GPU, such as the lattice Boltzmann method[5], Ansys[6], and DHI MIKE. The use of GPU devices has also been explored for ocean and atmosphere predictions[7,8]. Michalakes and Vachharajani[9] proposed a Compute Unified Device Architecture (CUDA) C based weather prediction model, and achieved a twenty-fold speedup (using an NVIDIA 8800 GTX) compared to a single-thread Fortran program running on a 2.8 GHz Pentium CPU[9]. Most existing climate and ocean models are only accelerated for specific loops using the open accelerator application program interface (OpenACC-API) or CUDA Fortran, and therefore these GPU accelerated models have achieved limited speedups. Horn applied the GPU to a moist fully compressible atmospheric model[10]. Xu et al.[11] developed the Princeton Ocean Model POM.gpu v1.0, a full GPU solution based on the MPI version of the Princeton Ocean Model (mpiPOM). POM.gpu v1.0 with four GPU devices can match the performance of the mpiPOM with 408 standard CPU cores. However, such structured grid models are unable to resolve the complex irregular geometries of tidal creeks in estuarine applications[12]. Keller et al.[13] developed a GPU accelerated MIKE21 to solve 2-D hydrodynamics problems, and the latest DHI MIKE (2016) supports GPU based 3-D hydrodynamic parallel computing.

    The objective of this work is to reduce the computation time for the unstructured grid, Finite Volume Coastal Ocean Model (FVCOM) with a CUDA C parallel algorithm. The CUDA is a minimal extension of the C and C++ programming languages, and allows a program to be generated and executed on GPUs.

    With the FVCOM parallel version, we demonstrate how to develop a GPU based ocean model (GPU-FVCOM) that runs efficiently on a professional GPU. The Fortran FVCOM is first converted to the CUDA C, and optimized for the GPU to further improve the performance.

    In this work, the GPU-FVCOM performance is tested for two standard cases in rectangular basins with different grid numbers and devices, and then applied to the Ningbo coastal region. Particular attention is paid to the tidal current behavior, including the current velocity and the time history of sea levels.

    1. Unstructured grid, finite volume coastal ocean model

    The FVCOM is a prognostic, unstructured grid, finite volume, free surface, 3-D primitive equation coastal ocean circulation model jointly developed by UMASSD-WHOI. To handle large sea areas with high grid density, the METIS partitioning libraries are used to decompose the domain, and the explicit Single Program Multiple Data (SPMD) parallelization method is implemented in the FVCOM[14]. The FVCOM has also been used in a wide range of engineering problems[15-19].

    In the horizontal direction, the spatial derivatives are computed using the cell vertex and cell centroid (CV-CC) multigrid finite volume scheme. In the vertical direction, the FVCOM supports the terrain following sigma coordinates. The 3-D governing equations can be expressed as:

    ∂ξ/∂t + ∂(uD)/∂x + ∂(vD)/∂y + ∂w/∂σ = 0

    ∂(uD)/∂t + ∂(u²D)/∂x + ∂(uvD)/∂y + ∂(uw)/∂σ - fvD = -gD ∂ξ/∂x + (1/D) ∂/∂σ(Km ∂u/∂σ) + DFx

    and

    ∂(vD)/∂t + ∂(uvD)/∂x + ∂(v²D)/∂y + ∂(vw)/∂σ + fuD = -gD ∂ξ/∂y + (1/D) ∂/∂σ(Km ∂v/∂σ) + DFy

    where t represents the time, D is the total water depth, ξ denotes the surface elevation, x and y are the Cartesian coordinates, σ is the sigma coordinate, u, v and w represent the velocity components in the x, y and σ directions, g is the gravitational acceleration, f is the Coriolis parameter, Am and Km represent the horizontal and vertical diffusion coefficients, respectively, and DFx and DFy denote the horizontal diffusion terms parameterized with Am.

    The FVCOM uses the time splitting scheme to separate the 2-D (external mode) from the 3-D (internal mode) momentum equations. The external mode calculation updates the surface elevation and the vertically averaged velocities. The internal mode calculation updates the velocity, the temperature, the salinity, the suspended sediment concentration, and the turbulence quantities.

    The momentum and continuity equations for the internal mode are integrated numerically using the explicit Euler time stepping scheme, and those for the external mode are integrated using a modified second order Runge-Kutta scheme

    ξ^(n,k) = ξ^n - (α^k ΔtE/Ω) R^(k-1),  with α^k = 1/4, 1/3, 1/2, 1 for k = 1, …, 4

    where R denotes the residual (the net flux through the control volume boundary), Ω is the area of the control volume element, ΔtE represents the external time step, the superscripts n and n+1 denote the external time levels, and k denotes the Runge-Kutta stage. Since the spatial accuracy of the discretization is of the second order, the simplified Runge-Kutta scheme is sufficient.

    The external and the internal modules are the most computationally intensive kernels. In this work, the modified Mellor and Yamada level 2.5 turbulence closure scheme is implemented for the vertical and horizontal mixing.

    The FVCOM is a multi-function ocean model with a very large source code. The GPU-FVCOM implements only the 3-D hydrodynamics, the temperature, the salinity, and the sediment transport; a full port of the FVCOM to the GPU remains for future research.

    2. Algorithm development

    In contrast to traditional parallel programs executed on the CPU, the GPU is specialized for intensive, highly parallelizable problems: many more transistors are devoted to data processing rather than to data caching and flow control[20]. As many threads as possible should therefore be launched in parallel, with each thread executing a specific computing task.

    The CUDA allows the user to conveniently program and execute data level parallel programs on the GPU. The data parallel processing maps the data elements onto the parallel processing threads: each thread processes a single data element rather than looping over the data. GPU support is now available in MATLAB, and CUDA development has become much easier with C/C++/Fortran and Python; however, some new features and math libraries only support the CUDA C language.
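    As a minimal sketch of this element-to-thread mapping (the kernel and array names are illustrative assumptions, not GPU-FVCOM variables), a CUDA kernel assigns one thread to each grid element instead of looping over the elements on the host:

    // One thread per element: the loop over elements disappears from the code.
    __global__ void add_forcing(float *u, const float *f, int nElems)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
        if (i < nElems)
            u[i] += f[i];                                // each thread updates one element
    }

    // Host-side launch with enough blocks to cover all elements:
    // add_forcing<<<(nElems + 255) / 256, 256>>>(d_u, d_f, nElems);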

    In this work, a massively parallel GPU version of the FVCOM is developed based on the CUDA C. In the GPU-FVCOM, the CPU is only used to initialize variables, output results, and activate device subroutines executed on the GPU. The overall and optimization strategies for the GPU-FVCOM are introduced in the following sections.

    2.1 Preliminary optimization

    The first stage of the GPU-FVCOM development involves the following method to realize data level parallelism without optimization.

    Figure 2 illustrates the code structure of the GPU-FVCOM. The major difference between the FVCOM and the GPU-FVCOM is that the CPU is only used for initializing the relevant arrays and providing the output. The GPU executes all model computing processes, and the data is only sent to the host for the file output at user-specified time steps.
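    The structure of Fig.2 can be summarized by the following host-side skeleton, a hedged sketch in which external_mode, internal_mode and the step counts are placeholders rather than the actual GPU-FVCOM routines: the CPU initializes and copies the data once, launches device kernels every time step, and copies results back only at the output interval.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Placeholder kernels standing in for the external- and internal-mode solvers.
    __global__ void external_mode(float *eta, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) eta[i] += 0.0f; }
    __global__ void internal_mode(float *u, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) u[i] += 0.0f; }

    int main(void)
    {
        const int n = 1 << 16, nSteps = 100, iSplit = 10, outEvery = 50;
        size_t bytes = n * sizeof(float);
        float *h_eta = (float *)calloc(n, sizeof(float));
        float *d_eta, *d_u;
        cudaMalloc((void **)&d_eta, bytes);
        cudaMalloc((void **)&d_u, bytes);
        cudaMemcpy(d_eta, h_eta, bytes, cudaMemcpyHostToDevice);    // CPU initializes once

        dim3 block(256), grid((n + 255) / 256);
        for (int step = 0; step < nSteps; step++) {
            for (int k = 0; k < iSplit; k++)                        // external 2-D sub-steps
                external_mode<<<grid, block>>>(d_eta, n);
            internal_mode<<<grid, block>>>(d_u, n);                 // internal 3-D step

            if ((step + 1) % outEvery == 0) {                       // copy back only for file output
                cudaMemcpy(h_eta, d_eta, bytes, cudaMemcpyDeviceToHost);
                printf("file output at step %d\n", step + 1);
            }
        }
        cudaFree(d_eta); cudaFree(d_u); free(h_eta);
        return 0;
    }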

    Fig.2 Flowchart of GPU-FVCOM processing

    (1) Data stored in the global memory and repeatedly accessed in the GPU threads are replaced by temporary variables stored in the local register memory. The register memory in the GPU threads can be accessed faster than the global memory. Figure 3 shows that this approach improves the GPU-FVCOM performance by 20%.
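    A hedged illustration of this register optimization (the array names are assumed for the example, not taken from the GPU-FVCOM source): a value that would otherwise be re-read from global memory in every expression is loaded once into a local variable, which the compiler keeps in a register.

    // The depth d_depth[i] is read from global memory once and then reused from a register.
    __global__ void update_velocity(float *u, float *v, const float *d_depth,
                                    const float *fx, const float *fy, int nElems)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nElems) return;
        float di = d_depth[i];        // single global load, kept in a register
        u[i] = u[i] + fx[i] / di;     // reused here ...
        v[i] = v[i] + fy[i] / di;     // ... and here, avoiding repeated global reads
    }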

    Fig.3 Speedup ratio of kernel function with/without utilizing register compared with sequential program

    Fig.4 Sample code without/with atomic memory operations

    Fig.5 Loop fusion and original method

    (2) The GPU launches a large number of threads simultaneously. While solving the momentum equations, the GPU-FVCOM loops over the element edges. With this approach, two or more threads attempting to update the same data cause a race condition and return a wrong answer. To avoid such memory access conflicts, atomic memory operations are implemented on the GPU to ensure that only one thread at a time performs the read-modify-write operation, as shown in Fig.4. Implementing this process reduces the execution time of the kernel function by almost 50%.
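    A minimal sketch of the edge-loop race and its atomic fix, in the spirit of Fig.4 (kernel and array names are assumptions): each thread handles one element edge and accumulates its flux into the two neighbouring cells, so two edges sharing a cell may write to it concurrently; atomicAdd serializes each read-modify-write.

    // One thread per edge; every edge adds its flux to its two neighbour cells.
    // A plain "+=" here would race whenever two edges share a cell.
    __global__ void accumulate_fluxes(float *cellSum, const float *edgeFlux,
                                      const int *leftCell, const int *rightCell, int nEdges)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEdges) return;
        float f = edgeFlux[e];
        atomicAdd(&cellSum[leftCell[e]],   f);   // safe read-modify-write
        atomicAdd(&cellSum[rightCell[e]], -f);   // flux leaves one cell, enters the other
    }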

    Fig.6 Ratio of the original algorithm to loop fusion method execution times for different numbers of horizontal grids (line type) and sigma layers (x-axis)

    (3) Loop fusion merges several sequential loops with the same trip count into one loop. It is an effective and simple way to reduce global memory accesses by keeping repeatedly used data in registers: the GPU reads a value from global memory once and keeps it in register memory for reuse. Figure 5 shows the execution of the CUDA parallel program with and without the loop fusion method; with loop fusion, the sequential loops are fused into one kernel function and register variables are used to reduce the global memory access latency, as sketched below.
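    A hedged sketch of the two forms in Fig.5 (the array names are illustrative): two kernels that each stream the same arrays through global memory are fused into a single kernel, so the intermediate value never leaves a register.

    // Unfused: two launches, the intermediate array "tmp" round-trips through global memory.
    __global__ void stage1(float *tmp, const float *a, const float *b, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) tmp[i] = a[i] + b[i]; }
    __global__ void stage2(float *out, const float *tmp, const float *c, int n)
    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = tmp[i] * c[i]; }

    // Fused: one launch, the intermediate value stays in the register variable "t".
    __global__ void fused(float *out, const float *a, const float *b, const float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = a[i] + b[i];   // kept in a register, never written to global memory
            out[i]  = t * c[i];
        }
    }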

    With this method, the GPU-FVCOM computational performance is significantly improved. Figure 6 shows that the speedup ratios (the ratio of the original algorithm execution time to the loop fusion method execution time) for different horizontal grid numbers increase with the increase of the number of sigma levels.

    Fig.7 Passing a 2-D array from host memory to device memory

    Fig.8 The relationship between the σ layer distribution and the thread assignment

    Thus, the execution speed is improved by 70% when there are adequate horizontal grids and sigma levels.

    2.2 Special optimization

    (1) Implementation of multidimensional array data. Since the GPU and CPU memories are independent, the CUDA cannot simply use “cudaMemcpy” to transfer dynamically allocated multidimensional array variables between the CPU and the GPU. In the CUDA API libraries, the “cudaMalloc” function only allocates linear memory on the device and returns a pointer to the allocated graphics memory, while “cudaMemcpy” copies data from the memory area pointed to by the source pointer to the memory area pointed to by the destination pointer. Figure 7 shows how the array passing method is implemented in this work, so that the multidimensional arrays can be accessed directly by the kernel functions.
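    One way to realize the scheme of Fig.7 is sketched below under stated assumptions (this illustrates the idea, not the exact GPU-FVCOM code): each row of the host array is copied into its own device buffer, and a table of device row pointers is itself copied to the device so that a kernel can use a[i][j] style indexing.

    #include <cuda_runtime.h>

    // Builds a device-side 2-D array from a host array of row pointers.
    // d_rows is a host-side scratch array that receives the per-row device buffers.
    float **copy_2d_to_device(float **h_a, float **d_rows, int nRows, int nCols)
    {
        for (int i = 0; i < nRows; i++) {
            cudaMalloc((void **)&d_rows[i], nCols * sizeof(float));          // row storage
            cudaMemcpy(d_rows[i], h_a[i], nCols * sizeof(float),
                       cudaMemcpyHostToDevice);                              // copy row data
        }
        float **d_a;
        cudaMalloc((void **)&d_a, nRows * sizeof(float *));                  // pointer table
        cudaMemcpy(d_a, d_rows, nRows * sizeof(float *), cudaMemcpyHostToDevice);
        return d_a;   // kernels can now index d_a[i][j] directly
    }

    Many GPU codes instead flatten such arrays into a single linear buffer indexed by i*nCols + j, which avoids the extra pointer indirection; the pointer-table form shown here keeps the original Fortran-style two-index access.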

    Fig.9 Kernel scaling on stream multiprocessors

    Table 1 GPU utilization for different numbers of threads per block

    (2) Optimization with height dependence release. In some 3-D kernel functions, the k components of the sigma layers are not independent when solving the vertical diffusion term of the momentum and turbulence equations. For these terms, the GPU-FVCOM keeps a loop over the k sigma layers inside the kernel function to reduce the shared data access, and each thread computes all k components. For the horizontal advection term, the k-th component does not depend on the other layer components, so the GPU grid size is divided into k levels and each thread computes one component. The kernel functions are shown in Fig.8.
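    A hedged sketch of the two thread assignments of Fig.8 (the kernel bodies and array names are assumptions standing in for the actual solvers): the vertical sweep keeps the dependent σ-layer loop inside one thread per horizontal element, while the horizontal update uses one thread per (element, layer) pair because the layers are independent.

    // Vertical direction: layers depend on each other, so one thread per element
    // sweeps all kz sigma layers and carries values between layers in registers.
    __global__ void vertical_sweep(float *u, int nElems, int kz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // horizontal element index
        if (i >= nElems) return;
        float above = u[i];                               // previous-layer value kept in a register
        for (int k = 1; k < kz; k++) {
            float cur = u[k * nElems + i];
            u[k * nElems + i] = 0.5f * (cur + above);     // stand-in for the vertical update
            above = cur;
        }
    }

    // Horizontal direction: layers are independent, so one thread per (element, layer).
    // Launched as horizontal_update<<<dim3((nElems + 255) / 256, kz), 256>>>(...).
    __global__ void horizontal_update(float *u, const float *adv, int nElems, int kz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // element index
        int k = blockIdx.y;                               // sigma layer index
        if (i < nElems && k < kz)
            u[k * nElems + i] += adv[k * nElems + i];
    }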

    The height dependence release is an effective way to hide the global memory access latency, since it keeps scalar variables in registers for data reuse. The execution time is thus reduced by 10%-50%.

    (3) Optimization with block size. The streaming multiprocessors (SMs) are the part of the GPU that runs the CUDA kernel functions. The occupancy of each SM is limited by the number of threads per block, the register usage, and the shared memory usage. As shown in Fig.9, the threads in each block are executed by the same SM, and each SM splits its blocks into warps (currently 32 threads per warp). The GPU device needs more than one thread per core to hide the global memory latency and run with high performance. To fully use the computational capability of each SM, the number of threads in each block should be a multiple of the warp size.

    The Tesla K20x card supports up to 64 active warps per streaming multiprocessor, and contains 14 multiprocessors and 2 688 cores (192 stream processors per multiprocessor). The Single Instruction, Multiple Thread (SIMT) execution model is implemented in the CUDA architecture. The GPU utilization for different compute capabilities and different numbers of threads per block is shown in Table 1. The only configuration that obtains 100 percent utilization across all generations of the GPU device is 256 threads per block. Several tests are conducted, and the experimental results are consistent with Table 1. Therefore, to ensure the maximum GPU utilization, we assign 256 threads per block.
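    The resulting launch configuration can be sketched as follows (a fragment reusing the hypothetical update_velocity kernel from the earlier register example; nElems and the device arrays are assumed to be defined): the block size is fixed at 256 threads, a multiple of the 32-thread warp, and the grid size is rounded up to cover all elements.

    // Fix the block size at 256 threads (eight 32-thread warps) and round the grid size up.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (nElems + threadsPerBlock - 1) / threadsPerBlock;
    update_velocity<<<blocksPerGrid, threadsPerBlock>>>(d_u, d_v, d_depth, d_fx, d_fy, nElems);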

    3. Numerical scheme verification

    Two test cases are selected to analyze the GPU-FVCOM computing efficiency and accuracy. Both cases are run on a variety of GPU devices, and the speedup ratios are approximately the same in each case. Thus, for space considerations only the speedup ratios for the first case with different devices, parallel tools, and grid numbers are presented.

    3.1 Verification of 2-D and 3-D M2 tide induced current circulation

    The GPU-FVCOM is applied to the tide induced circulation in a rectangular basin with a length of 3.4 km, a width of 1.6 km and a depth of 10 m. The geometry and the horizontal meshes are shown in Fig.10.

    Fig.10 Horizontal meshes for 2-D and 3-D tidal induced circulation

    An M2 tide with a maximum amplitude of 1 m at the open boundary is used as the free surface water elevation. In this case, the Coriolis force coefficient is f = 0, and the analytical solution is

    ξ = A cos(ωt)

    and

    u = (Aωx/h) sin(ωt)

    where A is the tidal amplitude, ω = 2π/T is the angular frequency, h is the water depth, x is the distance from the verified point to the left solid boundary, T is the M2 tidal period, ξ is the free surface elevation, and u is the longitudinal component of the depth-averaged horizontal velocity. The numerical simulation parameters are shown in Table 2.

    Figure 11 compares the computed free surface elevation and the depth-averaged velocities at the verified point with the analytical solutions. The maximum water surface elevations are in agreement within a margin of 3.5% and the maximum u-velocity components are in agreement within a margin of 4.2%.

    Table 2 Model parameters for the tidal induced test case

    Fig.11 Analytical and computed free surface elevation and depth-averaged velocity at the verified point of the basin

    3.2 Verification of 3-D wind induced circulation

    This experiment validates the proposed model's capability in simulating the wind induced circulation in a closed rectangular basin, as shown in Fig.12. Using the no-slip condition at the channel bed, the uniform, steady state analytical solution for the horizontal velocity component in a well-mixed channel with a known constant vertical eddy viscosity coefficient is obtained as

    ∂ξ/∂x = 3τwx/(2ρgh),  u(z) = (τwx/(4ρKm h)) z (3z - 2h)

    where u is the horizontal velocity component in the x-direction, z is the distance above the bed, h is the water depth, ξ is the free surface elevation above the initial water elevation, Km is the eddy viscosity coefficient in the vertical direction, τwx is the x component of the surface wind stress, and ρ is the water density.

    Fig.12 Horizontal meshes for wind induced circulation

    Table 3 Model parameters for the wind induced test case

    Fig.13 Analytical and computed x-component of the horizontal velocity profile in the vertical direction at the center of the rectangular basin

    The corresponding bottom stress is τb = -τwx/2, i.e., half the surface wind stress in magnitude and opposite in direction.

    A rectangular basin with a length of 2.0 km and a width of 0.8 km is used, with related parameters as shown in Table 3.

    The convergence results are obtained from a cold start after the program runs for 10 internal time steps. To avoid the influence of the reflection from the solid boundary, the central point of the rectangular basin is chosen as the verification point, and the computed horizontal velocity along the vertical profile is compared with the analytical solution, as shown in Fig.13.

    3.3 Comparisons of computational performance on variety of devices

    The computational performance of the sequential program, the MPI parallel code, and the GPU code is compared on three different GPU devices, as listed in Table 4. The simulations are run with the 3-D GPU-FVCOM on five horizontal meshes of 1 020, 4 080, 16 320, 65 280 and 261 120 nodes. To fully utilize the GPU performance, the kernel function is used to solve the advection diffusion equation of each element.

    Table 4 Hardware device versions

    Figure 14 compares the speedup for 3-D simulations for a variety of horizontal meshes and devices in the tidal induced case. The speedup ratio is the computational time for a single thread program divided by that for a parallel program.

    Fig.14 Speedup of 3-D simulations for a variety of meshes and devices in the tidal induced case

    When the number of the grid nodes is less than 16 320, the 2-D and 3-D GPU-FVCOM speedup ratios are less than that of the MPI parallel code. However, when the number of the grid nodes reaches 65 280, the 3-D GPU-FVCOM speedup ratio exceeds that of the MPI parallel code.

    The speedup ratio is more than 31 for the 3-D simulation with the 261 120 node grid. For a low horizontal resolution, the GPU-FVCOM speedup running on the Tesla K20M is less than that of the MPI parallel program running on a 32 thread HPC cluster. However, the speedup of the MPI parallel code with 32 threads decreases slightly when the number of grid nodes approaches 261 120, whereas the GPU-FVCOM speedup is still increasing. Thus, with the grid sizes in the trial run, the Tesla K20M performance is not fully exploited, and further increasing the number of threads (with a larger grid density) will produce larger speedups on professional GPU computing devices.

    Fig.15 Simulation time for each data update with different sigma levels and horizontal meshes in the tidal induced case executed on Tesla K20Xm

    Figure 15 compares the execution time for each data update with different sigma levels and horizontal meshes in the tidal induced case executed on the Tesla K20Xm. The 5-sigma-layer and 50-sigma-layer execution times are 2-4 times and 5-19 times that of the 2-D model, respectively. Extrapolating from these results, the 2-D model performance is 28 MCUPS (Million Cell Updates Per Second).

    Therefore, the GPU power can be gradually exploited by increasing the grid resolution and sigma levels. When the grid vertices exceed approximately 65 000, the GPU-FVCOM computing efficiency exceeds that of a 32 thread MPI parallel ocean model. Thus, the GPU-FVCOM has a significant potential for large scale domain and high resolution computations while maintaining the practical computing time.

    4. Application to Ningbo’s coastal waters

    4.1 Model configuration and verification

    The GPU-FVCOM is applied to simulate the flow field of the Ningbo water area, which has one of the most complicated coastlines and topographies in China.

    The computational domain covers all of Ningbo's coastline and the neighboring islands, with more than 174 000 triangular elements. The horizontal computational domain is shown in Fig.16, covering an area of 200 km×220 km. The length of the element side varies from 150 m near complex coastlines to 3 500 m in offshore areas, and 11 sigma levels are adopted in the vertical direction.

    Fig.17 Simulated and measured tidal elevations at station T1

    Fig.18 Simulated and measured surface tidal currents at station 1#

    The GPU-FVCOM verification is carried out by comparing simulated and measured data. The data come from measurements over the period from 30 June to 1 July, 2009. The measurement stations are shown in Fig.16, including 7 tidal elevation stations (T1-T7) and 8 tidal current stations (1#-8#). In this work, T1 and 1# are chosen as the representative stations.

    Figure 17 compares the simulated and observed water surface elevations at station T1. The simulated and measured data show good agreement, with the magnitude and the phase of the free surface water elevation being well matched.

    Figure 18 compares the simulated and observed tide induced currents at station 1#. The simulated and observed currents show good agreement, with the magnitude and the phase favorably matched.

    4.2 Flow field characteristics

    Figure 19 shows the surface flow field at flood tide. The simulated result shows that the tide flows southeast along the Ningbo and Zhoushan coastlines, passes Nan Jiushan, and then enters the Xiangshan Bay and the Hangzhou Bay. The tide direction follows the coastline of the Hangzhou Bay, and the average flood tide direction is between 230° and 300°. The flow in the north of the Hangzhou Bay is affected by the flood tides, and the net water flux is directed into the bay.

    The current speed along the northern coastline is larger than that along the southern shore at the flood tide. After the tide flows into the Xiangshan Bay, the impact of the topography and the reflection of the land boundary gradually turns the tide from a progressive tidal wave to a standing wave, with the tidal range gradually increasing from the entrance to the inner bay.

    Overall, the results show that the GPU-FVCOM can simulate the flow field of the Ningbo water area satisfactorily.

    4.3 Computational performance for Ningbo's coastal waters

    Figure 20 shows the GPU-FVCOM speedup for different GPU devices. The speedup ratios are similar for the same grid number. The GPU-FVCOM executed on the GTX 460 produces a speedup of approximately 13 times over the 3-D single thread program, and the Tesla K20Xm provides a speedup of 27. Hence, the GPU-FVCOM yields a powerful computational performance for ocean circulation simulations over complex topographies, and the GPU accelerated FVCOM is significantly faster than the original code.

    Fig.20 Speedup ratio for different GPU devices for Ningbo coastal water simulation

    5. Conclusions

    A GPU based parallel version of the FVCOM is developed and optimized for execution on the GPU. The speedup of the 3-D ocean model is impressive: up to 30 times with high grid densities.

    Two classic experiments are performed to validate the accuracy of the proposed GPU-FVCOM. The 2-D GPU-FVCOM achieves a 27-fold speedup when the number of grid vertices exceeds 2.6×10⁴, and under the same conditions the GPU-FVCOM speedup is superior to that of a 32 thread MPI parallel program running on a cluster. The 3-D GPU-FVCOM achieves a 31-fold speedup. Even using a laptop graphics card K780M, the computational performance of the 3-D GPU-FVCOM is slightly superior to that of a 32 thread MPI parallel program running on a high performance cluster.

    The GPU-FVCOM is applied to simulate the tidal motion of the Ningbo coastal water with its complicated coastline. The simulated surface elevations and current velocities are in good agreement with the measured data. The simulated velocity shows the current speed differences between the surface and bottom layers. A large quantity of water flows into the Xiangshan Bay and the Hangzhou Bay. After the tide flows into the Xiangshan Bay, the impact of the topography and the reflection at the land boundary gradually turn the progressive tidal wave into a standing wave. The speedup of the GPU-FVCOM varies for different GPU devices, but the 3-D GPU-FVCOM retains a high speedup, and its computational performance is significantly superior to that of an MPI parallel program executed on a high performance workstation with 32 CPU cores.

    The GPU-FVCOM has a tremendous potential for massively parallel computation, and may be well suited for and widely applied to high resolution hydrodynamic and mass transport problems. Future work will concentrate on MPI and multiple-GPU hybrid accelerated versions of the GPU-FVCOM.

    [1] Ruetsch G., Fatica M. CUDA Fortran for scientists and engineers: Best practices for efficient CUDA fortran programming [M]. Burlington, USA: Morgan Kaufmann, 2013.

    [2] Wilt N. The CUDA handbook: A comprehensive guide to GPU programming [M]. Boston, USA: Addison-Wesley Professional, 2013.

    [3] Chen T. Q., Zhang Q. H. GPU acceleration of a nonhydrostatic model for the internal solitary waves simulation [J].Journal of Hydrodynamics, 2013, 25(3): 362-369.

    [4] Kirk D. B., Hwu W. M. Programming massively parallel processors: A hands-on approach [M]. Burlington, USA: Morgan Kaufmann, 2012.

    [5] Bailey P., Myre J., Walsh S. D. et al. Accelerating lattice Boltzmann fluid flow simulations using graphics processors [C].The 38th International Conference on Parallel Processing. Vienna, Austria, 2009.

    [6] Krawezik G. P., Poole G. Accelerating the ANSYS direct sparse solver with GPUs [C].Symposium on Application Accelerators in High Performance Computing. Champaign, USA, 2009.

    [7] Huang M., Mielikainen J., Huang B. et al. Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme [J].Geoscientific Model Development, 2015, 8(9): 2977-2990.

    [8] Lacasta A., Morales-Hernández M., Murillo J. et al. An optimized GPU implementation of a 2D free surface simulation model on unstructured meshes [J].Advances in Engineering Software, 2014, 78: 1-15.

    [9] Michalakes J., Vachharajani M. GPU acceleration of numerical weather prediction [J].Parallel Processing Letters, 2008, 18(4): 531-548.

    [10] Horn S. ASAMgpu V1. 0-a moist fully compressible atmospheric model using graphics processing units (GPUs) [J].Geoscientific Model Development, 2011, 4(2): 345-353.

    [11] Xu S., Huang X., Oey L. Y. et al. POM.gpu-v1.0: A GPU-based princeton ocean model [J].Geoscientific Model Development, 2015, 8(9): 2815-2827.

    [12] Chen C., Huang H., Beardsley R. C. et al. A finite volume numerical approach for coastal ocean circulation studies: Comparisons with finite difference models [J].Journal of Geophysical Research: Oceans, 2007, 112(C3): 83-87.

    [13] Keller R., Kramer D., Weiss J. P. Facing the Multicore-Challenge III [M]. Berlin, Heidelberg, Germany: Springer, 2013: 129-130.

    [14] Cowles G. W. Parallelization of the FVCOM coastal ocean model [J].International Journal of High Performance Computing Applications, 2008, 22(2): 177-193.

    [15] Bai X., Wang J., Schwab D. J. et al. Modeling 1993- 2008 climatology of seasonal general circulation and thermal structure in the Great Lakes using FVCOM [J].Ocean Modelling, 2013, 65(1): 40-63.

    [16] Zhang A., Wei E. Delaware River and Bay hydrodynamic simulation with FVCOM [C].Proceedings of the 10th International Conference on Estuarine and Coastal Modeling, ASCE. Newport, USA. 2008.

    [17] Chen C., Huang H., Beardsley R. C. et al. Tidal dynamics in the Gulf of Maine and New England Shelf: An application of FVCOM [J].Journal of Geophysical Research: Oceans, 2011, 116(C12): 12010.

    [18] Liang S. X., Han S. L., Sun Z. C. et al. Lagrangian methods for water transport processes in a long-narrow bay-Xiangshan Bay, China [J].Journal of Hydrodynamics, 2014, 26(4): 558-567.

    [19] Chen Y. Y., Liu Q. Q. Numerical study of hydrodynamic process in Chaohu Lake [J].Journal of Hydrodynamics, 2015, 27(5): 720-729.

    [20] Cook S. CUDA programming: A developer's guide to parallel computing with GPUs [M]. Burlington, USA: Morgan Kaufmann, 2012.

    (Received July 6, 2016, Revised November 17, 2016)

    * Project supported by the National Natural Science Foundation of China (Grant Nos. 51279028, 51479175) and the Public Science and Technology Research Funds Projects of Ocean (Grant No. 201405025).

    Biography: Xu-dong Zhao (1986-), Male, Ph. D. Candidate

    Corresponding author: Shu-xiu Liang, E-mail: sxliang@dlut.edu.cn
