Changes between Version 18 and Version 19 of doc/tec/gpu

Feb 10, 2016 8:30:14 AM (9 years ago)



  • doc/tec/gpu

    v18 v19  
    77* no canopy model
    88* no Lagrangian particle model
     9* random number generator needs to be ported
     10* most of I/O does not work
     11* MG solver does not work / needs to be ported
    1013Tests can be done on our GPU workstation (2*8 core Intel-CPU, 2 NVidia Kepler K40 boards), which runs as a node of the cluster-system at LUIS. Access as follows:
    3134%lopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk parallel pgigpu146
    33 The {{{nocache}}} compiler switch '''is not required any more'''. (Earlier compiler versions, e.g. 13.6 gave a significant loss of performance in case of omitting this switch). The {{{time}}}-switch creates and outputs performance data at the end of a run. Very useful! \\ \\
     36Please note the settings of the cpp-directives ({{{-D__nopointer -D__openacc -D__cuda_fft}}} + CUDA library path in {{{lopts}}}).\\The {{{nocache}}} compiler switch '''is not required any more'''. (Earlier compiler versions, e.g. 13.6, gave a significant loss of performance if this switch was omitted.) The {{{time}}} switch outputs performance data at the end of a run. Very useful! \\ \\
    3538It might be necessary to load the modules manually before calling mbuild or mrun:
    4245export PGI_ACC_NOSYNCQUEUE=1
    44 before calling mrun! The second one is '''absolutely required''' in case of using the CUDA-fft ({{{fft_method='system_specific' + -D__cuda_fft}}}). If it is not used, the pressure solver does not reduce the divergence!
     47before calling mrun! The second one is '''absolutely required''' when using the CUDA-fft ({{{fft_method='system-specific' + -D__cuda_fft}}}). If it is not set, the pressure solver does not reduce the divergence! \\
    4649Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:
    5760In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
     62The workaround worked, e.g. for 14.6, but maybe not for 14.10?
     63 \\
    6065A test parameter-set can be found here:
     69Please note that {{{loop_optimization = 'acc'}}} and {{{psolver = 'poisfft'}}} have to be set. {{{fft_method = 'system-specific'}}} is required to switch on the CUDA-fft. All other fft-methods do not run on the GPU, i.e. they are extremely slow. \\ \\
    65 Here are some hints for running the single-GPU (no-MPI) version:\\
    66 Compiler settings are given in
     71mrun-command to run on two GPU-devices:
    68 .../trunk/SCRIPTS/.mrun.config.imuk_gpu
    69 .../trunk/INSTALL/
     73mrun -d acc_medium -h lcmuk -K "parallel pgigpu146" -X2 -T2 -r "d3#"
    71 Please note settings of cpp-directives ({{{-D__openacc -D__cuda_fft}}} + CUDA library path).\\
    72 Test parameter set:
     75 \\
     77Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:
    74 /home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
    75 /home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
     79%compiler_name     pgf90                                                       lcmuk pgigpu146
     80%compiler_name_ser pgf90                                                       lcmuk pgigpu146
     81%cpp_options       -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  lcmuk pgigpu146
     82%fopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0  lcmuk pgigpu146
     83%lopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk pgigpu146
    77 Please note that {{{loop_optimization = 'acc'}}}, {{{psolver = 'poisfft'}}}, and {{{fft_method = 'system-specific'}}} have to be set. Results of tests are stored in the respective {{{MONITORING}}} directory.
     86Run it with
     88mrun -d acc_medium -K pgigpu146 -r "d3#"
     90 \\
     92Some PGI-compiler options for debugging:
     94-O0 -C -g -Mbounds -Mchkstk -traceback
     96 \\
    7998'''Report on current activities:'''
    120 '''work packages fpr the EuroHack:'''
     139'''work packages / questions for the EuroHack:'''
    122141* getting the CUDA-aware MPI to run: for this, the routines {{{time_integration}}} and {{{exchange_horiz}}} in r1749 have to be replaced by the routines that I provided. Once the exchange of ghost points works sufficiently well, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} directives in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} directives. Also, the {{{update host}}} and {{{update device}}} directives for array {{{ar}}} have to be removed in {{{poisfft}}}.
    128147* Routine {{{surface_layer_fluxes}}}: there are some loops (DO WHILE, DO without specific loop counter, etc.) which cannot be vectorized
    129148* Routine {{{swap_timelevel}}}: Why can the compiler not vectorize Fortran array assignments like {{{u = u_p}}}?
    130 * Routine {{{timestep}}}: Is there a chance that the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, are directly supported on the GPU?
     149* Routine {{{timestep}}}: the Fortran functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, do not run on the GPU. Is there a better way to implement these functions than the OpenACC workaround that is programmed in routine {{{timestep}}}?
     150* General question: Is it by now possible to run code in parallel on the CPU and the GPU at the same time, or, in other words, to run something "in the background" on the GPU?
    132 '''Things that still need to be ported:'''
    133 * multigrid-solver
    134 * cloud physics
    135 * stuff related with non-cyclic BC
    136 * random-number generator
    137 * the complete LPM
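
Regarding the {{{MINLOC}}}/{{{MAXLOC}}} question in the EuroHack work-package list: one common pattern is to split the search into two OpenACC reductions, first for the extreme value and then for the smallest index that holds it. The following is only a minimal sketch in C rather than in the Fortran of the model, and it is not claimed to match the workaround already programmed in {{{timestep}}}. The function name {{{min_location}}} and the flat 1D array are assumptions made for illustration.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper (not part of the model code): returns the index of
 * the smallest element of a[0..n-1], taking the first one in case of ties.
 * Two reductions replace MINLOC: pass 1 finds the minimum value, pass 2
 * finds the smallest index at which that value occurs.
 * A compiler without OpenACC support ignores the pragmas, so the same
 * loops then simply run sequentially on the CPU. */
int min_location(const double *a, int n)
{
    assert(n > 0);              /* an empty array has no minimum location */

    double vmin = DBL_MAX;      /* identity element of the min reduction  */
    int    imin = n;            /* larger than any valid index            */

    /* Pass 1: reduce to the minimum value. */
    #pragma acc parallel loop copyin(a[0:n]) reduction(min:vmin)
    for (int i = 0; i < n; i++) {
        if (a[i] < vmin) vmin = a[i];
    }

    /* Pass 2: reduce to the smallest index whose value equals vmin. */
    #pragma acc parallel loop copyin(a[0:n]) reduction(min:imin)
    for (int i = 0; i < n; i++) {
        if (a[i] == vmin && i < imin) imin = i;
    }

    return imin;
}
```

A {{{MAXLOC}}} analogue would follow the same scheme with {{{reduction(max:...)}}} in the first pass. In the sketch each pass copies the array to the device again; inside an enclosing data region a {{{present}}} clause would presumably replace the {{{copyin}}}.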