Changes between Version 3 and Version 4 of doc/tec/gpu


Timestamp:
Mar 9, 2013 1:27:18 AM
Author:
raasch
Comment:

--

  • doc/tec/gpu

    v3 v4  
    1 == Porting the code to NVidia GPU using the OpenACC programming model
     1== Porting the code to NVidia GPU using the openACC programming model
     2
     3Currently, PALM-GPU usage has following restrictions / requirements:
     4* 2d domain decomposition (or 1PE, single-core)
     5* cyclic lateral boundary conditions
     6* no humidity / cloud physics
     7* no topography
     8* no Lagrangian particle model
    29
    310Tests can be done on host {{{inferno}}} only, using the PGI Fortran compiler. Required settings:
     
    1219.../trunk/SCRIPTS/.mrun.config.imuk_gpu
    1320}}}
    14 Please note settings of cpp-directives.\\
     21Please note the settings of the cpp directives ({{{-D__openacc -D__cuda_fft}}}) and the CUDA library path.\\
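As an illustration only (a generic sketch, not taken from the PALM sources; program, array, and loop bounds are made up), the {{{__openacc}}} symbol typically guards the OpenACC code path so that the same source still compiles and runs without GPU support; {{{__cuda_fft}}} analogously enables the CUDA-based FFT.
{{{
!-- Hypothetical sketch: the GPU path is only compiled when -D__openacc is set
!-- (e.g. pgf90 -acc -Mpreprocess); without it, the plain CPU loop is used.
PROGRAM acc_guard_demo

   IMPLICIT NONE

   INTEGER             ::  k
   REAL, DIMENSION(64) ::  a

   a = 0.0

#if defined( __openacc )
!-- GPU path: the loop is offloaded as an OpenACC kernels region
   !$acc kernels copy( a )
   DO  k = 1, 64
      a(k) = a(k) + 1.0
   ENDDO
   !$acc end kernels
#else
!-- CPU fallback, compiled when -D__openacc is not given
   DO  k = 1, 64
      a(k) = a(k) + 1.0
   ENDDO
#endif

   PRINT*, 'sum of a = ', SUM( a )

END PROGRAM acc_guard_demo
}}}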
    1522Test parameter set:
    1623{{{
    1724/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
    1825}}}
    19 Please note that {{{loop_optimization = 'acc'}}} has to be set. Results of tests are stored in the respective {{{MONITORING}}} directory.
     26Please note that {{{loop_optimization = 'acc'}}} and {{{fft_method = 'system-specific'}}} have to be set. Results of tests are stored in the respective {{{MONITORING}}} directory.
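For illustration, the relevant lines of such a parameter file might look as follows (a sketch only, not the content of the {{{gputest_p3d}}} file above; the grid sizes are placeholders, and it is assumed here that both parameters belong to the {{{&inipar}}} namelist):
{{{
 &inipar  nx = 511, ny = 511, nz = 64,
          loop_optimization = 'acc',
          fft_method        = 'system-specific',
 /
}}}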
    2027
    2128'''Report on current activities:'''
     
    2835Measurements with the Intel compiler on {{{inferno}}} still have to be carried out.
    2936
    30 '''Results:''' \\
    31 .6   pgf90 without any acc kernels \\
    32 .31  last acc version \\
    33 .32  ifort (on bora) using acc-branch \\
    34 .34  ifort (on bora) using vector-branch \\\\
     37r1111 \\
     38The pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are some calculations in {{{pres}}}. \\
     39A CUDA FFT has been implemented. \\
     40The GPU can also be used in the single-core (non-MPI-parallel) version.
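The general pattern behind such a port can be illustrated with the following generic sketch (not code from PALM; a simple Jacobi iteration stands in for the actual solver): all arrays are kept inside one OpenACC data region, so that no host/device transfers occur between the individual kernels and iterations.
{{{
!-- Generic sketch: two GPU kernels per iteration, the fields p and p_new stay
!-- on the device for the whole iteration loop.
PROGRAM data_region_demo

   IMPLICIT NONE

   INTEGER ::  i, it, j
   REAL, DIMENSION(0:63,0:63) ::  p, p_new

   p = 0.0
   p(0,:) = 1.0       ! fixed boundary value

   !$acc data copy( p ) create( p_new )
   DO  it = 1, 100
!--   Jacobi sweep, executed as one GPU kernel
      !$acc parallel loop collapse( 2 ) present( p, p_new )
      DO  j = 1, 62
         DO  i = 1, 62
            p_new(i,j) = 0.25 * ( p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1) )
         ENDDO
      ENDDO
!--   Copy the interior back into p, again on the device only
      !$acc parallel loop collapse( 2 ) present( p, p_new )
      DO  j = 1, 62
         DO  i = 1, 62
            p(i,j) = p_new(i,j)
         ENDDO
      ENDDO
   ENDDO
   !$acc end data

   PRINT*, 'p(32,32) = ', p(32,32)

END PROGRAM data_region_demo
}}}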
     41
     42'''Results for a 512x512x64 grid (time in µs per grid point and time step):''' \\
     43||.1 ||2*Tesla, quadcore, pgi             ||0.32053 ||
     44||.2 ||1*Tesla, single-core (no MPI), pgi ||0.54789 ||
     45||.3 ||quadcore, pgi                      ||0.78343 ||
     46||.4 ||quadcore, intel (on bora, cache-v) ||0.82395 ||
    3547
    3648'''Next steps:'''
    3749
    38 * porting the Poisson solver following Klaus' suggestions (there is still a bug in his last version), implement fast tridiagonal solver for GPU
    39 * creating a single core version (without using MPI, so that host-device transfer is minimized)
    40 * testing the PGI 12.6 compiler version, porting of flow_statistics if reduction is implemented, check the capability of parallel regions
    41 * update ghost boundaries only, overlapping of update/MPI and computation?
    42 * overlapping communication
     50* testing the newest PGI 13.2 compiler version, porting reduction operations (especially in {{{flow_statistics}}}), checking the capability of parallel regions
     51* updating ghost boundaries only, overlapping of update/MPI and computation (see the sketch below this list)
     52* removing the host/device data transfer in the single-core version, which is still required for the cyclic boundary conditions, in order to run the code completely on one GPU
     53* overlapping communication in the pressure solver (alltoall operations)
    4354* porting of remaining things (averaging, I/O, etc.)
    4455* ...
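A rough illustration of the ghost-boundary item above (a hypothetical sketch; array name, index bounds, and the MPI exchange, which is only indicated by comments, are made up): only the outermost planes of a subdomain are moved between device and host with {{{!$acc update}}}, while the interior can be computed asynchronously on the GPU in the meantime.
{{{
!-- Hypothetical sketch of overlapping the ghost-plane transfer/exchange with
!-- the computation of the interior.
PROGRAM ghost_update_demo

   IMPLICIT NONE

   INTEGER, PARAMETER ::  nx = 63, ny = 63, nz = 63
   INTEGER            ::  i, j, k
   REAL, DIMENSION(0:nz,-1:ny+1,-1:nx+1) ::  u

   u = 1.0

   !$acc data copy( u )

!-- Start the computation of the interior asynchronously on the GPU
   !$acc parallel loop collapse( 3 ) async( 1 )
   DO  i = 1, nx-1
      DO  j = 1, ny-1
         DO  k = 1, nz
            u(k,j,i) = u(k,j,i) + 0.1
         ENDDO
      ENDDO
   ENDDO

!-- Meanwhile, transfer only the outermost planes to the host ...
   !$acc update host( u(:,:,0), u(:,:,nx) )
!-- ... exchange them with the neighbouring PEs (MPI_SENDRECV etc.) ...
!-- ... and move the received ghost planes back to the device.
   !$acc update device( u(:,:,-1), u(:,:,nx+1) )

!-- Wait for the interior kernel before the field is used any further
   !$acc wait( 1 )

   !$acc end data

   PRINT*, 'u(1,1,1) = ', u(1,1,1)

END PROGRAM ghost_update_demo
}}}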