Porting the code to NVIDIA GPU using the OpenACC programming model
Currently, PALM-GPU usage has the following restrictions / requirements:
- 2d domain decomposition (or 1PE, single-core)
- cyclic lateral boundary conditions
- no humidity / cloud physics
- no canopy model
- no Lagrangian particle model
Tests can be done on our GPU workstation (two 8-core Intel CPUs, two NVIDIA Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access it as follows:
ssh -X <your cluster username>@login.cluster.uni-hannover.de
After login, switch to our GPU node by starting an interactive job:
qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk
The session uses two Intel CPU cores for 10000 seconds. You will need both cores if you want to run PALM on both K40 boards, i.e. one MPI rank per board (see the sketch below).
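A minimal sketch of how each MPI rank could select its own board via the OpenACC runtime API before entering any accelerator region (an illustration only, not the PALM start-up code; the PGI runtime numbers the devices from 0):

!-- Sketch only: bind each of the two MPI ranks to one of the two K40 boards
PROGRAM gpu_binding_sketch
   USE openacc
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER ::  ierr, myid, num_devices

   CALL MPI_INIT( ierr )
   CALL MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
!--    number of NVIDIA devices visible on this node (2 on the GPU node)
   num_devices = acc_get_num_devices( acc_device_nvidia )
!--    rank 0 uses board 0, rank 1 uses board 1
   CALL acc_set_device_num( MOD( myid, num_devices ), acc_device_nvidia )
   CALL MPI_FINALIZE( ierr )
END PROGRAM gpu_binding_sketch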
Tests can be done on host inferno only, using the PGI Fortran compiler. Required settings:
module load pgi-compiler/2013-136
Compiler settings are given in:
- .../trunk/SCRIPTS/.mrun.config.imuk_gpu
- .../trunk/INSTALL/MAKE.inc.pgi.openacc
Please note the settings of the cpp directives (-D__openacc and -D__cuda_fft) and of the CUDA library path.
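The __openacc switch is evaluated by the preprocessor, i.e. without -D__openacc the accelerator code path is not compiled at all; __cuda_fft selects the cuFFT-based transform in the same way. A purely illustrative sketch of such a guard (hypothetical routine, not actual PALM code):

SUBROUTINE guarded_update_sketch( u, n )
   IMPLICIT NONE
   INTEGER            ::  n
   REAL, DIMENSION(n) ::  u

#if defined( __openacc )
!$acc kernels copy( u )
#endif
!--    with -D__openacc this assignment is offloaded to the GPU,
!--    otherwise it compiles as ordinary CPU code
   u = u + 1.0
#if defined( __openacc )
!$acc end kernels
#endif
END SUBROUTINE guarded_update_sketch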
Test parameter sets:
- /home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
- /home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
Please note that loop_optimization = 'acc', psolver = 'poisfft', and fft_method = 'system-specific' have to be set. Results of tests are stored in the respective MONITORING directory.
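For orientation, this is how these settings would look in the _p3d parameter file (excerpt only; assuming the usual &inipar namelist, with all other required parameters such as the grid setup omitted):

&inipar  loop_optimization = 'acc',
         psolver           = 'poisfft',
         fft_method        = 'system-specific', /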
Report on current activities:
r1015
- prognostic equations have been ported (partly: q and sa are still missing), as well as prandtl_fluxes and diffusivities
- additional ..._acc versions of the tendency subroutines have been created (see the sketch after this list)
- statistics are not ported at all
- speedup seems to be similar to what has been reported by Klaus Ketelsen
- measurements with the Intel compiler on inferno still have to be carried out
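To give an idea of what the ported prognostic/tendency code essentially does on the device, here is a strongly simplified sketch (hypothetical routine; only the index names follow PALM conventions, the real ..._acc routines contain the full tendency terms):

SUBROUTINE prognostic_step_sketch( pt, tend, dt, nzb, nzt, nys, nyn, nxl, nxr )
   IMPLICIT NONE
   INTEGER ::  i, j, k, nxl, nxr, nys, nyn, nzb, nzt
   REAL    ::  dt
   REAL, DIMENSION(nzb:nzt,nys:nyn,nxl:nxr) ::  pt, tend

!-- arrays are assumed to be resident on the device (present clause)
!$acc kernels present( pt, tend )
!$acc loop independent
   DO  i = nxl, nxr
!$acc loop independent
      DO  j = nys, nyn
!$acc loop independent
         DO  k = nzb+1, nzt
            pt(k,j,i) = pt(k,j,i) + dt * tend(k,j,i)
         ENDDO
      ENDDO
   ENDDO
!$acc end kernels
END SUBROUTINE prognostic_step_sketch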
r1111
- the pressure solver (including the tridiagonal solver) has been almost completely ported; some calculations in pres are still missing
- the CUDA FFT has been implemented (see the interface sketch after this list)
- the GPU can also be used in the single-core (non-MPI-parallel) version
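The CUDA FFT is called from Fortran through C interoperability, operating directly on GPU-resident data. A rough sketch of what such a binding can look like (an assumption for illustration only, not the interface actually used in the code; plan creation and error handling are omitted):

MODULE cuda_fft_sketch

   USE ISO_C_BINDING
   IMPLICIT NONE

   INTERFACE
!--   double precision real-to-complex transform of the cuFFT library
      FUNCTION cufftExecD2Z( plan, idata, odata )  BIND( C, NAME='cufftExecD2Z' )
         USE ISO_C_BINDING
         INTEGER(C_INT)                          ::  cufftExecD2Z
         INTEGER(C_INT), VALUE                   ::  plan
         REAL(C_DOUBLE), DIMENSION(*)            ::  idata
         COMPLEX(C_DOUBLE_COMPLEX), DIMENSION(*) ::  odata
      END FUNCTION cufftExecD2Z
   END INTERFACE

CONTAINS

   SUBROUTINE fft_forward_on_device( plan, n, ar_in, ar_out )
!--   transforms data that is already resident on the GPU; host_data passes
!--   the device addresses of ar_in / ar_out to cuFFT
      INTEGER(C_INT)            ::  plan
      INTEGER                   ::  istat, n
      REAL(C_DOUBLE)            ::  ar_in(n)
      COMPLEX(C_DOUBLE_COMPLEX) ::  ar_out(n/2+1)

!$acc host_data use_device( ar_in, ar_out )
      istat = cufftExecD2Z( plan, ar_in, ar_out )   ! istat receives the cufftResult code
!$acc end host_data
   END SUBROUTINE fft_forward_on_device

END MODULE cuda_fft_sketch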
r1113
- in single-core mode, the lateral boundary conditions run completely on the device; most loops in pres have been ported; vertical boundary conditions (boundary_conds) have been ported
r1221
- reduction operations in pres and flow_statistics have been ported (see the sketch below)
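These reductions are domain-wide sums, maxima etc. A minimal sketch of such an operation with OpenACC (hypothetical routine; only the index names follow PALM conventions):

FUNCTION volume_average_sketch( var, nzb, nzt, nys, nyn, nxl, nxr )  RESULT( av )
   IMPLICIT NONE
   INTEGER ::  i, j, k, nxl, nxr, nys, nyn, nzb, nzt
   REAL    ::  av, s
   REAL, DIMENSION(nzb:nzt,nys:nyn,nxl:nxr) ::  var

   s = 0.0
!-- every gang/vector lane accumulates a private copy of s, which is
!-- combined into the final sum at the end of the region
!$acc parallel loop collapse(3) reduction(+:s) present( var )
   DO  i = nxl, nxr
      DO  j = nys, nyn
         DO  k = nzb+1, nzt
            s = s + var(k,j,i)
         ENDDO
      ENDDO
   ENDDO

   av = s / REAL( ( nxr - nxl + 1 ) * ( nyn - nys + 1 ) * ( nzt - nzb ) )
END FUNCTION volume_average_sketch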
Results for 256x256x64 grid (time in micro-s per gridpoint and timestep):
.1 | 1*Tesla, single-core (no MPI), pgi13.6 | 0.33342 | r1221 |
.2 | single-core (no MPI), pgi13.6 (cache, Temperton) | 2.34144 | r1221 |
The initialization time of the GPU (power-up) can be avoided by running /muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit in the background.
For the current PGI compiler version 13.6, use "-ta=nocache" and set the environment variable PGI_ACC_SYNCHRONOUS=1 (e.g. export PGI_ACC_SYNCHRONOUS=1). Otherwise there will be a significant loss in performance (a factor of two!).
Next steps:
- porting of MIN/MAXLOC operations in timestep, porting of disturb_fields (requires implementation of a parallel random number generator)
- check the capability of parallel regions (can IF-constructs be removed from inner loops?)
- for MPI mode, update ghost boundaries only; overlap the update/MPI transfer with computation
- overlapping communication in pressure solver (alltoall operations)
- porting of remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
- ...