'''Results:''' \\
0.6 pgf90 without any acc kernels \\
0.31 last acc version \\
0.32 ifort (on bora) using acc-branch \\
0.34 ifort (on bora) using vector-branch \\\\
r1111 \\
The pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are some calculations in pres. \\
The CUDA FFT has been implemented. \\
The GPU can also be used in the single-core (non-MPI-parallel) version.

'''Results for a 512x512x64 grid (time in µs per gridpoint and timestep):''' \\
||1 ||2*Tesla, quadcore, pgi ||0.32053 ||
||2 ||1*Tesla, single-core (no MPI), pgi ||0.54789 ||
||3 ||quadcore, pgi ||0.78343 ||
||4 ||quadcore, intel (on bora, cache-v) ||0.82395 ||
 * porting the Poisson solver following Klaus' suggestions (there is still a bug in his last version); implementing a fast tridiagonal solver for the GPU
 * creating a single-core version (without MPI, so that host-device transfer is minimized)
 * testing the PGI 12.6 compiler version; porting flow_statistics once reduction is implemented; checking the capabilities of parallel regions
 * updating ghost boundaries only; overlapping of update/MPI and computation?
 * overlapping communication
 * testing the newest PGI 13.2 compiler version; porting reduction operations (especially in flow_statistics); checking the capabilities of parallel regions
 * updating ghost boundaries only; overlapping of update/MPI and computation
 * removing the host/device data transfer in the single-core version (still required for the cyclic boundary conditions), so that the code runs completely on one GPU
 * overlapping communication in the pressure solver (alltoall operations)