Parallelization and optimization
The parallelization of the code is achieved by a 2D domain decomposition along the x and y directions with equally sized subdomains. The method has remained essentially unchanged since the formulation given by Raasch and Schröter (2001). In the following we will show that this method still allows for sufficient scalability on up to 50 000 processor cores (also referred to as processor elements, PEs). Ghost layers are added at the side boundaries of the subdomains in order to account for the local data dependencies, which are caused by the need to compute finite differences at these positions. The number of ghost layers used in PALM depends on the order of the advection scheme: three layers for the 5th-order Wicker–Skamarock scheme and one layer for the 2nd-order Piacsek–Williams scheme. Ghost layer data are exchanged after every time step. An anticyclic index order (i.e., (k, j, i)) is chosen for the 3D arrays in order to speed up the data exchange. The anticyclic order guarantees that the ghost layer data are stored as long consecutive blocks in memory, which allows them to be accessed in the fastest possible way.
The solution of the Poisson equation is complicated by the 2D decomposition, because non-local data dependencies appear in all three directions if the equation is solved with the FFT method (see Sect. pressure solver). Due to the domain decomposition, the processor elements cannot perform standard FFTs along the x or y direction, as their memory contains only a part of the full data. The general method to overcome this problem is to reorder the 3D pressure/divergence data among the PEs using a transposition technique described in Raasch and Schröter (2001). The transposition is done using the MPI routine MPI_ALLTOALL and requires an additional resorting of the data in the local memory of the PEs before and after MPI_ALLTOALL is called. A similar method, with MPI_ALLTOALL replaced by MPI_SENDRECV, was presented by Sullivan and Patton (2011).
Only local data dependencies appear if the Poisson equation is solved with the multigrid scheme. However, this method requires a frequent exchange of ghost layers during every iteration step of the SOR solver, as well as for the restriction and prolongation steps. The amount of ghost layer data rapidly decreases for the coarser grid levels, so that the MPI transfer time may become latency bound. The domain decomposition limits the coarsening of grid levels at the point where the subdomain array of the current grid level contains only a single grid point along one of the spatial directions. If the number of grid points of the total (coarsened) domain allows further coarsening, array data from all subdomains can be gathered and further processed on the main PE (hereafter PE 0), with the results redistributed to the other PEs in the respective prolongation step. However, this method is very inefficient and is not used by default. Instead, coarsening is simply stopped at the level where the subdomains contain only two grid points along at least one of the three spatial directions. The precision of the multigrid method depends on the iteration count. Using two W-cycles and two SOR iterations for each grid level typically reduces the velocity divergence by about four orders of magnitude, which has turned out to be sufficient in most of our applications. With these settings, and for a numerical grid of about 2000^{3} grid points, the multigrid method requires about the same time as the FFT Poisson solver, and for larger grids it is even faster.
The scaling behavior of PALM 4.0 is presented in Fig. 15a for a test case with 2160^{3} grid points and the FFT Poisson solver. Tests have been performed on the Cray XC40 of the North-German Computing Alliance (HLRN). The machine has 1128 compute nodes, each equipped with two 12-core Intel Haswell CPUs, plus 744 compute nodes equipped with two 12-core Intel Ivy Bridge CPUs, and an Aries interconnect. Additionally, runs with 4320^{3} grid points have been carried out with up to 43200 cores, starting with a minimum of 11520 cores (see Fig. 15b). Runs with fewer cores could not be carried out, as the data would not have fit into the memory.
Figure 15: Scalability of PALM 4.0 on the Cray XC30 supercomputer of HLRN. Simulations were performed with a computational grid of (a) 2160^{3} and (b) 4320^{3} grid points (Intel Ivy Bridge CPUs). Panel (a) shows data for up to 11520 PEs with cache (red lines) and vector (blue lines) optimization, and with overlapping of computation and communication (FFT and tridiagonal equation solver, see this section) enabled (dashed green lines). Measurement data are shown for the total CPU time (crosses), the prognostic equations (circles), and the pressure solver (boxes). Panel (b) shows data for up to 43200 PEs with both cache optimization and overlapping enabled. Measurement data are shown for the total CPU time (gray line), the pressure solver (blue line), the prognostic equations (red line), as well as the MPI calls MPI_ALLTOALL (brown line) and MPI_SENDRECV (purple line).
Ideally, for a so-called strong scaling test, where the same setup is run on different numbers of cores, the wall-clock time of a run should decrease by a factor of two if the core number is doubled, as shown in Fig. 15a and b (black lines). Figure 15b shows that the code scales very well up to 20000 cores and still acceptably for larger numbers (gray line). The decreasing scalability for larger core numbers is mainly caused by a performance drop of the MPI_ALLTOALL routine (brown line). In contrast, the pure computational part, i.e., the calculation of the prognostic equations (red line), scales perfectly up to the maximum number of cores.
While the general parallelization methods used in version 4.0 do not differ from those in the first version, a large number of code optimizations have been carried out since then. Only the most important ones shall be briefly discussed at this point: the scalar optimization for different processor types, and the overlapping of computation and inter-processor communication.
The original PALM code calculated the different contributions to the tendency terms (i.e., advection, buoyancy, diffusion, etc.) and the final prognostic equation for each prognostic quantity in separate 3D loops over the three spatial directions. Large 3D arrays do not fit into the cache of cache-based processors such as Intel Xeon or AMD Athlon, so the array data have to be reloaded from main memory for each 3D loop, which is extremely time consuming. For this reason, the outer loops over i and j have been extracted from each 3D loop and now form a single 2D loop that encloses the computation of all tendencies and prognostic equations.
In this way, array data used in the first loop can be reused from the cache by the following loops, since the 1D data columns along k are usually small enough to fit completely into the cache. Figure 15a shows that this loop structure gives a performance gain of 40 % for the computation of the prognostic equations compared with the 3D loop structure. The overall performance of the code improves by about 15 %. Nonetheless, both methods are implemented in the code in separate branches, since the 3D loops give a much better performance on vector-based hardware such as the NEC SX or accelerator boards (e.g., Nvidia K20), because they allow the compilers to generate much longer vectors than the single loops along the z direction do.
From Fig. 15b it is evident that for large setups with huge numbers of grid points and more than a few thousand PEs, the solution of the Poisson equation dominates the time consumption of the simulation. This is because the FFT and the data transpositions with MPI_ALLTOALL scale less well than the other parts of the code. The FFT time increases nonlinearly with N log(N), where N is the total number of grid points along x or y. The MPI_ALLTOALL time also increases nonlinearly with the number of cores. While the scaling problem is not easy to address, the Poisson solver can be sped up by overlapping the computation (FFT and tridiagonal equation solver) and the communication among the PEs (MPI_ALLTOALL). PALM solves the Poisson equation in several steps in a sequential order. So far, the complete 3D subdomain array was transposed first, followed by the FFT along x, followed by the next transposition, followed by the FFT along y, etc. The FFT calculation cannot start unless the complete 3D array has been transposed. Now, in PALM 4.0, the 3D arrays are processed in 2D slices (e.g., in z–x slices for the transposition from z to x). The slices are processed in a loop along the remaining direction (y in this case) with alternating transfer (MPI_ALLTOALL of a 2D slice) and computation (FFT of this slice). This allows some overlapping of the computation and communication parts for the following reason: on the Cray XC30 system mentioned above, for example, every compute node is populated with two processor dies, each containing an Intel CPU with 12 cores. This allows 24 MPI processes on the node. However, every node is equipped with only two MPI channels. If a data transfer is issued on all MPI processes simultaneously, the transfer cannot be done totally in parallel, because individual MPI processes have to wait for a free channel. This behavior allows computation and transfer to proceed in parallel.
For example, if PE 0 is the first to get a free MPI channel, it can start computation as soon as the transfer has been finished. All other PEs consecutively start computation after transfer. When the last transfer is finished, PE 0 has already finished computation and can immediately start the next transfer. The fact that not all PEs have simultaneous access to an MPI channel allows for parallel transfer and computation without any extra programming effort, such as asynchronous MPI_ALLTOALL or doing the transfer using hyperthreads.
Breaking up the workload as described above also improves performance due to better cache utilization, because the transposed data are still in the cache when needed by the FFT. The difference in performance between the sequential and the overlapping method is displayed in Fig. 15a. The FFT Poisson solver is sped up by about 15 % for up to 2500 cores when overlapping is used. For higher core numbers, the current realization of overlapping becomes inefficient because the data chunks handled by MPI_ALLTOALL get too small; the latency then dominates the transfer time. In order to overcome this problem, several 2D slices can be transposed at the same time. We plan to implement this technique in one of the next PALM revisions.
Besides the parallelization by domain decomposition, PALM is also fully parallelized on loop level using the shared-memory OpenMP programming model and can be run in a so-called hybrid mode, e.g., with two MPI processes and 12 OpenMP threads per MPI process started on each node.
A typical PALM setup uses a 2D domain decomposition with one MPI process on every processor core. The hybrid mode normally does not give any advantage, because the OpenMP parallelization introduces another synchronization level, so that the total computation time does not decrease (typically it even increases by a few percent). However, in some special cases hybrid parallelization may have advantages.
Since the speedup behavior depends on many factors like the problem size, the virtual 2D processor grid, the network, etc., the actual speedup is difficult to predict and should be tested individually for every setup.
The data exchange in the case of coupled ocean–atmosphere runs is realized with MPI. Two methods are available. The first one, based on MPI-1, splits the default communicator (MPI_COMM_WORLD) into two parts, with the respective numbers of PEs assigned to the atmosphere and ocean parts as given by external parameters. The second method starts the respective numbers of MPI tasks for the atmosphere and the ocean independently (e.g., by two calls of mpiexec) and uses the MPI-2 routines MPI_COMM_CONNECT and MPI_COMM_ACCEPT to couple them.
References
 Raasch S, Schröter M. 2001. PALM – a large-eddy simulation model performing on massively parallel computers. Meteorol. Z. 10: 363–372.
 Sullivan PP, Patton EG. 2011. The effect of mesh resolution on convective boundary layer statistics and structures generated by large-eddy simulation. J. Atmos. Sci. 68: 2395–2415.