= Parallelization and optimization =

The parallelization of the code is achieved by a 2-D domain decomposition method along the ''x'' and ''y'' direction with equally sized subdomains. The method has not been changed in general since the formulation given by [#raasch2001 Raasch and Schröter (2001)]. In the following we will show that this method still allows for sufficient scalability on up to 50,000 processor cores (also referred to as processor elements, PEs).

Ghost layers are added at the side boundaries of the subdomains in order to account for the local data dependencies that are caused by the need to compute finite differences at these positions. The number of ghost layers used in PALM depends on the order of the advection scheme: three layers for the 5th-order Wicker-Skamarock scheme and one layer for the 2nd-order Piacsek-Williams scheme. Ghost layer data are exchanged after every time step. An anti-cyclic index order (i.e., '''''(k, j, i)''''') is chosen for the 3-D arrays in order to speed up the data exchange. The anti-cyclic order guarantees that the ghost layer data are stored as long consecutive blocks in memory, which allows for the fastest possible access.

The solution of the Poisson equation is complicated by the 2-D decomposition, because non-local data dependencies appear in all three directions if the equation is solved with the FFT method (see Sect. [wiki:/doc/tec/pres pressure solver]). Due to the domain decomposition, the processor elements cannot perform standard FFTs along the ''x'' or ''y'' direction, as their memory contains only a part of the full data. The general method to overcome this problem is to re-order the 3-D pressure/divergence data among the PEs using a transposition technique described in [#raasch2001 Raasch and Schröter (2001)]. The transposition is done using the MPI routine '''''MPI_ALLTOALL''''' and requires an additional re-sorting of the data in the local memory of the PEs before and after '''''MPI_ALLTOALL''''' is called. A similar method, with '''''MPI_ALLTOALL''''' replaced by '''''MPI_SENDRECV''''', has recently been presented by [#sullivan2011 Sullivan and Patton (2011)].
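To illustrate the principle, a strongly simplified Fortran sketch of such a ''z''-to-''x'' transposition is given below. It is a hypothetical example only: the array sizes, the 1-D communicator '''''comm1dx''''', and all routine and variable names are assumptions made for this sketch and do not reproduce the original PALM code.

{{{
#!Latex
\begin{verbatim}
MODULE transpose_sketch
!-- Minimal, hypothetical sketch of the z -> x transposition carried out
!-- before the FFT along x. All names and sizes are illustrative only.
   USE mpi
   IMPLICIT NONE

   INTEGER, PARAMETER ::  npex = 4                    ! PEs along x in the virtual processor grid
   INTEGER, PARAMETER ::  nx = 64, ny_l = 16, nz = 64
   INTEGER, PARAMETER ::  nx_l = nx / npex, nz_l = nz / npex

   INTEGER ::  comm1dx   ! 1-D communicator along x, assumed to be created
                         ! elsewhere (e.g., via MPI_CART_SUB)

CONTAINS

   SUBROUTINE transpose_zx( f_in, f_out )

      REAL, INTENT(IN)  ::  f_in(nz,ny_l,nx_l)    ! (k,j,i) order, z complete on each PE
      REAL, INTENT(OUT) ::  f_out(nx,ny_l,nz_l)   ! x complete after the transposition
                                                  ! (x first, so the FFT works on contiguous data)
      REAL    ::  sendbuf(nz_l,ny_l,nx_l,0:npex-1)
      REAL    ::  recvbuf(nz_l,ny_l,nx_l,0:npex-1)
      INTEGER ::  i, j, k, n, ierr

!--   Re-sort in local memory: block n collects the nz_l vertical levels
!--   destined for PE n of the x row
      DO  n = 0, npex-1
         DO  i = 1, nx_l
            DO  j = 1, ny_l
               DO  k = 1, nz_l
                  sendbuf(k,j,i,n) = f_in(n*nz_l+k,j,i)
               ENDDO
            ENDDO
         ENDDO
      ENDDO

!--   Exchange equally sized blocks among all PEs of the x row
      CALL MPI_ALLTOALL( sendbuf, nz_l*ny_l*nx_l, MPI_REAL,                    &
                         recvbuf, nz_l*ny_l*nx_l, MPI_REAL, comm1dx, ierr )

!--   Re-sort again: block n now holds the x segment that resided on PE n
      DO  n = 0, npex-1
         DO  k = 1, nz_l
            DO  j = 1, ny_l
               DO  i = 1, nx_l
                  f_out(n*nx_l+i,j,k) = recvbuf(k,j,i,n)
               ENDDO
            ENDDO
         ENDDO
      ENDDO

   END SUBROUTINE transpose_zx

END MODULE transpose_sketch
\end{verbatim}
}}}

After this step, full rows along ''x'' reside in contiguous memory on each PE, so that standard 1-D FFTs can be applied; the inverse transposition reverses the re-sorting and the '''''MPI_ALLTOALL''''' call.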
Only local data dependencies appear if the Poisson equation is solved with the multigrid scheme. However, this method requires a frequent exchange of ghost layers during every iteration step of the SOR solver, as well as for the restriction and prolongation steps. The amount of ghost layer data rapidly decreases for the coarser grid levels, so that the MPI transfer time may become latency bound. The domain decomposition affects the coarsening of grid levels at the point when the subdomain array of the current grid level contains only a single grid point along one of the spatial directions. If the number of grid points of the total (coarsened) domain allows further coarsening, array data from all subdomains can be gathered and further processed on the main PE (hereafter PE 0), and the results are redistributed to the other PEs in the respective prolongation step. However, this method is very inefficient and not used by default. Instead, coarsening is simply stopped at the level where subdomains contain only two grid points along at least one of the three spatial directions. The precision of the multigrid method depends on the iteration count. Using two W-cycles and two SOR iterations for each grid level typically reduces the velocity divergence by about four orders of magnitude, which has turned out to be sufficient in most of our applications. With these settings, and for a numerical grid of about ''2000^3^'' grid points, the multigrid method requires about the same time as the FFT Poisson solver, and for larger grids it is even faster.

The scaling behavior of PALM 4.0 is presented in Fig. 15a for a test case with ''2160^3^'' grid points and the FFT Poisson solver. Tests have been performed on the Cray XC40 of the North-German Computing Alliance (HLRN). The machine has ''1128'' compute nodes, each equipped with two 12-core Intel Haswell CPUs, plus 744 compute nodes equipped with two 12-core Intel Ivy Bridge CPUs, and an Aries interconnect. Additionally, runs with ''4320^3^'' grid points have been carried out with up to ''43200'' cores, starting with a minimum of ''11520'' cores (see Fig. 15b). Runs with fewer cores could not be carried out, as the data would not have fit into the memory.

[[Image(12.png, 600px, border=1)]]

Figure 15: Scalability of PALM 4.0 on the Cray XC30 supercomputer of HLRN. Simulations were performed with a computational grid of '''(a)''' ''2160^3^'' and '''(b)''' ''4320^3^'' grid points (Intel Ivy Bridge CPUs). '''(a)''' shows data for up to ''11520'' PEs with cache (red lines) and vector (blue lines) optimization, and with overlapping of computation and communication (FFT and tri-diagonal equation solver, see this section) enabled (dashed green lines). Measurement data are shown for the total CPU time (crosses), the prognostic equations (circles), and the pressure solver (boxes). '''(b)''' shows data for up to 43200 PEs with both cache optimization and overlapping enabled. Measurement data are shown for the total CPU time (gray line), the pressure solver (blue line), the prognostic equations (red line), as well as the MPI calls '''''MPI_ALLTOALL''''' (brown line) and '''''MPI_SENDRECV''''' (purple line).

Ideally, for a so-called strong scaling test, where the same setup is run on different numbers of cores, the wallclock time of a run should decrease by a factor of two if the core number is doubled; this ideal behavior is indicated in Fig. 15a and b (black lines). Figure 15b shows that the code scales very well up to ''20000'' cores and still acceptably for larger numbers (gray line). The decreasing scalability for larger core numbers is mainly caused by a performance drop of the '''''MPI_ALLTOALL''''' routine (brown line). In contrast, the pure computational part, i.e., the calculation of the prognostic equations (red line), scales perfectly up to the maximum number of cores.

While the general parallelization methods used in version 4.0 do not differ from the first version, a large number of code optimizations have been carried out since then. Only the most important ones shall be briefly discussed at this point, namely the scalar optimization for different processor types and the overlapping of computation and inter-processor communication.

The original PALM code calculated the different contributions to the tendency terms (i.e., advection, buoyancy, diffusion, etc.) and the final prognostic equation for each prognostic quantity in separate 3-D loops over the three spatial directions, as sketched below. In the case of large 3-D arrays that do not fit into the cache of cache-based processors such as Intel Xeon or AMD Athlon, the array data have to be reloaded from main memory for each 3-D loop, which is extremely time consuming.
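For illustration, the original structure corresponded schematically to one separate 3-D loop per tendency term, as in the following simplified sketch (the loop bounds and the array '''''tend''''' are the same placeholders as used in the example further below; the sketch is not the original code):

{{{
#!Latex
\begin{verbatim}
   DO  i = nxl, nxr
      DO  j = nys, nyn
         DO  k = nzb+1, nzt
!--         advection term
            tend(k,j,i) = ...
         ENDDO
      ENDDO
   ENDDO

   DO  i = nxl, nxr
      DO  j = nys, nyn
         DO  k = nzb+1, nzt
!--         diffusion term
            tend(k,j,i) = tend(k,j,i) + ...
         ENDDO
      ENDDO
   ENDDO

   ...    ! further tendencies, each in a separate 3-D loop
\end{verbatim}
}}}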
For this reason, the outer loops over ''i'' and ''j'' have been extracted from each 3-D loop, now forming a 2-D loop over all tendencies and prognostic equations, e.g.:

{{{
#!Latex
\begin{verbatim}
   DO  i = nxl, nxr
      DO  j = nys, nyn

         DO  k = nzb+1, nzt
!--         advection term
            tend(k,j,i) = ...
         ENDDO

         DO  k = nzb+1, nzt
!--         diffusion term
            tend(k,j,i) = tend(k,j,i) + ...
         ENDDO

         ...    ! further tendencies

      ENDDO
   ENDDO
\end{verbatim}
}}}

In this way, array data used in the first loop can be re-used from the cache by the following loops, since the size of a 1-D data column along ''k'' is usually small enough to fit completely into the cache. Figure 15a shows that this loop structure gives a performance gain of 40 % for the computation of the prognostic equations compared with the 3-D loop structure. The overall performance of the code improves by about 15 %. Nonetheless, both methods are implemented in the code in separate branches, since the 3-D loops give a much better performance on vector-based hardware like the NEC-SX or on accelerator boards (e.g., Nvidia K20), because they allow the compilers to generate much longer vectors than the single loops along the ''z'' direction.

From Fig. 15b it is evident that for large setups with a huge number of grid points and more than a few thousand PEs, the solution of the Poisson equation dominates the time consumption of the simulation. This is because the FFT and the data transpositions with '''''MPI_ALLTOALL''''' scale less well than the other parts of the code. The FFT time increases nonlinearly with ''N log(N)'', where ''N'' is the total number of grid points along ''x'' or ''y''. The '''''MPI_ALLTOALL''''' time also increases nonlinearly with the number of cores. While the scaling problem is not easy to address, the Poisson solver can be sped up by overlapping the computation (FFT and tri-diagonal equation solver) and the communication among the PEs ('''''MPI_ALLTOALL''''').

PALM solves the Poisson equation in several steps in a sequential order. So far, the complete 3-D subdomain array was first transposed, followed by the FFT along ''x'', followed by the next transposition, followed by the FFT along ''y'', and so on. The FFT calculation cannot start until the complete 3-D array has been transposed. Now, in PALM 4.0, the 3-D arrays are processed in 2-D slices (e.g., in ''z''-''x'' slices for the transposition from ''z'' to ''x''). The slices are processed in a loop along the remaining direction (which is ''y'' in this case) with alternating transfer ('''''MPI_ALLTOALL''''' of a 2-D slice) and computation (FFT of this slice). This allows some overlapping of the computation and communication parts for the following reason: on the Cray XC30 system mentioned above, for example, every compute node is populated with two processor dies, each containing a 12-core Intel CPU, which allows for 24 MPI processes per node. However, every node is equipped with only two MPI channels. If a data transfer is issued on all MPI processes simultaneously, the transfer cannot be done completely in parallel, because individual MPI processes have to wait for a free channel. This behavior allows computation and transfer to proceed in parallel. For example, if PE 0 is the first to get a free MPI channel, it can start its computation as soon as its transfer has finished. All other PEs consecutively start their computation after their transfer. When the last transfer is finished, PE 0 has already finished its computation and can immediately start the next transfer. The fact that not all PEs have simultaneous access to an MPI channel thus allows for parallel transfer and computation without any extra programming effort, such as an asynchronous '''''MPI_ALLTOALL''''' or doing the transfer using hyperthreads. Breaking up the workload as described above also improves performance due to better cache utilization, because the transposed data are still in the cache when needed by the FFT.

The difference in performance between the sequential and the overlapping method is displayed in Fig. 15a. The FFT Poisson solver is sped up by about 15 % for up to ''2500'' cores when overlapping is used. For higher core numbers, the current realization of overlapping becomes inefficient because the data chunks handled by '''''MPI_ALLTOALL''''' get too small, so that latency dominates the transfer time. In order to overcome this problem, several 2-D slices can be transposed at the same time. We will implement this technique in one of the next PALM revisions in the near future.
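The following schematic contrasts the two processing orders. The routine and variable names ('''''transpose_zx''''', '''''fft_x''''', '''''transpose_zx_slice''''', '''''fft_x_slice''''', '''''slice''''') are hypothetical placeholders chosen for this sketch and are not the original PALM interfaces:

{{{
#!Latex
\begin{verbatim}
!-- Previous, sequential order: transfer and computation of the full 3-D array
   CALL transpose_zx( f3d, f3d_t )      ! MPI_ALLTOALL of the complete array
   CALL fft_x( f3d_t )                  ! FFT along x of the complete array

!-- PALM 4.0, overlapping order: alternating transfer and computation of 2-D
!-- slices; while one PE computes the FFT of its slice, other PEs can use the
!-- free MPI channels for their transfer
   DO  j = nys, nyn
      CALL transpose_zx_slice( f3d(:,j,:), slice )   ! MPI_ALLTOALL of one z-x slice
      CALL fft_x_slice( slice )                      ! FFT along x of this slice
   ENDDO
\end{verbatim}
}}}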
Besides the parallelization by domain decomposition, PALM is also fully parallelized on the loop level using the shared-memory OpenMP programming model and can be run in a so-called hybrid mode, e.g., with two MPI processes and 12 OpenMP threads per MPI process started on each node. A typical PALM setup uses the 2-D domain decomposition with one MPI process on every processor core. The hybrid mode normally does not give advantages, because the OpenMP parallelization creates an additional synchronization level, so that the total computation time will not decrease (typically it even increases by a few percent). However, hybrid parallelization may have some advantages in the following special cases:

{{{
#!Latex
\begin{itemize}
\item with many processor cores per CPU, a~1-D domain decomposition plus OpenMP parallelization may show better performance, because the number of transpositions is reduced from 6 to 2,
\item since the hybrid mode enlarges the subdomain sizes (because of fewer MPI processes), load imbalance problems caused by, e.g., inhomogeneously distributed buildings or clouds may be reduced, because larger subdomains provide a~better chance to get about the same number of buildings/clouds per subdomain,
\item for the multigrid Poisson solver, the hybrid mode allows generating more grid levels on the subdomains because of their larger size, which may help to improve the solver convergence.
\end{itemize}
}}}

Since the speedup behavior depends on many factors such as the problem size, the virtual 2-D processor grid, the network, etc., the actual speedup is difficult to predict and should be tested individually for every setup.

The data exchange in case of coupled ocean-atmosphere runs is realized with MPI. Two methods are available. The first one, based on MPI-1, splits the default communicator ('''''MPI_COMM_WORLD''''') into two parts, with the respective number of PEs assigned to the atmosphere and ocean part as given by external parameters. The second method starts the respective number of MPI tasks for the atmosphere and the ocean independently (e.g., by two calls of '''''mpiexec''''') and uses the MPI-2 routines '''''MPI_COMM_CONNECT''''' and '''''MPI_COMM_ACCEPT''''' to couple them.
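As an illustration of the first (MPI-1) method, the following minimal sketch splits '''''MPI_COMM_WORLD''''' by rank into an atmosphere and an ocean communicator. The program name, the parameter '''''nprocs_atmos''''', and the variable names are assumptions made for this example and are not the original PALM coupler code:

{{{
#!Latex
\begin{verbatim}
PROGRAM coupling_split_sketch

   USE mpi

   IMPLICIT NONE

   INTEGER, PARAMETER ::  nprocs_atmos = 16   ! PEs assigned to the atmosphere
                                              ! (given by an external parameter)
   INTEGER ::  color, comm_model, ierr, myid

   CALL MPI_INIT( ierr )
   CALL MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )

!-- The first nprocs_atmos ranks form the atmosphere part, all others the
!-- ocean part
   IF ( myid < nprocs_atmos )  THEN
      color = 0   ! atmosphere
   ELSE
      color = 1   ! ocean
   ENDIF

!-- Split the default communicator; comm_model replaces MPI_COMM_WORLD for
!-- all model-internal communication of the respective part
   CALL MPI_COMM_SPLIT( MPI_COMM_WORLD, color, myid, comm_model, ierr )

!-- Time stepping and inter-model data exchange would follow here

   CALL MPI_FINALIZE( ierr )

END PROGRAM coupling_split_sketch
\end{verbatim}
}}}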
== References ==

 * [=#raasch2001] '''Raasch S, Schröter M.''' 2001. PALM - a large-eddy simulation model performing on massively parallel computers. Meteorol. Z. 10: 363–372.
 * [=#sullivan2011] '''Sullivan PP, Patton EG.''' 2011. The effect of mesh resolution on convective boundary layer statistics and structures generated by large-eddy simulation. J. Atmos. Sci. 68: 2395–2415.