Timestamp: Jan 2, 2015 11:35:51 AM (7 years ago)
Author: boeske
Message: several updates in the tutorial
File: 1 edited
  • palm/trunk/TUTORIAL/SOURCE/parallelization.tex

    r1226 r1515  
    66\usepackage{ngerman}
    77\usepackage{pgf}
    8 \usetheme{Dresden}
    98\usepackage{subfigure}
    109\usepackage{units}
     
    2726        \ttfamily,showstringspaces=false,captionpos=b}
    2827
    29 \institute{Institut für Meteorologie und Klimatologie, Leibniz Universität Hannover}
     28\institute{Institute of Meteorology and Climatology, Leibniz Universität Hannover}
     29\selectlanguage{english}
    3030\date{last update: \today}
    3131\event{PALM Seminar}
     
    4747
    4848\title[Parallelization]{Parallelization}
    49 \author{Siegfried Raasch}
     49\author{PALM group}
    5050
    5151\begin{document}
     
    6262\begin{frame}
    6363   \frametitle{Basics of Parallelization}
    64    \tikzstyle{yellow} = [rectangle,  fill=yellow!20, text width=0.6\textwidth, font=\scriptsize]
     64   \tikzstyle{yellow} = [rectangle,  fill=yellow!20, text width=0.4\textwidth, font=\tiny]
    6565   \scriptsize
    6666   \textbf{Parallelization:}
    6767   \begin{itemize}
    68       \item<2-> All processor elements (PE, core) are carrying out the same program (SIMD).
     68      \item<2-> All processor elements (PE, core) are carrying out the same program code (SIMD).
    6969      \item<3-> Each PE of a parallel computer operates on a different set of data.
    7070   \end{itemize}
     
    8787               \node [yellow] (1) {%
    8888               \texttt{!\$OMP DO}\\
    89                \texttt{\quad \quad DO  i = 1, 100}\\
    90                \quad \quad \quad $\vdots$\\
    91                \texttt{\quad \quad ENDDO}};
     89               \texttt{DO  i = 1, 100}\\
     90               \quad $\vdots$\\
     91               \texttt{ENDDO}};
     92            \end{tikzpicture}
     93            \begin{tikzpicture}[auto, node distance=0]
     94               \node [yellow] (2) {%
     95               \texttt{!\$acc kernels}\\
     96               \texttt{DO  i = 1, 100}\\
     97               \quad $\vdots$\\
     98               \texttt{ENDDO}};
    9299            \end{tikzpicture}
    93100         \end{center}
     101         \vspace{-1mm}
    94102         \onslide<8-> parallelization can easily be done by the compiler, if all PEs have access to all variables (shared memory)\\
    95103         \onslide<10-> \textbf{shared memory model (OpenMP)}
     104         \onslide<10-> \textbf{accelerator model (OpenACC)}
    96105      \end{column}
    97106   \end{columns}
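The two directives shown on this slide can be tried out in isolation. The following is a minimal, compilable sketch and not taken from the PALM source: it repeats the loop from the slide once with OpenMP and once with OpenACC. A bare !$OMP DO must sit inside an !$OMP PARALLEL region, so the combined form !$OMP PARALLEL DO is used here; the array name and the loop body are invented for illustration.

   ! Minimal sketch (not from PALM): the loop of the slide with both directive families
   PROGRAM loop_example
      IMPLICIT NONE
      INTEGER ::  i
      REAL    ::  a(100)

   !$OMP PARALLEL DO              ! threads of one shared-memory node share the iterations
      DO  i = 1, 100
         a(i) = REAL( i )
      ENDDO
   !$OMP END PARALLEL DO

   !$acc kernels                  ! the same loop offloaded to an accelerator (OpenACC)
      DO  i = 1, 100
         a(i) = a(i) + 1.0
      ENDDO
   !$acc end kernels

      PRINT*, 'a(100) = ', a(100)
   END PROGRAM loop_example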
     
    113122         \node (sha_mem) at (3.5,-1) [text width=0.35\textwidth] {\footnotesize \textbf{shared} memory\\(SGI-Altix, multicore PCs)};
    114123         \onslide<7-> \node (MPI) at (-3.5,-3) [ellipse,fill=yellow!90] {MPI};
    115          \onslide<8-> \node (OpenMP) at (3.5,-3) [ellipse,fill=yellow!90] {OpenMP};         
     124         \onslide<8-> \node (OpenMP) at (3.5,-3) [ellipse,fill=yellow!90, text width=0.13\textwidth] {\footnotesize OpenMP OpenACC};         
    116125         \onslide<6-> \node (clustered_systems) at (0,-3) [draw, text width=0.15\textwidth] {clustered systems};
    117126         \node (cs_info) at (0,-4.2) [text width=0.4\textwidth] {\footnotesize (IBM-Regatta, Linux-Cluster,
     
    253262      \vspace{2mm}
    254263      \item<12-> OpenMP parallelization as well as mixed usage of OpenMP and
    255                     MPI is also possible.
     264                    MPI is also implemented.
    256265   \end{itemize}
    257266\end{frame}
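As a sketch of the last bullet, a hybrid MPI + OpenMP setup could be initialized as follows. This is an illustration only and not PALM code; the program name and the printed message are invented, and MPI_THREAD_FUNNELED is just one possible thread-support level.

   ! Hybrid sketch (not PALM code): each MPI process spawns OpenMP threads
   PROGRAM hybrid_example
      USE MPI
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER ::  ierror, myid, provided

   !  request thread support, because OpenMP threads run inside each MPI task
      CALL MPI_INIT_THREAD( MPI_THREAD_FUNNELED, provided, ierror )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierror )

   !$OMP PARALLEL
      PRINT*, 'MPI rank ', myid, '  OpenMP thread ', omp_get_thread_num()
   !$OMP END PARALLEL

      CALL MPI_FINALIZE( ierror )
   END PROGRAM hybrid_example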
     
    299308            \item<1-> The parallel version of PALM is switched on by \texttt{mrun}-option ''\texttt{-K parallel}''. Additionally, the number of required processors and the number of tasks per node (number of PEs to be used on one node) have to be provided:\\
    300309                 \quad \texttt{mrun ... -K parallel -X64 -T8 ...}
    301                  \item<2-> From an accounting point of view, it is always most efficient to use all PEs of a node (\texttt{-T8}) (in case of a ''non-shared'' usage of nodes).
     310                 \item<2-> From an accounting point of view, it is always most efficient to use all PEs of a node (e.g. \texttt{-T8}) (in case of a ''non-shared'' usage of nodes).
    302311                 \item<3-> If a normal unix-kernel operating system (not a micro-kernel) is running on each CPU, there might be a speed-up of the code if 1-2 PEs fewer than the total number of PEs on the node are used.
    303312                 \item<4-> On machines with a comparably slow network, a 1D-decomposition (along $x$) should be used, because then only two transpositions have to be carried out by the pressure solver. A 1D-decomposition is automatically used for NEC-machines (e.g.  \texttt{-h necriam}). The virtual processor grid to be used can be set manually by d3par-parameters \texttt{npex} and \texttt{npey}.
    304             \item<5-> Using the Open-MP parallelization does not yield any advantage over using a pure domain decomposition with MPI (contrary to expectations, it mostly slows down the computational speed), but this may change on cluster systems for very large number of processors ($>$10000?).\\       
     313            \item<5-> Using the OpenMP parallelization does not yield any advantage over a pure domain decomposition with MPI (contrary to expectations, it mostly slows down the computational speed), but this may change on cluster systems with very large numbers of processors ($>$10000?) or with Intel-Xeon-Phi accelerator boards.\\       
    305314         \end{itemize}
    306315         \begin{center}
    307          \vspace{-7mm}
     316         \vspace{-3mm}
    308317         \onslide<4-> \includegraphics[width=0.13\textwidth]{parallelization_figures/folie_6.png}
    309318         \end{center}
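A sketch of how the virtual processor grid mentioned in the last item could be prescribed. Only npex, npey and the mrun options are taken from the slide; the &d3par namelist layout shown here is generic Fortran namelist syntax, and the assumption that npex * npey has to match the number of PEs requested with -X is not stated on the slide.

   &d3par  npex = 64,  npey = 1,   ! 1D decomposition along x on 64 PEs
   /                               ! e.g. together with:  mrun ... -K parallel -X64 -T8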
     
    321330            \item<1-> MPI (message passing interface) is a portable interface for communication between PEs (FORTRAN or C library).
    322331            \vspace{2mm}
    323             \item<2-> To make MPI available on HLRN's SGI-ICE, the module \texttt{mpt} must be loaded by setting the \texttt{\%modules} option in .mrun.config appropriately:
    324 
    325                  \quad \texttt{\%modules   ...:mpt:...}
    326             \vspace{2mm}
    327                  \item<3-> The path to the MPI-library may have to be given in the compiler call, by setting an appropriate option in the configuration file .mrun.config:
    328 
    329                  \quad \texttt{\%lopts  -r8:-nbs:\textcolor{blue}{-L:<replace by mpi library path>:-lmpi}}
    330             \vspace{2mm}
    331                  \item<4-> All MPI calls must be within\\
     332            \item<2-> MPI on the Cray-XC30 of HLRN-III is provided with module \texttt{PrgEnv-cray}, which is loaded by default.
     333            \vspace{2mm}
     334                 \item<3-> All MPI calls must be within\\
    332335                 \quad \texttt{CALL MPI\_INIT( ierror )}\\
    333336                 \quad $\vdots$\\
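A minimal, self-contained sketch of the frame that has to surround all MPI calls, as stated in the last item; the program name and the two query calls are added here for illustration and are not part of the slide.

   ! Minimal MPI skeleton (illustration only)
   PROGRAM mpi_skeleton
      USE MPI
      IMPLICIT NONE
      INTEGER ::  ierror, myid, numprocs

      CALL MPI_INIT( ierror )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierror )   ! total number of PEs
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierror )       ! id of this PE: 0 ... numprocs-1

      ! ... all other MPI calls and the actual computation ...

      CALL MPI_FINALIZE( ierror )
   END PROGRAM mpi_skeleton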
     
    399402            \tiny
    400403            \vspace{2mm}
    401             \quad \texttt{u(:,:,nxl\textcolor{blue}{-ngl}), u(:,:,nxr\textcolor{blue}{+ngl})    ! left and right boundary}\\
    402             \quad \texttt{u(:,nys\textcolor{blue}{-ngl},:), u(:,nyn\textcolor{blue}{+ngl},:)    ! south and north boundary}\\
    403             \vspace{4mm}
     404            \quad \texttt{u(:,:,nxl\textcolor{blue}{-nbgp}), u(:,:,nxr\textcolor{blue}{+nbgp}) ! left and right boundary}\\
     405            \quad \texttt{u(:,nys\textcolor{blue}{-nbgp},:), u(:,nyn\textcolor{blue}{+nbgp},:) ! south and north boundary}\\
     406            \vspace{1mm}
     407            \scriptsize The actual code uses \texttt{\textcolor{blue}{nxlg}=nxl\textcolor{blue}{-nbgp}}, etc...\\
     408            \vspace{2mm}
    404409            \item<2-> \scriptsize The exchange of ghost points is done in file \texttt{exchange\underline{\ }horiz.f90}\\
    405410            \textbf{\underline{Simplified} example:} synchronous exchange of ghost points along $x$ ($yz$-planes, send left, receive right plane):\\
    406411            \tiny
    407412            \vspace{2mm}
    408             \quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,nys-\textcolor{blue}{ngl},nxl),   ngp\underline{\ }yz, MPI\underline{\ }REAL, pleft,  0,}\\
    409             \quad \texttt{\hspace{9.5em}ar(nzb,nys-\textcolor{blue}{ngl},nxr+1), ngp\underline{\ }yz, MPI\underline{\ }REAL, pright, 0,}\\
     413            \quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,\textcolor{blue}{nysg},nxl),   ngp\underline{\ }yz, MPI\underline{\ }REAL, pleft,  0,}\\
     414            \quad \texttt{\hspace{9.5em}ar(nzb,\textcolor{blue}{nysg},nxr+1), ngp\underline{\ }yz, MPI\underline{\ }REAL, pright, 0,}\\
    410415            \quad \texttt{\hspace{9.5em}comm2d, status, ierr )}\\
    411             \vspace{4mm}
     416            \vspace{2mm}
    412417            \item<3-> \scriptsize In the real code special MPI data types (vectors) are defined for exchange of $yz$/$xz$-planes for performance reasons and because array elements to be exchanged are not consecutively stored in memory for $xz$-planes:\\
    413418            \tiny
    414419            \vspace{2mm}
    415             \quad \texttt{ngp\underline{\ }yz(0) = (nzt - nzb + 2) * (nyn - nys + 1 + 2 * \textcolor{blue}{ngl} )}\\
    416             \quad \texttt{CALL MPI\underline{\ }TYPE\underline{\ }VECTOR( \textcolor{blue}{ngl}, ngp\underline{\ }yz(0), ngp\underline{\ }yz(0), MPI\underline{\ }REAL, type\underline{\ }yz(0), ierr )}\\
     420            \quad \texttt{ngp\underline{\ }yz(0) = (nzt - nzb + 2) * (nyn - nys + 1 + 2 * \textcolor{blue}{nbgp} )}\\
     421            \quad \texttt{CALL MPI\underline{\ }TYPE\underline{\ }VECTOR( \textcolor{blue}{nbgp}, ngp\underline{\ }yz(0), ngp\underline{\ }yz(0), MPI\underline{\ }REAL, type\underline{\ }yz(0), ierr )}\\
    417422            \quad \texttt{CALL MPI\underline{\ }TYPE\underline{\ }COMMIT( type\underline{\ }yz(0), ierr )   ! see file init\underline{\ }pegrid.f90}\\
    418423            \ \\
    419             \quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,nys-ngl,nxl), 1, type\underline{\ }yz(grid\underline{\ }level), pleft, 0, ...}\\
     424            \quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,\textcolor{blue}{nysg},nxl), 1, type\underline{\ }yz(grid\underline{\ }level), pleft, 0, ...}\\
    420425         \end{itemize}       
    421426\end{frame}
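To make the simplified MPI_SENDRECV example from this frame self-contained, the following sketch exchanges one ghost layer along x for a 1D decomposition. It is not the PALM routine exchange_horiz; the subroutine name, the single ghost layer, and the simplified index ranges are assumptions for illustration.

   ! Simplified ghost-layer exchange along x (one ghost layer, contiguous yz-planes)
   SUBROUTINE exchange_ghost_x( ar, nz, ny, nxl, nxr, pleft, pright, comm, ierr )
      USE MPI
      IMPLICIT NONE
      INTEGER, INTENT(IN)  ::  nz, ny, nxl, nxr     ! local index bounds
      INTEGER, INTENT(IN)  ::  pleft, pright, comm  ! neighbour ranks and communicator
      INTEGER, INTENT(OUT) ::  ierr
      REAL, INTENT(INOUT)  ::  ar(nz,ny,nxl-1:nxr+1)
      INTEGER              ::  ngp_yz, status(MPI_STATUS_SIZE)

      ngp_yz = nz * ny                    ! number of grid points of one yz-plane

   !  send the own left boundary plane to the left neighbour and
   !  receive the neighbour's plane into the own right ghost layer
      CALL MPI_SENDRECV( ar(1,1,nxl),   ngp_yz, MPI_REAL, pleft,  0,           &
                         ar(1,1,nxr+1), ngp_yz, MPI_REAL, pright, 0,           &
                         comm, status, ierr )
   END SUBROUTINE exchange_ghost_x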
     
    432437            \item<2-> The following example is for a transposition from $x$ to $y$, i.e. for the input array all data elements along $x$ reside on the same PE, while after the transposition, all elements along $y$ are on the same PE:\\
    433438            \ \\
    434             \scriptsize
     439            \tiny
    435440            \texttt{!}\\
    436441            \texttt{!--   in SUBROUTINE transpose\underline{\ }xy:}\\
    437            \texttt{CALL MPI\underline{\ }ALLTOALL( f\underline{\ }inv(nys\underline{\ }x,nzb\underline{\ }x,0), sendrecvcount\underline{\ }xy, MPI\underline{\ }REAL, \&}\\
    438            \texttt{\hspace{9.5em}work(1), \hspace{6.5em}sendrecvcount\underline{\ }xy, MPI\underline{\ }REAL, \&}\\
     442           \texttt{CALL MPI\underline{\ }ALLTOALL( f\underline{\ }inv(nys\underline{\ }x,nzb\underline{\ }x,0), \hspace{1em}sendrecvcount\underline{\ }xy, MPI\underline{\ }REAL, \&}\\
     443           \texttt{\hspace{9.5em}work(1,nzb\underline{\ }y, nxl\underline{\ }y,0), sendrecvcount\underline{\ }xy, MPI\underline{\ }REAL, \&}\\
    439444           \texttt{\hspace{9.5em}comm1dy, ierr )}\\
    440445           \ \\
    441            \item<3-> The data resorting before and after the calls of MPI\_ALLTOALL is highly optimized to account for the different processor architectures.
     446           \item<3-> The data resorting before and after the calls of MPI\_ALLTOALL is highly optimized to account for the different processor architectures and even allows for overlapping communication and calculation.
    442447         \end{itemize}
    443448      \end{column}
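The core of the transposition shown above is a single collective call. The following self-contained sketch shows only that call; the resorting of the data before and after it, which the last item refers to, is omitted, and the subroutine name and the assumed-size buffers are illustrative assumptions.

   ! Bare MPI_ALLTOALL pattern behind a transposition (illustration only)
   SUBROUTINE transpose_sketch( f_inv, work, sendrecvcount, comm1dy, ierr )
      USE MPI
      IMPLICIT NONE
      INTEGER, INTENT(IN)  ::  sendrecvcount, comm1dy
      INTEGER, INTENT(OUT) ::  ierr
      REAL, INTENT(IN)     ::  f_inv(*)   ! send buffer, blocks ordered by target PE
      REAL, INTENT(OUT)    ::  work(*)    ! receive buffer, one block from every PE

   !  every PE sends one equally sized block to every other PE of the
   !  1D communicator and receives one block from every other PE in return
      CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_REAL,                       &
                         work,  sendrecvcount, MPI_REAL,                       &
                         comm1dy, ierr )
   END SUBROUTINE transpose_sketch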
     
    494499%Folie 14
    495500\begin{frame}
     501   \frametitle{PALM Parallel I/O for 2D/3D Data with netCDF4/HDF5}
     502   \footnotesize
     503   \begin{itemize}
     504      \item<1-> The Cray XC30 of HLRN-III allows direct parallel I/O to a netCDF file
     505      \vspace{2mm}
     506      \item<2-> modules \texttt{cray\_hdf5\_parallel} and \texttt{cray\_netcdf\_hdf5parallel} have to be loaded
     507      \vspace{2mm}
     508      \item<3-> cpp-switches \texttt{-D\_\_netcdf}, \texttt{-D\_\_netcdf4}, \texttt{-D\_\_netcdf4\_parallel} have to be set
     509      \vspace{2mm}
     510      \item<4-> Both are done in the default HLRN-III block of the configuration file (\texttt{lccrayh})
     511      \vspace{2mm}
     512      \item<5-> \texttt{d3par}-parameter \texttt{netcdf\_data\_format=5} has to be set in the parameter file
     513      \vspace{2mm}
     514      \item<6-> \texttt{combine\_plot\_fields.x} is not required in this case
     515   \end{itemize}
     516\end{frame}
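On the runtime side, the setting from this frame could look as follows in the parameter file. Only netcdf_data_format = 5 is taken from the slide; the &d3par namelist layout is generic namelist syntax and shown here only as a sketch. The cpp switches -D__netcdf, -D__netcdf4 and -D__netcdf4_parallel act at compile time and are set in the lccrayh block, as stated above.

   &d3par  netcdf_data_format = 5,   ! parallel netCDF4/HDF5 output; combine_plot_fields.x not needed
   /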
     517
     518%Folie 15
     519\begin{frame}
    496520   \frametitle{Performance Examples (I)}
    497521   \begin{itemize}
     
    520544\end{frame}
    521545
    522 %Folie 15
     546%Folie 16
    523547\begin{frame}
    524548   \frametitle{Performance Examples (II)}
     
    535559         \begin{column}{0.5\textwidth}
    536560            \vspace{35mm}
    537             \onslide<2-> currently largest simulation feasible on that system:\\
     561            \onslide<2-> largest simulation feasible on that system:\\
    538562            \ \\
    539563            $4096^3$ grid points
     
    542566\end{frame}
    543567
    544 %Folie 16
     568%Folie 17
    545569\begin{frame}
    546570   \frametitle{Performance Examples (III)}