Changeset 1226 for palm/trunk/TUTORIAL/SOURCE/parallelization.tex
- Timestamp: Sep 18, 2013 1:19:19 PM
- File: 1 edited
palm/trunk/TUTORIAL/SOURCE/parallelization.tex
--- palm/trunk/TUTORIAL/SOURCE/parallelization.tex (r973)
+++ palm/trunk/TUTORIAL/SOURCE/parallelization.tex (r1226)
@@ -15,6 +15,6 @@
 \usepackage{tikz}
 \usetikzlibrary{shapes,arrows,positioning}
-\usetikzlibrary{decorations.markings} %neues paket
-\usetikzlibrary{decorations.pathreplacing} %neues paket
+%\usetikzlibrary{decorations.markings} %neues paket
+%\usetikzlibrary{decorations.pathreplacing} %neues paket
 \def\Tiny{\fontsize{4pt}{4pt}\selectfont}
 \usepackage{amsmath}
@@ -76,5 +76,5 @@
 \onslide<5->each PE solves the equations for a different subdomain of the total domain
 \begin{center}
-\includegraphics[width=0.5\textwidth]{parallelization_figures/subdomain.png}
+\includegraphics[width=0.3\textwidth]{parallelization_figures/subdomain_folie2.png}
 \end{center}
 \onslide<7->each PE only knows the variable values from its subdomain, communication / data exchange between PEs is necessary\\
@@ -109,5 +109,5 @@
 \node (center) at (0,1) {};
 \onslide<2-> \node (Network) at (-3.5,1) [draw, ellipse,fill=green!20] {Network};
-\node (dis_mem) at (-3.5,-1) [text width=0.28\textwidth] {\footnotesize \textbf{distributed} memory\\(Cray-T3E)};
+\node (dis_mem) at (-3.5,-1) [text width=0.28\textwidth] {\footnotesize \textbf{distributed} memory\\(Cray-XC30)};
 \onslide<3-> \node (add_mem) at (3.5,1) [rectangle, draw] {adressable memory};
 \node (sha_mem) at (3.5,-1) [text width=0.35\textwidth] {\footnotesize \textbf{shared} memory\\(SGI-Altix, multicore PCs)};
@@ -116,5 +116,5 @@
 \onslide<6-> \node (clustered_systems) at (0,-3) [draw, text width=0.15\textwidth] {clustered systems};
 \node (cs_info) at (0,-4.2) [text width=0.4\textwidth] {\footnotesize (IBM-Regatta, Linux-Cluster,
-NEC-SX, SGI-ICE, Cray-XE6)};
+NEC-SX, SGI-ICE, Cray-XC)};
 
 % Adressable memory node (big)
@@ -248,10 +248,10 @@
 \vspace{2mm}
 \begin{itemize}
-\item<10-> Alternatively, a 1D-decomposition along $x$ or $y$ may be used in case of slow networks, but this generally doesn't scale for processor numbers $>$ 256.
+\item<10-> Alternatively, a 1D-decomposition along $x$ or $y$ may be used.
 \vspace{2mm}
 \item<11-> Message passing is realized using MPI.
 \vspace{2mm}
 \item<12-> OpenMP parallelization as well as mixed usage of OpenMP and
-MPI is also possible. (OpenMP tests and optimization is under way)
+MPI is also possible.
 \end{itemize}
 \end{frame}
@@ -285,5 +285,5 @@
 \onslide<5-> \includegraphics[width=0.8\textwidth]{parallelization_figures/fft.png} \end{center}
 \vspace{-4mm}
-\textbf{Example: transpositions for solving the Poisson\\ \hspace{4em}equation}
+\textbf{Example: transpositions for solving the Poisson\\ \hspace{4.1em}equation}
 \end{column}
 \end{columns}
@@ -302,9 +302,9 @@
 \item<3-> If a normal unix-kernel operating system (not a micro-kernel) is running on each CPU, then there migth be a speed-up of the code, if 1-2 PEs less than the total number of PEs on the node are used.
 \item<4-> On machines with a comparably slow network, a 1D-decomposition (along $x$) should be used, because then only two transpositions have to be carried out by the pressure solver. A 1D-decomposition is automatically used for NEC-machines (e.g. \texttt{-h necriam}). The virtual processor grid to be used can be set manually by d3par-parameters \texttt{npex} and \texttt{npey}.
-\item<6-> Using the Open-MP parallelization does not yield any advantage over using a pure domain decomposition with MPI (contrary to expectations, it mostly slows down the computational speed), but this may change on cluster systems for very large number of processors ($>$10000?).\\
+\item<5-> Using the Open-MP parallelization does not yield any advantage over using a pure domain decomposition with MPI (contrary to expectations, it mostly slows down the computational speed), but this may change on cluster systems for very large number of processors ($>$10000?).\\
 \end{itemize}
 \begin{center}
 \vspace{-7mm}
-\onslide<5-> \includegraphics[width=0.13\textwidth]{parallelization_figures/folie_6.png}
+\onslide<4-> \includegraphics[width=0.13\textwidth]{parallelization_figures/folie_6.png}
 \end{center}
 \end{column}
@@ -325,9 +325,9 @@
 \quad \texttt{\%modules ...:mpt:...}
 \vspace{2mm}
-\item<2-> The path to the MPI-library may have to be given in the compiler call, by setting an appropriate option in the configuration file .mrun.config:
-
-\quad \texttt{\%lopts -axW:-cpp:-r8:-nbs:-Vaxlib:\textcolor{blue}{-L:<replace by mpi library path>:-lmpi}}
+\item<3-> The path to the MPI-library may have to be given in the compiler call, by setting an appropriate option in the configuration file .mrun.config:
+
+\quad \texttt{\%lopts -r8:-nbs:\textcolor{blue}{-L:<replace by mpi library path>:-lmpi}}
 \vspace{2mm}
-\item<3-> All MPI calls must be within\\
+\item<4-> All MPI calls must be within\\
 \quad \texttt{CALL MPI\_INIT( ierror )}\\
 \quad $\vdots$\\
@@ -417,4 +417,4 @@
 \quad \texttt{CALL MPI\underline{\ }TYPE\underline{\ }COMMIT( type\underline{\ }yz(0), ierr ) ! see file init\underline{\ }pegrid.f90}\\
 \ \\
-\quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,nys-ngl,nxl), type\underline{\ }yz(grid\underline{\ }level), MPI\underline{\ }REAL, pleft, 0, ...}\\
+\quad \texttt{CALL MPI\underline{\ }SENDRECV( ar(nzb,nys-ngl,nxl), 1, type\underline{\ }yz(grid\underline{\ }level), pleft, 0, ...}\\
 \end{itemize}
@@ -468,5 +468,5 @@
 \onslide<4-> \textbf{General comment:}
 \begin{itemize}
-\item Parallel I/O on a large number of files ($>$1000) currently may cause severe file system problems (e.g. on Lustre file systems).\\ \textbf{Workaround:} reduce the maximum number of parallel I/O streams\\ \hspace{5.75em}(see \texttt{mrun}-options)
+\item Parallel I/O on a large number of files ($>$1000) currently may cause severe file system problems (e.g. on Lustre file systems).\\ \textbf{Workaround:} reduce the maximum number of parallel I/O streams\\ \hspace{5.75em}(see \texttt{mrun}-option \texttt{-w})
 \end{itemize}
 \end{column}
@@ -476,5 +476,5 @@
 
 
-%
+%Folie 13
 \begin{frame}
 \frametitle{PALM Parallel I/O for 2D/3D Data}
@@ -492,5 +492,5 @@
 \end{frame}
 
-%
+%Folie 14
 \begin{frame}
 \frametitle{Performance Examples (I)}
@@ -520,5 +520,5 @@
 \end{frame}
 
-%
+%Folie 15
 \begin{frame}
 \frametitle{Performance Examples (II)}
@@ -542,3 +542,25 @@
 \end{frame}
 
+%Folie 16
+\begin{frame}
+\frametitle{Performance Examples (III)}
+\begin{itemize}
+\item Simulation with $2160^3$ grid points ($\sim$ 2 TByte memory)
+\end{itemize}
+\begin{columns}[T]
+\begin{column}{0.5\textwidth}
+\includegraphics[scale=0.3]{parallelization_figures/perf_4.png} \\
+\scriptsize
+\quad Cray-XC30, HLRN-III, Hannover\\
+\quad (2D-domain decomposition)
+\end{column}
+\begin{column}{0.5\textwidth}
+\vspace{35mm}
+\onslide<2-> currently largest simulation feasible on that system:\\
+\ \\
+$5600^3$ grid points
+\end{column}
+\end{columns}
+\end{frame}
+
 \end{document}
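The most substantive code change above is the corrected MPI_SENDRECV call on slide line 419: once a derived datatype has been committed, the buffer argument must be followed by a count (here 1) and the datatype handle, and the separate MPI_REAL argument is dropped, because the base type is already encoded in type_yz. The following is a minimal, self-contained sketch of that pattern, not the original init_pegrid.f90 code; all names (ar, ngl, type_yz, pleft, pright, comm1d) and the tiny grid sizes are illustrative assumptions.

PROGRAM ghost_exchange
   USE mpi
   IMPLICIT NONE
   INTEGER, PARAMETER ::  nz = 4, ny = 4, nx = 3, ngl = 1  ! tiny subdomain, ngl ghost layers
   REAL    ::  ar(nz, 1-ngl:ny+ngl, 1-ngl:nx+ngl)          ! ghost layers in y and x
   INTEGER ::  type_yz, ngp_yz, npex, myid, pleft, pright, comm1d, ierr
   INTEGER ::  status(MPI_STATUS_SIZE)

   CALL MPI_INIT( ierr )
   CALL MPI_COMM_SIZE( MPI_COMM_WORLD, npex, ierr )

!--  1D virtual processor grid along x with cyclic boundary conditions;
!--  MPI_CART_SHIFT returns the ranks of the left/right neighbours
   CALL MPI_CART_CREATE( MPI_COMM_WORLD, 1, (/ npex /), (/ .TRUE. /),   &
                         .FALSE., comm1d, ierr )
   CALL MPI_CART_SHIFT( comm1d, 0, 1, pleft, pright, ierr )
   CALL MPI_COMM_RANK( comm1d, myid, ierr )
   ar = REAL( myid )

!--  one transfer unit = ngl contiguous yz-planes (incl. y-ghost points)
   ngp_yz = nz * ( ny + 2*ngl )
   CALL MPI_TYPE_VECTOR( ngl, ngp_yz, ngp_yz, MPI_REAL, type_yz, ierr )
   CALL MPI_TYPE_COMMIT( type_yz, ierr )

!--  corrected argument order: buffer, count (= 1), datatype, rank, tag;
!--  no separate MPI_REAL argument, the base type is part of type_yz
   CALL MPI_SENDRECV( ar(1,1-ngl,1),    1, type_yz, pleft,  0,          &
                      ar(1,1-ngl,nx+1), 1, type_yz, pright, 0,          &
                      comm1d, status, ierr )
!--  (the exchange in the opposite direction is analogous)

   CALL MPI_TYPE_FREE( type_yz, ierr )
   CALL MPI_FINALIZE( ierr )
END PROGRAM ghost_exchange

Packing the ngl ghost planes into one committed datatype lets a single MPI_SENDRECV move the whole yz ghost layer without intermediate copy buffers, which is presumably why the slide demonstrates the derived-datatype form rather than a plain MPI_REAL transfer.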