source: palm/trunk/DOC/app/chapter_3.3.html @ 17

Last change on this file since 17 was 5, checked in by raasch, 18 years ago

html-documentation added

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
  <meta http-equiv="CONTENT-TYPE"
 content="text/html; charset=windows-1252">
  <title>PALM chapter 3.3</title>
  <meta name="GENERATOR" content="StarOffice 7  (Win32)">
  <meta name="AUTHOR" content="Marcus Oliver Letzel">
  <meta name="CREATED" content="20040728;14053490">
  <meta name="CHANGED" content="20041112;14150257">
  <meta name="KEYWORDS" content="parallel LES model">
  <style>
        <!--
                @page { size: 21cm 29.7cm }
        -->
        </style>
</head>
<body dir="ltr" lang="en-US">
<h3 style="line-height: 100%;">3.3 Initialization and restart runs</h3>
<p style="line-height: 100%;">A job started by <b>mrun</b> is queued
by the queuing system of the remote computer into a suitable job class,
according to its requested computing time, its memory requirement and
(on parallel computers) the number of processing elements it needs.
Each job class admits only jobs up to certain maximum requirements (e.g.
the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt>
on the IBM Regatta "hanni" of the HLRN admits only jobs that request no
more than 7200 seconds of computing time and no more than 32
processing elements). The job classes are important for the scheduling
process of the computer. Jobs with small requirements usually
start very quickly, whereas jobs with larger requirements must wait
longer (sometimes several days). </p>
<p style="line-height: 100%;">Before starting a model run, the user
must estimate how much CPU time the model will need for the simulation.
The required time in seconds has to be given with the <b>mrun</b>
option <tt><a
 href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt>;
it influences the job class into which the job is queued. Because
the model usually uses a variable time step, the number of time steps
to be executed, and consequently the time needed by the model, is not
known in advance, so in many cases it can only be estimated very
roughly. It may therefore happen that the model needs more time than
given with the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u></tt>,
which normally causes the job to abort as soon as the available
CPU time is used up. In principle this problem could be solved by
giving a very generous estimate for <u><font
 style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>,
but the queued job may then have to wait longer for execution.<br>
</p>
<p style="line-height: 100%;">To avoid this problem, <b>mrun</b> offers
so-called <b>restart runs</b>. During the model
run PALM continuously checks how much time is left for the
execution of the job. If the run is not yet complete shortly before
this time expires, the model stops and writes the values
of (nearly) all model variables in binary form to a file (local name
<a href="chapter_3.4.html#BINOUT">BINOUT</a>).
After the output files requested by the user have been copied, <b>mrun</b>
automatically starts a restart run. For this purpose a new <b>mrun</b>
call is issued automatically on the user's local computer; <b>mrun</b>
thus calls itself. The options of this call largely correspond
to those the user selected in the initial call of <b>mrun</b>.
The model restarts, this time reading in the previously written
binary data at the beginning and continuing the run from them. If
the CPU time of this job is again not sufficient to complete
the run, another restart run is started at the end of the job, and so
on, until the time to be simulated by the model is reached.
In this way a set of restart runs can develop, a so-called job chain. The
first run of this chain (model start at t=0) is called the
<b>initial run</b>. </p>
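<p style="line-height: 100%;">How a job chain forms can be pictured
with a minimal Python sketch. It is not part of <b>mrun</b> or PALM:
the function <tt>run_job_chain</tt> and its arguments are hypothetical,
and it assumes for simplicity that every job manages a fixed number of
time steps, whereas the real model uses a variable time step.</p>

```python
def run_job_chain(total_steps, steps_per_job):
    """Return one record per job needed to finish total_steps (illustration only)."""
    jobs = []
    done = 0
    while total_steps > done:
        # The first job of the chain is the initial run, all later ones are restart runs.
        kind = "restart run" if jobs else "initial run"
        done += min(steps_per_job, total_steps - done)
        jobs.append({"kind": kind, "steps_done": done})
    return jobs


chain = run_job_chain(total_steps=2500, steps_per_job=1000)
for job in chain:
    print(job["kind"], job["steps_done"])
```

<p style="line-height: 100%;">With these assumed numbers the chain
consists of one initial run followed by two restart runs.</p>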
<p style="line-height: 100%;">Working with restart runs and their
generation through <b>mrun</b> requires certain entries in the
mrun configuration file and in the parameter file, which are
described in the following. The configuration file must
contain the following entries (example for the IBM Regatta of the
HLRN): </p>
<ul>
  <pre style="line-height: 100%;"><font style="font-size: 10pt;"
 size="2">%write_binary    true    restart</font><br><font
 style="font-size: 10pt;" size="2">#</font><br><a
 href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font
 style="font-size: 10pt;" size="2">   in:job:npe   d3#   ~/palm/current_version/JOBS/$fname/INPUT    _p3d</font><br><font
 style="font-size: 10pt;" size="2">PARIN   in:job:npe   d3f   ~/palm/current_version/JOBS/$fname/INPUT    _p3df</font><br><a
 href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font
 style="font-size: 10pt;" size="2">   in:loc       d3f   ~/palm/current_version/JOBS/$fname/OUTPUT   _d3d</font><br><font
 style="font-size: 10pt;" size="2">#</font><br><a
 href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font
 style="font-size: 10pt;" size="2">  out:loc   restart  ~/palm/current_version/JOBS/$fname/OUTPUT   _d3d</font></pre>
</ul>
<p style="line-height: 100%;">The <b>mrun</b> call for the
initial run of the job chain must look as follows: </p>
<ul>
  <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font
 style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre>
</ul>
<p style="line-height: 100%;">Setting the environment
variable <tt><font style="font-size: 10pt;" size="2">write_binary</font></tt>,
which must be assigned the value <tt><font style="font-size: 10pt;" size="2">true</font></tt>,
is essential. Only then does the model write
binary data for a possible restart run to the local file <tt><a
 href="chapter_3.4.html#BINOUT">BINOUT</a></tt>
at the end of the run. This output file must then of course be stored
as a permanent file by means of an appropriate file connection statement
(last line of the example above). As you can see, both instructions
(variable declaration and connection statement) are carried out
by <b>mrun</b> only if the character string <tt><font
 style="font-size: 10pt;" size="2">restart</font></tt>
is given with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>
in the <span style="font-weight: bold;">mrun</span> call. The example
above can therefore also be used
if no restart runs are intended: in that case the character string
<tt><font style="font-size: 10pt;" size="2">restart</font></tt>
is simply omitted from the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>. </p>
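<p style="line-height: 100%;">The way the strings given with
<tt>-r</tt> select entries of the configuration file can be sketched in
Python as follows. This is only an illustration, not the actual
implementation of <b>mrun</b>: the tuple layout and the function
<tt>active_statements</tt> are assumptions, and only the option
<tt>-r</tt> and the connection statements of the example above are
considered.</p>

```python
# Connection statements of the example above as
# (local name, attributes, activating string, suffix).
# The tuple layout is an assumption made for this sketch only.
CONNECTION_STATEMENTS = [
    ("PARIN", "in:job:npe", "d3#", "_p3d"),
    ("PARIN", "in:job:npe", "d3f", "_p3df"),
    ("BININ", "in:loc", "d3f", "_d3d"),
    ("BINOUT", "out:loc", "restart", "_d3d"),
]


def active_statements(r_option):
    """Return the statements whose third column matches a string given with -r."""
    selected = set(r_option.split())
    return [entry for entry in CONNECTION_STATEMENTS if entry[2] in selected]


for r_option in ("d3# restart", "d3#"):
    names = [entry[0] + entry[3] for entry in active_statements(r_option)]
    print(r_option, names)
```

<p style="line-height: 100%;">In this sketch the string
<tt>restart</tt> switches the BINOUT statement on, and leaving it out
switches that statement off again without any change to the
configuration file.</p>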
<p style="line-height: 100%;">Only if
<tt><font style="font-size: 10pt;" size="2">write_binary=true</font></tt>
is set is the model
instructed to compute the remaining CPU time after each time step and
to stop if the run will not be completed before this time expires.
More precisely, the model stops when the
difference between the available job time (given by the <b>mrun</b>
option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>) and
the time used so far by the job becomes smaller than the time given
by the model variable <a
 href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>.
With the variable <b>termination_time_needed</b> the user specifies
how much time is needed for writing the binary data for restart
runs, as
well as for the subsequent data archiving, transfer of result data,
etc. (as far as this is part of the job). Thus, as soon as the
remaining job time is less than <b>termination_time_needed</b>, the
model stops
time stepping and writes the data for a restart run to the
local binary file BINOUT. The so-called initialization parameters are
also written to this file (see <a href="chapter_4.0.html">chapter
4.0</a>). As a last step the model creates another file with the
local name CONTINUE_RUN. The presence of this file signals to <b>mrun</b>
that a restart run must be started, and leads to the
start of an appropriate job. </p>
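<p style="line-height: 100%;">The stop criterion described above can
be written down as a small Python sketch. The function names
<tt>must_stop</tt> and <tt>steps_until_stop</tt> are hypothetical, and
the numbers used (a constant cost of 4 seconds per time step and a
value of 35 seconds for <b>termination_time_needed</b>) are assumptions
chosen only for illustration; the job time of 900 seconds is taken from
the <b>mrun</b> call shown above.</p>

```python
def must_stop(job_time_limit, time_used, termination_time_needed):
    """True once the remaining job time is smaller than termination_time_needed."""
    remaining = job_time_limit - time_used
    return termination_time_needed > remaining


def steps_until_stop(job_time_limit, cost_per_step, termination_time_needed):
    """Count time steps executed before the criterion fires (constant step cost assumed)."""
    used = 0.0
    steps = 0
    # The check is made after each time step, as described in the text.
    while not must_stop(job_time_limit, used, termination_time_needed):
        used += cost_per_step
        steps += 1
    return steps, used


steps, used = steps_until_stop(job_time_limit=900.0, cost_per_step=4.0,
                               termination_time_needed=35.0)
print(steps, used, 900.0 - used)
```

<p style="line-height: 100%;">Under these assumptions the model stops
after 217 time steps, with 32 seconds of job time left for writing
BINOUT and for the remaining job steps.</p>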
<p style="line-height: 100%;">During the initial
phase of a restart run the model has to perform actions that differ
from those of an initial run. In this
case the model must first read in the binary data written by the
preceding run. It also reads the
initialization parameters from this file, so they do not need
to be given in the parameter file (local name <a
 href="chapter_3.4.html#PARIN">PARIN</a>).
If they are given nevertheless and their values differ from
those of the initial run, they are ignored. There is
exactly one exception to this rule: the
initialization parameter <a
 href="chapter_4.1.html#initializing_actions">initializing_actions</a>
determines whether the job is a restart run or an
initial run. If <b>initializing_actions</b> =
<i>“read_restart_data”</i>, it is a restart
run, otherwise an initial run. It follows that the model needs
two different parameter files
(local name PARIN) for job chains. One is needed for the
initial run and contains all initialization parameters set by
the user; the other is needed for restart runs. The
latter only needs to contain the initialization parameter
<b>initializing_actions</b>, which
must have the value <i>“read_restart_data”</i>
(other initialization parameters with values different from the
initial run may appear in this file, but they are ignored).
The user must therefore create two different parameter files in order
to run job chains. Since the model always expects the
parameter file on the local file <tt>PARIN</tt>, two different file
connection statements must be given for this file in the
configuration file. One may be active only in the initial run,
the other only in restart runs. The <b>mrun</b> call for the
initial run shown above activates the first of the two
connection statements, because the character string <tt><font
 style="font-size: 10pt;" size="2">d3#</font></tt>
given with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>
matches the character
string in the third column of the connection statement. In restart
runs, the following statement must be active instead:</p>
<ul>
  <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font
 style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre>
</ul>
<p style="line-height: 100%;">Since this statement only becomes
active if the option <tt><font
 style="font-size: 10pt;" size="2">-r</font></tt> is given the value
<tt><font style="font-size: 10pt;" size="2">d3f</font></tt>, and since
the <b>mrun</b> call for the restart run is generated
automatically (not by the user), <b>mrun</b> has to
replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt>
of the initial run with <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt>
in the call of the restart run. In fact, in restart
runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt>
characters within the strings given with the options <tt><font
 style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt>,
<tt><font style="font-size: 10pt;" size="2"><font
 face="Cumberland, monospace">-i</font></font></tt>
and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> are
replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>.
</p>
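<p style="line-height: 100%;">The replacement rule can be illustrated
with a short Python sketch. The function <tt>restart_options</tt> and
the representation of the call as a dictionary are assumptions made for
this illustration; they do not reproduce how <b>mrun</b> itself builds
the call.</p>

```python
def restart_options(options):
    """Replace every '#' by 'f' in the strings of -r, -i and -o; keep other options."""
    return {
        flag: value.replace("#", "f") if flag in ("-r", "-i", "-o") else value
        for flag, value in options.items()
    }


# Options of the initial call shown earlier in this chapter.
initial = {"-h": "ibmh", "-d": "abcde", "-t": "900", "-r": "d3# restart"}
restart = restart_options(initial)
print("mrun", " ".join(f'{flag} "{value}"' for flag, value in restart.items()))
```

<p style="line-height: 100%;">Here the string given with
<tt>-r</tt> becomes <tt>d3f restart</tt>, while all other options are
passed on unchanged.</p>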
<p style="line-height: 100%;">For example, the initial run uses
the permanent file </p>
<ul>
  <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre>
</ul>
<p style="line-height: 100%;">and restart runs use the permanent file</p>
<ul style="font-family: monospace;">
  <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font
 style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre>
</ul>
<p style="line-height: 100%;">as parameter file. The
local file <tt>BININ</tt> is made available as an input file only in
restart runs, because
the corresponding file connection statement also contains the
character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt>
in the third column. This is logical and necessary, since BININ is
expected to contain the binary data produced by the model in the
preceding job of the chain, and the initial run does not need these
data. The permanent names of this input file (local name BININ) and
of the corresponding output file (local name BINOUT) are identical and
read </p>
<ul>
  <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font
 style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d</font></pre>
</ul>
<p style="line-height: 100%;">However, after the model has read in
the file produced by the previous job and has written the local file
<tt>BINOUT</tt> at the end of the job, the
restart job does not overwrite this permanent file (<tt><font
 style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt>)
with the new data. Instead, when copying the
output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>),
<b>mrun</b> checks whether
a permanent file with the name <tt><font style="font-size: 10pt;"
 size="2">…/abcde_d3d</font></tt>
already exists. If so, <tt><font
 style="font-size: 10pt;" size="2">BINOUT</font></tt>
is copied to the file
<tt><font style="font-size: 10pt;" size="2"><font
 face="Cumberland, monospace">…/abcde_d3d.1</font></font></tt>.
If this file also exists already, <tt><font
 style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt>
is tried, and so on. For an input file, the highest existing cycle
of the respective permanent file is copied. In the example above this
means: the initial run creates the permanent file
<tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt>,
the first restart run uses this file and creates <tt><font
 style="font-size: 10pt;" size="2">…/abcde_d3d.1</font></tt>,
the second restart run uses <tt><font style="font-size: 10pt;"
 size="2">…/abcde_d3d.1</font></tt> and creates <tt><font style="font-size: 10pt;"
 size="2">…/abcde_d3d.2</font></tt>,
etc. After completion of the job chain the user can still
access all files created by the jobs. This makes it possible, for
example, to repeat the model run of a particular job of the
chain. </p>
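<p style="line-height: 100%;">The cycle-number rule can be sketched
in Python as follows. The functions <tt>output_name</tt> and
<tt>input_name</tt> are hypothetical and are not taken from
<b>mrun</b>; the sketch assumes that cycle numbers are assigned without
gaps and works in a temporary directory rather than in the real
OUTPUT directory.</p>

```python
import os
import tempfile


def output_name(directory, base):
    """Return the first name among base, base.1, base.2, ... that does not exist yet."""
    candidate, cycle = base, 0
    while os.path.exists(os.path.join(directory, candidate)):
        cycle += 1
        candidate = f"{base}.{cycle}"
    return candidate


def input_name(directory, base):
    """Return the highest existing cycle of base, or None if no file exists."""
    best, candidate, cycle = None, base, 0
    while os.path.exists(os.path.join(directory, candidate)):
        best = candidate
        cycle += 1
        candidate = f"{base}.{cycle}"
    return best


with tempfile.TemporaryDirectory() as tmp:
    created = []
    for _ in range(3):  # initial run followed by two restart runs
        name = output_name(tmp, "abcde_d3d")
        open(os.path.join(tmp, name), "w").close()
        created.append(name)
    latest = input_name(tmp, "abcde_d3d")

print(created, latest)
```

<p style="line-height: 100%;">After three jobs the sketch has created
<tt>abcde_d3d</tt>, <tt>abcde_d3d.1</tt> and <tt>abcde_d3d.2</tt>, and a
further restart run would read <tt>abcde_d3d.2</tt>.</p>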
<p style="line-height: 100%;">Restart jobs can therefore be started
not only automatically by <b>mrun</b> but also manually by the
user. This is necessary, for example, whenever it is decided after the
end of a job chain that the simulation must be continued further because
the phenomenon under investigation has not yet reached the desired
state. In such cases the <b>mrun</b> options are the same as
those of the initial call, except that the <tt><font
 style="font-size: 10pt;" size="2">"#"</font></tt> characters in the
arguments of the options <tt><font style="font-size: 10pt;" size="2"><font
 face="Cumberland, monospace">-r</font></font></tt>,
<tt><font style="font-size: 10pt;" size="2"><font
 face="Cumberland, monospace">-i</font></font></tt>
and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> must be
replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>.
</p>
<hr>
<p style="line-height: 100%;"><br>
<font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font
 color="#000080"><img src="left.gif" name="Grafik1" align="bottom"
 border="2" height="32" width="32"></font></a><a href="index.html"><font
 color="#000080"><img src="up.gif" name="Grafik2" align="bottom"
 border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font
 color="#000080"><img src="right.gif" name="Grafik3" align="bottom"
 border="2" height="32" width="32"></font></a></font></font></p>
<p style="line-height: 100%;"><i>Last change:&nbsp;</i> 14/04/05 (SR)</p>
</body>
</html>