source: palm/trunk/DOC/app/chapter_3.3.html @ 1411

Last change on this file since 1411 was 62, checked in by raasch, 18 years ago

Id string added to all html files

  • Property svn:keywords set to Id
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head>
<meta http-equiv="CONTENT-TYPE" content="text/html; charset=windows-1252"><title>PALM
chapter 3.3</title> <meta name="GENERATOR" content="StarOffice 7 (Win32)"> <meta name="AUTHOR" content="Marcus Oliver Letzel"> <meta name="CREATED" content="20040728;14053490"> <meta name="CHANGED" content="20041112;14150257"> <meta name="KEYWORDS" content="parallel LES model"> <style>
<!--
@page { size: 21cm 29.7cm }
-->
</style></head>

<body style="direction: ltr;" lang="en-US"><h3 style="line-height: 100%;">3.3 Initialization and restart
runs</h3>
<p style="line-height: 100%;">A job started by <b>mrun</b>
is queued by the queuing system of the remote computer into a suitable
job class, according to its requested computing time, its memory
requirement and
the number of processing elements it needs (on parallel computers).
Each job class permits only
jobs up to certain maximum requirements (e.g.
the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt>
on the IBM Regatta "hanni" of the HLRN permits only jobs that request no more
than 7200 seconds of computing time and no more than
32
processing elements). The job classes are important for the scheduling
process of the computer. Jobs with small requirements usually
start executing
very quickly, whereas jobs with higher requirements must wait longer (sometimes
several days). </p>
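<p style="line-height: 100%;">The class limits quoted above amount to a simple check on the requested resources. The following Python sketch is only an illustration: the function name and the dictionary layout are hypothetical, and only the two <tt>cdev</tt> limits (7200 seconds, 32 processing elements) are taken from the text. The actual limits are defined by the queuing system of the remote computer.</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">
```python
# Hypothetical illustration, not part of mrun or PALM.
# Only the cdev limits below are taken from the text above.
CDEV_LIMITS = {"max_cpu_seconds": 7200, "max_pes": 32}


def fits_job_class(cpu_seconds, pes, limits=CDEV_LIMITS):
    """Return True if a job's requests stay within the class limits."""
    exceeds_time = cpu_seconds > limits["max_cpu_seconds"]
    exceeds_pes = pes > limits["max_pes"]
    return not (exceeds_time or exceeds_pes)


print(fits_job_class(900, 8))     # request within the cdev limits
print(fits_job_class(10000, 8))   # too much computing time for cdev
```
</pre></ul>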
<p style="line-height: 100%;">Before starting a model
run, the user
must estimate how much CPU time the model will need for the simulation.
The required time in seconds has to be specified with the <b>mrun</b>
option <tt><a href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt>
and influences the job class into which the job is queued. Because
the model usually uses a variable
time step, the number of time steps to be executed, and
consequently the time needed by the model, is not
known at the beginning, so in many cases the required time can be
estimated only very roughly. It may therefore happen that the model needs more time than
specified with the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u></tt>,
which normally causes the job to abort as soon as the available
CPU time is consumed. In principle, one could solve this problem by
setting a very generous value for <u><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>,
but then the queued job may have
to wait longer for execution.<br>
</p><p style="line-height: 100%;">To avoid this
problem, <b>mrun</b> offers
so-called <b>restart runs</b>. During
the model
run, PALM continuously checks how much time is left for the
execution of the job. If the run is not going to be completed
before
this time expires, the model stops shortly beforehand and writes the values
of (nearly) all model variables in binary form to a file (local name
<a href="chapter_3.4.html#BINOUT">BINOUT</a>).
After copying the output files required by the user, <b>mrun</b>
automatically starts a restart run. For this purpose a new <b>mrun</b>
call is issued automatically on the user's local computer; <b>mrun</b>
thus calls itself. The options of this call largely correspond
to those which the user selected in the initial call of <b>mrun</b>.
The model restarts, and this time it first reads in the
binary data written before and continues the run with them. If
the CPU time of this job is again not sufficient to complete
the run, another restart run is started at the end of the job, and so on,
until the time to be simulated by the model is reached.
Thus a set of restart runs can develop, a so-called job chain. The
first run of this chain (model start at t=0) is called the
<b>initial run</b>. </p>
<p style="line-height: 100%;">Working with restart runs
and generating them
through <b>mrun</b> requires certain entries in
the
mrun configuration file and in the parameter file, which are
explained in the following. The configuration file must
contain the following entries (example for the IBM Regatta of the
HLRN): </p>
<ul> <pre style="line-height: 100%;"><font style="font-size: 10pt;" size="2">%write_binary true restart</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font style="font-size: 10pt;" size="2"> in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d</font><br><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font><br><a href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font style="font-size: 10pt;" size="2"> in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font style="font-size: 10pt;" size="2"> out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font></pre></ul>
<p style="line-height: 100%;">The <b>mrun</b>
call for the
initial run of the job chain must look as follows: </p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre></ul>
<p style="line-height: 100%;">Setting the
environment
variable <tt><font style="font-size: 10pt;" size="2">write_binary</font></tt>,
which
must be
assigned the value <tt><font style="font-size: 10pt;" size="2">true</font></tt>,
is essential. Only in this case does the model write
binary-coded data for a possible restart run to the local file <tt><a href="chapter_3.4.html#BINOUT">BINOUT</a></tt>
at the end of the run. This output file must then be stored
in a permanent file by means of an appropriate file connection statement
(last line of the example above). As the example shows, both instructions
(variable declaration and connection statement) are only carried out
by <b>mrun</b> if the character string <tt><font style="font-size: 10pt;" size="2">restart</font></tt>
is given for the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>
in the <span style="font-weight: bold;">mrun</span>
call. Thus
the example above can also be used
if no restart runs are intended. In such cases the character string
<tt><font style="font-size: 10pt;" size="2">restart</font></tt>
can simply be omitted from the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>. </p>
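<p style="line-height: 100%;">The way the strings given with <tt>-r</tt> switch configuration lines on and off can be sketched in a few lines of Python. This is not mrun's actual code: the function name is hypothetical, and the sketch assumes that a line is active exactly when the string in its third column equals one of the strings passed with <tt>-r</tt>. The comment lines (<tt>#</tt>) of the example are left out.</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">
```python
# Hypothetical sketch of the activation rule, not mrun's actual code.
# Assumption: a line is active when the string in its third column
# equals one of the strings passed with the option -r.
CONFIG_LINES = [
    "%write_binary true restart",
    "PARIN in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d",
    "PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df",
    "BININ in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d",
    "BINOUT out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d",
]


def active_lines(config_lines, r_option):
    """Return the lines whose third column matches a string from -r."""
    selected = set(r_option.split())
    return [line for line in config_lines if line.split()[2] in selected]


# With -r "d3# restart" the variable declaration, the first PARIN
# statement and the BINOUT statement are selected.
for line in active_lines(CONFIG_LINES, "d3# restart"):
    print(line)
```
</pre></ul>
<p style="line-height: 100%;">Calling the same function with only <tt>"d3#"</tt> leaves just the first PARIN statement active, which corresponds to a run without restart output.</p>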
<p style="line-height: 100%;">Only if
<tt><font style="font-size: 10pt;" size="2">write_binary=true</font></tt>
is specified is the
model
instructed to compute the remaining CPU time after each time step and
to stop shortly before this time expires if the run is not going to be
completed by then. More precisely, the stop takes place when the
difference between the available job time (determined by the <b>mrun</b>
option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>)
and
the time used so far by the job becomes smaller than the time given
by the model variable <a href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>.
With the variable <b>termination_time_needed</b> the user
specifies
how much time is needed for the binary output of the data for restart
runs, as
well as for the subsequent data archiving, transfer of result data,
etc. (as long as this is part of the job). Thus, as soon as the
remaining job time is less than <b>termination_time_needed</b>,
the
model stops
the time stepping and writes the data for a restart run to the
local binary file BINOUT. The so-called initialization parameters are
also written to this file (see <a href="chapter_4.0.html">chapter
4.0</a>). In a last step the model creates another file with the
local name CONTINUE_RUN. The presence of this file signals to <b>mrun</b>
that a restart run must be started and leads to the
start of an appropriate job. </p>
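<p style="line-height: 100%;">The stop criterion described above can be illustrated with a small Python simulation. It is a sketch under stated assumptions and not PALM source code: the function names are hypothetical, every time step is assumed to cost the same CPU time, and all numbers in the example call (including the value used for <b>termination_time_needed</b>) are made up for illustration.</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">
```python
# Hypothetical sketch of the check described above, not PALM source code.
def must_stop(job_time, time_used, termination_time_needed):
    """True if the remaining job time is less than termination_time_needed."""
    remaining = job_time - time_used
    return termination_time_needed > remaining


def run_job(job_time, step_cost, steps_wanted, termination_time_needed):
    """Simulate one job; return (steps_done, restart_required)."""
    time_used = 0.0
    steps_done = 0
    while steps_done != steps_wanted:
        time_used += step_cost   # assumption: constant cost per time step
        steps_done += 1
        finished = steps_done == steps_wanted
        if not finished and must_stop(job_time, time_used,
                                      termination_time_needed):
            # At this point PALM would write BINOUT and create the
            # file CONTINUE_RUN, which tells mrun to start a restart run.
            return steps_done, True
    return steps_done, False


# Illustrative numbers only: 900 s job time (as with -t 900),
# 10 s per time step, 35 s reserved for the final output.
print(run_job(job_time=900.0, step_cost=10.0, steps_wanted=200,
              termination_time_needed=35.0))
```
</pre></ul>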
<p style="line-height: 100%;"><font color="#000000">During
the initial
phase of a restart run</font>, different actions are necessary than during the
initial
phase of an initial run of the model. In this
case the model must first read in the binary data written by the preceding
run. In addition, it reads the
initialization parameters from this file. Therefore these do not need
to be given in the parameter file (local name <a href="chapter_3.4.html#PARIN">PARIN</a>).
If they are given nevertheless and their values differ from
those of the initial run, the new values are ignored. There is
exactly one exception to this rule: the
initialization parameter <a href="chapter_4.1.html#initializing_actions">initializing_actions</a>
determines whether the job is a restart run or an
initial run. If <b>initializing_actions</b> =
&ldquo;<i>read_restart_data</i>&rdquo;, then it is
a restart
run, otherwise an initial run. </p>
<p style="line-height: 100%;">The previous remarks make it
clear that the model needs two different parameter files
(local name PARIN) for job chains. One is needed for the
initial run and contains all initialization parameters set by
the user; the other one is needed for restart runs. The
latter only needs to contain the initialization parameter
<b>initializing_actions</b>, which
must have the value &ldquo;<i>read_restart_data</i>&rdquo;
(initialization
parameters with values different from the initial run may also appear in
this file, but they will be ignored).
Therefore the user must provide two different parameter files in order
to run job chains. Since the model always expects the
parameter file in the local file <tt>PARIN</tt>, two
different file
connection statements must be given for this file in the
configuration file. One may be active only for the initial run,
the other one only for restart runs. The <b>mrun</b> call
for the
initial run shown above activates the first of the two
specified connection statements, because the character string <tt><font style="font-size: 10pt;" size="2">d3#</font></tt>
given with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt>
coincides with the character
string in the third column of the connection statement. Obviously
the next statement must be active</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre></ul>
<p style="line-height: 100%;">for the restart runs. Given
that <font color="#000000">this statement only becomes</font>
active if the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> is given the value
<tt><font style="font-size: 10pt;" size="2">d3f</font></tt>
and that
the <b>mrun</b> call for a restart run is generated
automatically (and not by the user), <b>mrun</b>
has to
replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt>
of the initial run with <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt>
in the call of the restart run. In fact,
for restart
runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt>
characters within the strings given for the options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt>,
<tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt>
and <tt><font style="font-size: 10pt;" size="2">-o</font></tt>
are
replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>.
</p>
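<p style="line-height: 100%;">The replacement rule for restart runs is easy to state in code. The following Python sketch is only an illustration and not taken from mrun: the function name is hypothetical, and holding the options in a dictionary is an assumption made for the example.</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">
```python
# Hypothetical sketch, not mrun's actual code. Assumption: the options are
# held in a dictionary that maps the option name to its argument string.
def restart_options(options):
    """Return a copy with "#" replaced by "f" in the -r, -i and -o arguments."""
    converted = dict(options)
    for name in ("-r", "-i", "-o"):
        if name in converted:
            converted[name] = converted[name].replace("#", "f")
    return converted


# Options of the initial call shown earlier in this chapter.
initial = {"-h": "ibmh", "-d": "abcde", "-t": "900", "-r": "d3# restart"}
print(restart_options(initial))
```
</pre></ul>
<p style="line-height: 100%;">All other options are passed on unchanged, so the restart call differs from the initial call only in the strings that select the file connection statements.</p>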
<p style="line-height: 100%;">For example, for the initial
run
the permanent file </p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre></ul>
<p style="line-height: 100%;">and for restart runs the
permanent file</p>
<ul style="font-family: monospace;"> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre></ul>
<p style="line-height: 100%;">is used. The
local file <tt>BININ</tt> is made available as an input file only for restart
runs,
because
the corresponding file connection statement also contains the
character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt>
in the third column. This is logical and necessary, since BININ is expected to contain the
binary data produced by the model in the preceding job of the chain,
and the initial run does not need these
data. The permanent names of this input file (local name BININ) and
the corresponding output file (local name BINOUT) are identical and
read </p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d</font></pre></ul>
<p style="line-height: 100%;">However, after the file
produced by the
previous job has been read in by the model and the local file
<tt>BINOUT</tt> has been produced at the end of the job, the
restart job does not overwrite this permanent file (<tt>&hellip;/<font style="font-size: 10pt;" size="2">abcde_d3d</font></tt>)
with the new data. Instead, when <b>mrun</b> copies the
output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>),
it checks whether
a permanent file with the name <tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d</font></tt>
already exists. If this is the case, <tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>
is copied to the file
<tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">&hellip;/abcde_d3d.1</font></font></tt>.
If this file is already present as well, <tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d.2</font></tt>
is tried, and so on. For an input file, the highest existing cycle
of the respective permanent file is copied. In the example above this
means: the initial run creates the permanent file
<tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d</font></tt>,
the first restart run uses this file and creates <tt>&hellip;/<font style="font-size: 10pt;" size="2">abcde_d3d.1</font></tt>,
the second restart run creates <tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d.2</font></tt>,
etc. After completion of the job chain the user can still
access all files created by the jobs. This makes it possible,
for example, to restart the model run from a certain job of the job
chain. </p>
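<p style="line-height: 100%;">The cycle numbering described above can be sketched as follows. This Python example is a hypothetical illustration and not mrun's implementation: the function names are invented, the directory part of the file names is left out, and the set of existing permanent files is simply passed in instead of being read from disk.</p>
<ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">
```python
# Hypothetical sketch of the cycle numbering described above, not mrun's code.
# Assumption: the names of the existing permanent files are passed in as a set.
def output_name(base, existing):
    """Return the first unused name: base, base.1, base.2, ..."""
    if base not in existing:
        return base
    cycle = 1
    while f"{base}.{cycle}" in existing:
        cycle += 1
    return f"{base}.{cycle}"


def input_name(base, existing):
    """Return the existing file with the highest cycle number, or None."""
    best_cycle, best_name = -1, None
    for name in existing:
        suffix = name[len(base) + 1:]
        if name == base:
            cycle = 0
        elif name.startswith(base + ".") and suffix.isdigit():
            cycle = int(suffix)
        else:
            continue
        if cycle > best_cycle:
            best_cycle, best_name = cycle, name
    return best_name


# Walk through a short job chain. The initial run has no input file,
# so None is shown for it.
files = set()
for run in ("initial run", "first restart run", "second restart run"):
    source = input_name("abcde_d3d", files)
    target = output_name("abcde_d3d", files)
    files.add(target)
    print(run, "reads", source, "and writes", target)
```
</pre></ul>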
<p style="line-height: 100%;">Restart jobs can therefore
not only be
started automatically by <b>mrun</b>, but also
manually by the
user. This is necessary, for example, whenever it is decided after the end of a job chain
that the simulation must be continued further because
the phenomenon under investigation has not yet reached the desired
state. In such cases the <b>mrun</b> options
correspond exactly
to those of the initial call, except that the <tt><font style="font-size: 10pt;" size="2">"#"</font></tt>
characters in the
arguments of the options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt>,
<tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt>
and <tt><font style="font-size: 10pt;" size="2">-o</font></tt>
must be
replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>.
</p>
<hr><p style="line-height: 100%;"><br>
<font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font color="#000080"><img src="left.gif" name="Grafik1" align="bottom" border="2" height="32" width="32"></font></a><a href="index.html"><font color="#000080"><img src="up.gif" name="Grafik2" align="bottom" border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font color="#000080"><img src="right.gif" name="Grafik3" align="bottom" border="2" height="32" width="32"></font></a></font></font></p><p style="line-height: 100%;"><i>Last change:&nbsp;</i>
$Id: chapter_3.3.html 62 2007-03-13 02:52:40Z suehring $</p>
</body></html>