[5] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
---|
[62] | 2 | <html><head> |
---|
| 3 | <meta http-equiv="CONTENT-TYPE" content="text/html; charset=windows-1252"><title>PALM |
---|
| 4 | chapter 3.3</title> <meta name="GENERATOR" content="StarOffice 7 (Win32)"> <meta name="AUTHOR" content="Marcus Oliver Letzel"> <meta name="CREATED" content="20040728;14053490"> <meta name="CHANGED" content="20041112;14150257"> <meta name="KEYWORDS" content="parallel LES model"> <style> |
---|
| 5 | <!-- |
---|
| 6 | @page { size: 21cm 29.7cm } |
---|
| 7 | --> |
---|
| 8 | </style></head> |
---|
| 9 | |
---|
| 10 | <body style="direction: ltr;" lang="en-US"><h3 style="line-height: 100%;">3.3 Initialization and restart |
---|
| 11 | runs</h3> |
---|
| 12 | <p style="line-height: 100%;">A job started by <b>mrun</b> |
---|
| 13 | will |
---|
[5] | 14 | - according to its requested computing time, its memory size |
---|
| 15 | requirement and |
---|
| 16 | the number of necessary processing elements (on parallel computers) - |
---|
| 17 | be queued by the queuing system of the remote computer into a suitable |
---|
| 18 | job |
---|
| 19 | class which fulfills these requirements. Each job class permits only |
---|
| 20 | jobs with certain maximum requirements (e.g. |
---|
| 21 | the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt> |
---|
| 22 | on the IBM Regatta "hanni" of the HLRN permits only jobs with no more |
---|
| 23 | than 7200 seconds of computing time and using no more than |
---|
| 24 | 32 |
---|
| 25 | processing elements). The job classes are important for the scheduling |
---|
| 26 | process of the computer. Jobs with small requirements usually |
---|
| 27 | start |
---|
| 28 | very quickly, whereas jobs with higher requirements must wait longer (sometimes |
---|
| 29 | several days). </p> |
---|
[62] | 30 | <p style="line-height: 100%;">Before the start of a model |
---|
| 31 | run the user |
---|
[5] | 32 | must estimate how much CPU time the model will need for the simulation. |
---|
| 33 | The necessary time in seconds has to be indicated with the mrun |
---|
[62] | 34 | <b>option</b> <tt><a href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt> |
---|
[5] | 35 | and influences the job class into which the job is queued. Because |
---|
| 36 | the model usually uses a variable |
---|
| 37 | time step, the number of time steps to be executed and |
---|
| 38 | consequently the time needed by the model are not |
---|
| 39 | known at the beginning, so the required time can be estimated only very roughly in |
---|
| 40 | many cases. The model may therefore need more time than |
---|
| 41 | indicated for the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u>,</tt> |
---|
| 42 | which normally leads to an abort of the job as soon as the available |
---|
| 43 | CPU time is consumed. In principle one could solve this problem by |
---|
[62] | 44 | setting a very generously estimated value for <u><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>, |
---|
[5] | 45 | but this will possibly lead to the disadvantage that the queued job has |
---|
| 46 | to wait longer for execution.<br> |
---|
[62] | 47 | </p><p style="line-height: 100%;">To avoid this |
---|
| 48 | problem <b>mrun </b>offers |
---|
| 49 | the possibility of so-called <b>restart runs</b>. During |
---|
| 50 | the model |
---|
[5] | 51 | run PALM continuously examines how much time is left for the |
---|
| 52 | execution of the job. If the run will not be completed |
---|
| 53 | before |
---|
| 54 | this time expires, the model stops shortly beforehand and writes the values |
---|
| 55 | of (nearly) all model variables in binary form to a file (local name |
---|
| 56 | <a href="chapter_3.4.html#BINOUT">BINOUT</a>). |
---|
| 57 | After copying the output files required by the user, <b>mrun</b> |
---|
| 58 | automatically starts a restart run. For this purpose a new <b>mrun</b> |
---|
| 59 | call is issued automatically on the user's local computer; <b>mrun</b> |
---|
| 60 | thus calls itself. The options of this call largely correspond |
---|
| 61 | to those which the user selected in the initial call of <b>mrun</b>. |
---|
| 62 | The model restarts, reads in the |
---|
| 63 | binary data written before, and continues the run from that state. If |
---|
| 64 | the CPU time of this job is not sufficient to complete |
---|
| 65 | the run either, another restart run is started at the end of the job, and so on, |
---|
| 66 | until the time to be simulated by the model is reached. |
---|
| 67 | Thus a set of restart runs can develop - a so-called job chain. The |
---|
| 68 | first run of this chain (model start at t=0) is called |
---|
| 69 | <b>initial run</b>. </p> |
---|
[62] | 70 | <p style="line-height: 100%;">Working with restart runs |
---|
| 71 | and their |
---|
| 72 | generation through <b>mrun</b> requires certain entries in |
---|
| 73 | the |
---|
[5] | 74 | mrun-configuration file and in the parameter file, which are |
---|
| 75 | described and explained in the following. The configuration file must |
---|
| 76 | contain the following entries (example for the IBM Regatta of the |
---|
| 77 | HLRN): </p> |
---|
[62] | 78 | <ul> <pre style="line-height: 100%;"><font style="font-size: 10pt;" size="2">%write_binary true restart</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font style="font-size: 10pt;" size="2"> in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d</font><br><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font><br><a href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font style="font-size: 10pt;" size="2"> in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font style="font-size: 10pt;" size="2"> out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font></pre></ul> |
---|
| 79 | <p style="line-height: 100%;">The <b>mrun</b> |
---|
| 80 | call for the |
---|
[5] | 81 | initial run of the job chain must look as follows: </p> |
---|
[62] | 82 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre></ul> |
---|
| 83 | <p style="line-height: 100%;">The specification of the |
---|
| 84 | environment |
---|
| 85 | variable <tt><tt><font style="font-size: 10pt;" size="2">writ</font></tt></tt><tt><tt><font style="font-size: 10pt;" size="2">e_binary</font><font style="font-size: 11pt;" size="2">, </font></tt></tt>which |
---|
| 86 | must be |
---|
[5] | 87 | assigned the value <tt><tt><font style="font-size: 10pt;" size="2">true</font></tt></tt>, |
---|
| 88 | is essential. Only in this case does the model write |
---|
[62] | 89 | binary-coded data for a possible restart run to the local file <tt><tt><a href="chapter_3.4.html#BINOUT">BINOUT</a></tt></tt> |
---|
[5] | 90 | at the end of the run. This output file must then be saved |
---|
| 91 | to a permanent file by an appropriate file connection statement |
---|
| 92 | (last line of the example above). As you can see, both instructions |
---|
| 93 | (variable declaration and connection statements) are only carried out |
---|
[62] | 94 | by <b>mrun</b>, if the character string <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
[5] | 95 | is given for the option <tt><font style="font-size: 10pt;" size="2">-r</font> |
---|
[62] | 96 | </tt>in the <span style="font-weight: bold;">mrun</span> |
---|
| 97 | call. Thus |
---|
[5] | 98 | the example above can also be used |
---|
| 99 | if no restart runs are intended. In such cases the character string |
---|
| 100 | <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
| 101 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
| 102 | can simply be omitted. </p> |
---|
| 103 | <p style="line-height: 100%;">Only by the specification of |
---|
[62] | 104 | <tt><font style="font-size: 10pt;" size="2">write_binary=true</font><font style="font-size: 11pt;" size="2"> |
---|
| 105 | </font><font face="Thorndale, serif">the</font></tt> |
---|
| 106 | model is |
---|
[5] | 107 | instructed to compute the remaining CPU time after each time step and |
---|
| 108 | to stop shortly before the available time expires if the run is not going to be completed |
---|
| 109 | by then. |
---|
| 110 | More precisely, the stop takes place when the |
---|
| 111 | difference between the available job time (determined by the <b>mrun</b> |
---|
[62] | 112 | option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>) |
---|
| 113 | and |
---|
[5] | 114 | the time used so far by the job becomes smaller than the time given |
---|
[62] | 115 | by the model variable <a href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>. |
---|
| 116 | With the variable <b>termination_time_needed </b>the user |
---|
| 117 | specifies |
---|
[5] | 118 | how much time is needed for writing the binary data for restart |
---|
| 119 | runs, as |
---|
| 120 | well as for the following data archiving and transfer of result data |
---|
| 121 | etc. (as long as this is part of the job). Thus, as soon as the |
---|
[62] | 122 | remaining job time is less than <b>termination_time_needed</b>, |
---|
| 123 | the |
---|
[5] | 124 | model stops |
---|
| 125 | the time step procedure and copies the data for a restart run to the |
---|
| 126 | local binary file BINOUT. The so-called initialization parameters are |
---|
| 127 | also written to this file (see <a href="chapter_4.0.html">chapter |
---|
| 128 | 4.0</a>). In a last step the model produces another file with the |
---|
| 129 | local name CONTINUE_RUN. The presence of this file signals to <b>mrun</b> |
---|
| 130 | that a restart run must be started and leads to the |
---|
| 131 | start of an appropriate job. </p> |
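<p style="line-height: 100%;">The stop criterion described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the model's actual code; the function name and the value of 35 seconds for <tt>termination_time_needed</tt> are assumptions chosen for the example.</p>

```python
def must_stop_for_restart(job_time, time_used, termination_time_needed):
    """Return True once the remaining job time is smaller than the reserve.

    job_time                -- available job time in seconds (mrun option -t)
    time_used               -- time used so far by the job in seconds
    termination_time_needed -- reserve for writing BINOUT, archiving, transfer
    """
    remaining = job_time - time_used
    return remaining < termination_time_needed


# Hypothetical values: -t 900 and a reserve of 35 seconds.
print(must_stop_for_restart(900.0, 850.0, 35.0))  # 50 s left: keep running
print(must_stop_for_restart(900.0, 870.0, 35.0))  # 30 s left: stop and write BINOUT
```

<p style="line-height: 100%;">In the model this check is made after each time step, so the reserve should also cover the duration of one time step.</p>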
---|
[62] | 132 | <p style="line-height: 100%;"><font color="#000000">During |
---|
| 133 | the initial |
---|
| 134 | phase of a restart ru</font>n, different actions are necessary than during the |
---|
| 135 | initial |
---|
[5] | 136 | phase of an initial run of the model. At the beginning of a restart run, |
---|
| 137 | the model must read in the binary data written by the preceding |
---|
| 138 | run. Beyond that it also reads the |
---|
| 139 | initialization parameters from this file. Therefore these do not need |
---|
[62] | 140 | to be indicated in the parameter file (local name <a href="chapter_3.4.html#PARIN">PARIN</a>). |
---|
[5] | 141 | If they are given nevertheless and their values deviate from |
---|
| 142 | those of the initial run, the deviating values are ignored. There is |
---|
| 143 | exactly one exception to this rule: with the help of the |
---|
[62] | 144 | initialization parameter <a href="chapter_4.1.html#initializing_actions">initializing_actions</a> |
---|
[5] | 145 | it is determined whether the job is a restart run or an |
---|
| 146 | initial run. If <b>initializing_actions</b> = |
---|
[62] | 147 | “<i>read_restart_data</i>”, then it is |
---|
| 148 | a restart |
---|
[5] | 149 | run, otherwise an initial run. The previous remarks make it |
---|
| 150 | clear that the model obviously needs two different parameter files |
---|
| 151 | (local name PARIN) for the case of job chains. One is needed for the |
---|
| 152 | initial run and contains all initialization parameters set by |
---|
| 153 | the user and the other one is needed for restart runs. The |
---|
| 154 | latter contains only the initialization parameter |
---|
| 155 | <b>initializing_actions</b> (also, initialization |
---|
| 156 | parameters with values different from the initial run may appear in |
---|
| 157 | this file, but they will be ignored), which |
---|
[62] | 158 | must have the value “<i>read_restart_data</i>”. |
---|
[5] | 159 | Therefore the user must produce two different parameter files if he |
---|
| 160 | wants to operate job chains. Since the model always expects the |
---|
[62] | 161 | parameter file on the local file <tt>PARIN</tt>, two |
---|
| 162 | different file |
---|
[5] | 163 | connection statements must be given for this file in the |
---|
| 164 | configuration file. One must be active only for the initial run, |
---|
[62] | 165 | the other one only for restart runs. The <b>mrun </b>call |
---|
| 166 | for the |
---|
[5] | 167 | initial run shown above activates the first of the two |
---|
[62] | 168 | specified connection statements, because the character string <tt><font style="font-size: 10pt;" size="2">d3#</font></tt> |
---|
[5] | 169 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
| 170 | coincides with the character |
---|
| 171 | string in the third column of the connection statement. Obviously |
---|
| 172 | the next statement must be active</p> |
---|
[62] | 173 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre></ul> |
---|
| 174 | <p style="line-height: 100%;">with the restart runs. Given |
---|
| 175 | that t<font color="#000000">his statement only becomes</font> |
---|
| 176 | active if the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> is given the value |
---|
| 177 | <tt><font style="font-size: 11pt;" size="2">d3f</font></tt> |
---|
| 178 | and that |
---|
[5] | 179 | the <b>mrun</b> call for this restart run is produced |
---|
[62] | 180 | automatically (thus not by the user), <b>mrun</b> |
---|
| 181 | obviously has to |
---|
[5] | 182 | replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt> |
---|
| 183 | of the initial run with <tt><tt><font style="font-size: 10pt;" size="2">"d3f"</font> |
---|
[62] | 184 | </tt></tt>within the call of this restart run. Actually, |
---|
| 185 | with restart |
---|
[5] | 186 | runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
[62] | 187 | characters within the strings given for the options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace"> |
---|
| 188 | , </font></font><tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt> |
---|
| 189 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> |
---|
| 190 | are |
---|
| 191 | replaced by <tt><font style="font-size: 10pt;" size="2">“f”</font></tt>. |
---|
[5] | 192 | </p> |
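<p style="line-height: 100%;">The replacement rule can be illustrated with a short Python sketch. It is not part of <b>mrun</b>; the function names are hypothetical, and the activation test assumes that a connection statement is active when its third column equals one of the blank-separated strings given for <tt>-r</tt>.</p>

```python
def restart_option_args(options):
    """Return a copy of the options with '#' replaced by 'f' for -r, -i and -o."""
    replaced = {}
    for flag, arg in options.items():
        if flag in ("-r", "-i", "-o"):
            arg = arg.replace("#", "f")
        replaced[flag] = arg
    return replaced


def is_active(third_column, r_argument):
    """Assumption: a statement is active if its third column equals one of
    the blank-separated strings given for the option -r."""
    return third_column in r_argument.split()


# Options of the initial call shown earlier in this chapter.
initial = {"-h": "ibmh", "-d": "abcde", "-t": "900", "-r": "d3# restart"}
restart = restart_option_args(initial)

print(restart["-r"])                    # d3f restart
print(is_active("d3#", initial["-r"]))  # True:  PARIN ... _p3d  in the initial run
print(is_active("d3f", initial["-r"]))  # False: PARIN ... _p3df in the initial run
print(is_active("d3f", restart["-r"]))  # True:  PARIN ... _p3df in restart runs
```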
---|
[62] | 193 | <p style="line-height: 100%;">For example, for the initial |
---|
| 194 | run |
---|
[5] | 195 | the permanent file </p> |
---|
[62] | 196 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre></ul> |
---|
| 197 | <p style="line-height: 100%;">and for restart runs the |
---|
| 198 | permanent file<span style="font-family: monospace;"> </span></p> |
---|
| 199 | <ul style="font-family: monospace;"> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre></ul> |
---|
| 200 | <p style="line-height: 100%;">is used. Only with restart |
---|
| 201 | runs the |
---|
| 202 | local file <tt>BININ</tt> is made available as input file, |
---|
| 203 | because |
---|
[5] | 204 | the appropriate file connection statement also contains the |
---|
| 205 | character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt> |
---|
| 206 | in the third column. This is logical and necessary since in BININ the |
---|
| 207 | binary data, produced by the model of the preceding job of the chain, |
---|
| 208 | are expected and the initial run does not need these |
---|
| 209 | data. The permanent names of this input file (local name BININ) and |
---|
| 210 | the corresponding output file (local name BINOUT) are identical and |
---|
| 211 | read </p> |
---|
[62] | 212 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d.</font></pre></ul> |
---|
| 213 | <p style="line-height: 100%;">However, after the file |
---|
| 214 | produced by the |
---|
[5] | 215 | previous job was read in by the model and after the local file |
---|
| 216 | <tt>BINOUT </tt>was produced at the end of the job, the |
---|
[62] | 217 | restart job does not overwrite this permanent file (<tt>…/<font style="font-size: 10pt;" size="2">abcde_d3d</font></tt>) |
---|
[5] | 218 | with the new data. Instead, it is checked whether |
---|
[62] | 219 | a permanent file with the name <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font> |
---|
| 220 | <font face="Thorndale, serif">already exists </font></tt>when |
---|
| 221 | the |
---|
[5] | 222 | output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>) |
---|
[62] | 223 | is copied by <b>mrun</b>. If this is the case, <tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt> |
---|
| 224 | is copied to the file<font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace"> |
---|
| 225 | </font></font><tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">…/abcde_d3d.1</font></font></tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">.</font></font> |
---|
| 226 | If this file is also already present, <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt> |
---|
[5] | 227 | is tried, and so on. For an input file the highest existing cycle |
---|
| 228 | of the respective permanent file is copied. In the example above this |
---|
| 229 | means: the initial run creates the permanent file |
---|
[62] | 230 | <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt><font style="font-size: 11pt;" size="2">,</font> |
---|
| 231 | the first restart run uses this file and creates <tt>…/<font style="font-size: 10pt;" size="2">abcde_d3d.1</font></tt>, |
---|
| 232 | the second restart run creates <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt><font style="font-size: 10pt;" size="2"> |
---|
[5] | 233 | </font>etc. After completion of the job chain the user can still |
---|
| 234 | access all files created by the jobs. This allows the |
---|
| 235 | user, for example, to repeat the model run of a particular job of the job |
---|
| 236 | chain. </p> |
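<p style="line-height: 100%;">The cycle numbering of permanent files can be sketched as follows. This is a minimal Python illustration of the rule described above, not the implementation used by <b>mrun</b>; the function names are hypothetical and the set <tt>files</tt> stands in for the contents of the OUTPUT directory.</p>

```python
def output_cycle_name(base, existing):
    """Name under which a new output file is stored: the base name if it is
    still free, otherwise the first free name base.1, base.2, ..."""
    if base not in existing:
        return base
    cycle = 1
    while f"{base}.{cycle}" in existing:
        cycle += 1
    return f"{base}.{cycle}"


def input_cycle_name(base, existing):
    """Name of the highest existing cycle of an input file, or None."""
    cycles = []
    for name in existing:
        if name == base:
            cycles.append(0)
        elif name.startswith(base + ".") and name[len(base) + 1:].isdigit():
            cycles.append(int(name[len(base) + 1:]))
    if not cycles:
        return None
    top = max(cycles)
    return base if top == 0 else f"{base}.{top}"


files = set()  # stand-in for the permanent files that already exist

# Initial run: no input file, BINOUT is stored as abcde_d3d.
files.add(output_cycle_name("abcde_d3d", files))

# First restart run: reads abcde_d3d, stores abcde_d3d.1.
first_input = input_cycle_name("abcde_d3d", files)
files.add(output_cycle_name("abcde_d3d", files))

# Second restart run: reads abcde_d3d.1, stores abcde_d3d.2.
second_input = input_cycle_name("abcde_d3d", files)
files.add(output_cycle_name("abcde_d3d", files))

print(first_input, second_input, sorted(files))
```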
---|
[62] | 237 | <p style="line-height: 100%;">Therefore restart jobs can |
---|
| 238 | not only be |
---|
| 239 | started automatically through <b>mrun</b>, but also |
---|
| 240 | manually by the |
---|
[5] | 241 | user. This is necessary, for example, when after the end of a job chain |
---|
| 242 | it is decided that the simulation must be continued because |
---|
| 243 | the phenomenon under investigation has not yet reached the desired |
---|
[62] | 244 | state. In such cases the <b>mrun</b> options |
---|
| 245 | completely |
---|
| 246 | correspond to those of the initial call; only the <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
| 247 | characters in the |
---|
| 248 | arguments of options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">, |
---|
| 249 | </font></font><tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt> |
---|
| 250 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> |
---|
| 251 | must be |
---|
[5] | 252 | replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>. |
---|
| 253 | </p> |
---|
[62] | 254 | <hr><p style="line-height: 100%;"><br> |
---|
| 255 | <font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font color="#000080"><img src="left.gif" name="Grafik1" align="bottom" border="2" height="32" width="32"></font></a><a href="index.html"><font color="#000080"><img src="up.gif" name="Grafik2" align="bottom" border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font color="#000080"><img src="right.gif" name="Grafik3" align="bottom" border="2" height="32" width="32"></font></a></font></font></p><p style="line-height: 100%;"><i>Last change: </i> |
---|
| 256 | $Id: chapter_3.3.html 62 2007-03-13 02:52:40Z fricke $</p> |
---|
| 257 | </body></html> |
---|