| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
---|
| 2 | <html> |
---|
| 3 | <head> |
---|
| 4 | <meta http-equiv="CONTENT-TYPE" |
---|
| 5 | content="text/html; charset=windows-1252"> |
---|
| 6 | <title>PALM chapter 3.3</title> |
---|
| 7 | <meta name="GENERATOR" content="StarOffice 7 (Win32)"> |
---|
| 8 | <meta name="AUTHOR" content="Marcus Oliver Letzel"> |
---|
| 9 | <meta name="CREATED" content="20040728;14053490"> |
---|
| 10 | <meta name="CHANGED" content="20041112;14150257"> |
---|
| 11 | <meta name="KEYWORDS" content="parallel LES model"> |
---|
| 12 | <style> |
---|
| 13 | <!-- |
---|
| 14 | @page { size: 21cm 29.7cm } |
---|
| 15 | --> |
---|
| 16 | </style> |
---|
| 17 | </head> |
---|
| 18 | <body dir="ltr" lang="en-US"> |
---|
| 19 | <h3 style="line-height: 100%;">3.3 Initialization and restart runs</h3> |
---|
| 20 | <p style="line-height: 100%;">A job started by <b>mrun</b> will |
---|
| 21 | - according to its requested computing time, its memory size |
---|
| 22 | requirement and |
---|
| 23 | the number of necessary processing elements (on parallel computers) - |
---|
| 24 | be queued by the queuing-system of the remote computer into a suitable |
---|
| 25 | job |
---|
| 26 | class which fulfills these requirements. Each job class permits only |
---|
| 27 | jobs with certain maximum requirements (e.g. |
---|
| 28 | the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt> |
---|
| 29 | on the IBM Regatta "hanni" of the HLRN permits only jobs with no more |
---|
| 30 | than 7200 seconds required computing time and with using no more than |
---|
| 31 | 32 |
---|
| 32 | processing elements). The job classes are important for the scheduling |
---|
| 33 | process of the computer. Jobs with small requirements usually |
---|
| 34 | come to execution |
---|
| 35 | very fast, jobs with higher requirements must wait longer (sometimes |
---|
| 36 | several days). </p> |
---|
| 37 | <p style="line-height: 100%;">Before the start of a model run the user |
---|
| 38 | must estimate how much CPU time the model will need for the simulation. |
---|
| 39 | The necessary time in seconds has to be indicated with the mrun |
---|
| 40 | <b>option</b> <tt><a |
---|
| 41 | href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt> |
---|
| 42 | and has an influence on the job class into which the job is queued. Due |
---|
| 43 | to the fact that the model usually uses a variable |
---|
| 44 | time step and thus the number of time steps to be executed and |
---|
| 45 | consequently the time needed by the model is not |
---|
| 46 | known at the beginning, this can be estimated only very roughly in |
---|
| 47 | many cases. So it may happen that the model needs more time than |
---|
| 48 | indicated for the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u>,</tt> |
---|
| 49 | which normally leads to an abort of the job as soon as the available |
---|
| 50 | CPU time is consumed. In principle one could solve this problem by |
---|
| 51 | setting a very generously estimated value for <u><font |
---|
| 52 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>, |
---|
| 53 | but this will possibly lead to the disadvantage that the queued job has |
---|
| 54 | to wait longer for execution.<br> |
---|
| 55 | </p> |
---|
| 56 | <p style="line-height: 100%;">To avoid this problem <b>mrun </b>offers |
---|
| 57 | the possibility of so-called <b>restart runs</b>. During the model |
---|
| 58 | run PALM continuously examines how much time is left for the |
---|
| 59 | execution of the job. If the run is not completed and finished shortly |
---|
| 60 | before |
---|
| 61 | expiration of this time, the model stops and writes down the values |
---|
| 62 | of (nearly) all model variables in binary form to a file (local name |
---|
| 63 | <a href="chapter_3.4.html#BINOUT">BINOUT</a>). |
---|
| 64 | After copying the output files required by the user, <b>mrun</b> |
---|
| 65 | automatically starts a restart run. For this purpose a new <b>mrun</b> |
---|
| 66 | call is set off automatically on the local computer of the user; <b>mrun</b> |
---|
| 67 | thus calls itself. The options with this call correspond to a large |
---|
| 68 | extent to those which the user had selected with his initial call of <b>mrun</b>. |
---|
| 69 | The model restarts and this time at the beginning it reads in the |
---|
| 70 | binary data written before and continues the run with them. If in |
---|
| 71 | this job the CPU time is not sufficient either, in order to terminate |
---|
| 72 | the run, at the end of the job another restart run is started, etc., |
---|
| 73 | until the time which shall be simulated by the model is reached. |
---|
| 74 | Thus a set of restart runs can develop - a so-called job chain. The |
---|
| 75 | first run of this chain (model start at t=0) is called |
---|
| 76 | <b>initial run</b>. </p> |
---|
| 77 | <p style="line-height: 100%;">Working with restart runs and their |
---|
| 78 | generation through <b>mrun</b> requires certain entries in the |
---|
| 79 | mrun-configuration file and in the parameter file, which are |
---|
| 80 | described and explained in the following. The configuration file must |
---|
| 81 | contain the following entries (example for the IBM Regatta of the |
---|
| 82 | HLRN): </p> |
---|
| 83 | <ul> |
---|
| 84 | <pre style="line-height: 100%;"><font style="font-size: 10pt;" |
---|
| 85 | size="2">%write_binary true restart</font><br><font |
---|
| 86 | style="font-size: 10pt;" size="2">#</font><br><a |
---|
| 87 | href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font |
---|
| 88 | style="font-size: 10pt;" size="2"> in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d</font><br><font |
---|
| 89 | style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font><br><a |
---|
| 90 | href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font |
---|
| 91 | style="font-size: 10pt;" size="2"> in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font><br><font |
---|
| 92 | style="font-size: 10pt;" size="2">#</font><br><a |
---|
| 93 | href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font |
---|
| 94 | style="font-size: 10pt;" size="2"> out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font></pre> |
---|
| 95 | </ul> |
---|
| 96 | <p style="line-height: 100%;">The <b>mrun</b> call for the |
---|
| 97 | initial run of the job chain must look as follows: </p> |
---|
| 98 | <ul> |
---|
| 99 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
| 100 | style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre> |
---|
| 101 | </ul> |
---|
| 102 | <p style="line-height: 100%;">The specification of the environment |
---|
| 103 | variable <tt><tt><font style="font-size: 10pt;" size="2">writ</font></tt></tt><tt><tt><font |
---|
| 104 | style="font-size: 10pt;" size="2">e_binary</font><font |
---|
| 105 | style="font-size: 11pt;" size="2">, </font></tt></tt>which must be |
---|
| 106 | assigned the value <tt><tt><font style="font-size: 10pt;" size="2">true</font></tt></tt>, |
---|
| 107 | is essential. Only in this case the model writes |
---|
| 108 | binary-coded data for a possible restart run to the local file <tt><tt><a |
---|
| 109 | href="chapter_3.4.html#BINOUT">BINOUT</a></tt></tt> |
---|
| 110 | at the end of the run. Then of course this output file must be stored |
---|
| 111 | on a permanent file with an appropriate file connection statement |
---|
| 112 | (last line of the example above). As you can see, both instructions |
---|
| 113 | (variable declaration and connection statements) are only carried out |
---|
| 114 | by <b>mrun</b>, if the character string <tt><tt><font |
---|
| 115 | style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
| 116 | is given for the option <tt><font style="font-size: 10pt;" size="2">-r</font> |
---|
| 117 | </tt>in the <span style="font-weight: bold;">mrun</span> call. Thus |
---|
| 118 | the example above can also be used |
---|
| 119 | if no restart runs are intended. In such cases the character string |
---|
| 120 | <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
| 121 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
| 122 | can simply be omitted. </p> |
---|
| 123 | <p style="line-height: 100%;">Only by the specification of |
---|
| 124 | <tt><font style="font-size: 10pt;" size="2">write_binary=true</font><font |
---|
| 125 | style="font-size: 11pt;" size="2"> |
---|
| 126 | </font><font face="Thorndale, serif">the</font></tt> model is |
---|
| 127 | instructed to compute the remaining CPU time after each time step and |
---|
| 128 | stop, if the run is not going to be completed and finished shortly |
---|
| 129 | before expiration of |
---|
| 130 | this time. Actually the stop takes place when the |
---|
| 131 | difference from the available job time (determined by the <b>mrun</b> |
---|
| 132 | option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>) and |
---|
| 133 | the time used so far by the job becomes smaller than the time given |
---|
| 134 | by the model variable <a |
---|
| 135 | href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>. |
---|
| 136 | With the variable <b>termination_time_needed </b>the user determines, |
---|
| 137 | how much time is needed for binary copying of the data for restart |
---|
| 138 | runs, as |
---|
| 139 | well as for the following data archiving and transfer of result data |
---|
| 140 | etc. (as long as this is part of the job). Thus, as soon as the |
---|
| 141 | remaining job time is less than <b>termination_time_needed</b>, the |
---|
| 142 | model stops |
---|
| 143 | the time step procedure and copies the data for a restart run to the |
---|
| 144 | local binary file BINOUT. The so-called initialization parameters are |
---|
| 145 | also written to this file (see <a href="chapter_4.0.html">chapter |
---|
| 146 | 4.0</a>). In a last step the model produces another file with the |
---|
| 147 | local name CONTINUE_RUN. The presence of this file signals <b>mrun</b> |
---|
| 148 | the fact that a restart run must be started and leads to the |
---|
| 149 | start of an appropriate job. </p> |
---|
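<p style="line-height: 100%;">The stop criterion described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not PALM's actual implementation: the function name <tt>must_stop_for_restart</tt> and the reserve of 35 seconds are hypothetical, while the job time limit of 900 seconds corresponds to the <tt>-t 900</tt> used in the example call.</p>

```python
def must_stop_for_restart(job_time_limit, time_used, termination_time_needed):
    """Return True when the remaining job time is smaller than the reserve.

    job_time_limit          -- available job time in seconds (mrun option -t)
    time_used               -- time consumed so far by the job, in seconds
    termination_time_needed -- time reserved for writing restart data,
                               archiving and file transfer, in seconds
    """
    remaining = job_time_limit - time_used
    return remaining < termination_time_needed


# Hypothetical values: 900 s job time, 35 s reserved for termination.
print(must_stop_for_restart(900.0, 850.0, 35.0))  # False: 50 s remain
print(must_stop_for_restart(900.0, 870.0, 35.0))  # True: 30 s remain
```

<p style="line-height: 100%;">In the model this check is made after each time step; once it is true, the time step procedure ends and the restart data are written.</p>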
| 150 | <p style="line-height: 100%;"><font color="#000000">During the initial |
---|
| 151 | phase of a restart ru</font>n different actions than during the initial |
---|
| 152 | phase of an initial run of the model are necessary. In this |
---|
| 153 | case the model must read in the binary data written by the preceding |
---|
| 154 | run at the beginning of the run. Beyond that it also reads the |
---|
| 155 | initialization parameters from this file. Therefore these do not need |
---|
| 156 | to be indicated in the parameter file (local name <a |
---|
| 157 | href="chapter_3.4.html#PARIN">PARIN</a>). |
---|
| 158 | If they are indicated nevertheless and if their value deviates from |
---|
| 159 | their value of the initial run, then this is ignored. There is |
---|
| 160 | exactly one exception to this rule: with the help of the |
---|
| 161 | initialization parameter <a |
---|
| 162 | href="chapter_4.1.html#initializing_actions">initializing_actions</a> |
---|
| 163 | it is determined whether the job is a restart run or an |
---|
| 164 | initial run. If <b>initializing_actions</b> = |
---|
| 165 | <i>read_restart_data</i>, then it is a restart |
---|
| 166 | run, otherwise an initial run. The previous remarks make it |
---|
| 167 | clear that the model obviously needs two different parameter files |
---|
| 168 | (local name PARIN) for the case of job chains. One is needed for the |
---|
| 169 | initial run and contains all initialization parameters set by |
---|
| 170 | the user and the other one is needed for restart runs. The |
---|
| 171 | last one only contains the initialization parameter |
---|
| 172 | <b>initializing_actions</b> (also, initialization |
---|
| 173 | parameters with values different from the initial run may appear in |
---|
| 174 | this file, but they will be ignored), which |
---|
| 175 | must have the value <i>read_restart_data</i>. |
---|
| 176 | Therefore the user must produce two different parameter files if he |
---|
| 177 | wants to operate job chains. Since the model always expects the |
---|
| 178 | parameter file on the local file <tt>PARIN</tt>, two different file |
---|
| 179 | connection statements must be given for this file in the |
---|
| 180 | configuration file. One may be active only at the initial run, |
---|
| 181 | the other one only at restart runs. The <b>mrun </b>call for the |
---|
| 182 | initial run shown above activates the first of the two |
---|
| 183 | specified connection statements, because the character string <tt><font |
---|
| 184 | style="font-size: 10pt;" size="2">d3#</font></tt> |
---|
| 185 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
| 186 | coincides with the character |
---|
| 187 | string in the third column of the connection statement. Obviously |
---|
| 188 | the next statement must be active</p> |
---|
| 189 | <ul> |
---|
| 190 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
| 191 | style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre> |
---|
| 192 | </ul> |
---|
| 193 | <p style="line-height: 100%;">with the restart runs. Given that t<font |
---|
| 194 | color="#000000">his statement only gets</font> active if the option <tt><font |
---|
| 195 | style="font-size: 10pt;" size="2">-r</font></tt> is given the value |
---|
| 196 | <tt><font style="font-size: 11pt;" size="2">d3f</font></tt> and that |
---|
| 197 | the <b>mrun</b> call for this restart run is produced |
---|
| 198 | automatically (thus not by the user), <b>mrun</b> obviously has to |
---|
| 199 | replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt> |
---|
| 200 | of the initial run with <tt><tt><font style="font-size: 10pt;" size="2">"d3f"</font> |
---|
| 201 | </tt></tt>within the call of this restart run. Actually, with restart |
---|
| 202 | runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
| 203 | characters within the strings given for the options <tt><font |
---|
| 204 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font |
---|
| 205 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace"> |
---|
| 206 | , </font></font><tt><font style="font-size: 10pt;" size="2"><font |
---|
| 207 | face="Cumberland, monospace">-i</font></font></tt> |
---|
| 208 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> are |
---|
| 209 | replaced by <tt><font style="font-size: 10pt;" size="2">f</font></tt>. |
---|
| 210 | </p> |
---|
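<p style="line-height: 100%;">The replacement of the <tt>"#"</tt> characters can be illustrated with a short Python sketch. It is not taken from <b>mrun</b> itself: the function <tt>restart_options</tt> and the representation of the call as a dictionary are assumptions made for this example. Only the arguments of <tt>-r</tt>, <tt>-i</tt> and <tt>-o</tt> are changed, as described above.</p>

```python
def restart_options(options):
    """Return the options for a restart run.

    options maps an option flag (e.g. "-r") to its argument string.
    Every "#" in the arguments of -r, -i and -o is replaced by "f";
    all other options are passed on unchanged.
    """
    affected = ("-r", "-i", "-o")
    return {
        flag: (value.replace("#", "f") if flag in affected else value)
        for flag, value in options.items()
    }


# Options of the initial call shown in this chapter.
initial = {"-h": "ibmh", "-d": "abcde", "-t": "900", "-r": "d3# restart"}
print(restart_options(initial)["-r"])  # d3f restart
```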
| 211 | <p style="line-height: 100%;">For example, for the initial run |
---|
| 212 | the permanent file </p> |
---|
| 213 | <ul> |
---|
| 214 | <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre> |
---|
| 215 | </ul> |
---|
| 216 | <p style="line-height: 100%;">and for restart runs the permanent file<span |
---|
| 217 | style="font-family: monospace;"> </span></p> |
---|
| 218 | <ul style="font-family: monospace;"> |
---|
| 219 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
| 220 | style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre> |
---|
| 221 | </ul> |
---|
| 222 | <p style="line-height: 100%;">is used. Only with restart runs the |
---|
| 223 | local file <tt>BININ</tt> is made available as input file, because |
---|
| 224 | the appropriate file connection statement also contains the |
---|
| 225 | character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt> |
---|
| 226 | in the third column. This is logical and necessary since in BININ the |
---|
| 227 | binary data, produced by the model of the preceding job of the chain, |
---|
| 228 | are expected and the initial run does not need these |
---|
| 229 | data. The permanent names of this input file (local name BININ) and |
---|
| 230 | the corresponding output file (local name BINOUT) are identical and |
---|
| 231 | read </p> |
---|
| 232 | <ul> |
---|
| 233 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
| 234 | style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d.</font></pre> |
---|
| 235 | </ul> |
---|
| 236 | <p style="line-height: 100%;">However, after the file produced by the |
---|
| 237 | previous job was read in by the model and after the local file |
---|
| 238 | <tt>BINOUT </tt>was produced at the end of the job, the |
---|
| 239 | restart job does not overwrite this permanent file (<tt>…/<font |
---|
| 240 | style="font-size: 10pt;" size="2">abcde_d3d</font></tt>) |
---|
| 241 | with the new data. Instead of that, it is examined whether already |
---|
| 242 | a permanent file with the name <tt><font style="font-size: 10pt;" |
---|
| 243 | size="2">…/abcde_d3d</font> |
---|
| 244 | <font face="Thorndale, serif">exists </font></tt>when copying the |
---|
| 245 | output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>) |
---|
| 246 | of <b>mrun</b>. If this is the case, <tt><font |
---|
| 247 | style="font-size: 10pt;" size="2">BINOUT</font></tt> |
---|
| 248 | is copied to the file<font style="font-size: 10pt;" size="2"><font |
---|
| 249 | face="Cumberland, monospace"> |
---|
| 250 | </font></font><tt><font style="font-size: 10pt;" size="2"><font |
---|
| 251 | face="Cumberland, monospace">…/abcde_d3d.1</font></font></tt><font |
---|
| 252 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">.</font></font> |
---|
| 253 | Even if this file is already present, <tt><font |
---|
| 254 | style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt> |
---|
| 255 | is tried etc. For an input file the highest existing cycle |
---|
| 256 | of the respective permanent file is copied. In the example above this |
---|
| 257 | means: the initial run creates the permanent file |
---|
| 258 | <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt><font |
---|
| 259 | style="font-size: 11pt;" size="2">,</font> |
---|
| 260 | the first restart run uses this file and creates <tt>…/<font |
---|
| 261 | style="font-size: 10pt;" size="2">abcde_d3d.1</font></tt>, |
---|
| 262 | the second restart run creates <tt><font style="font-size: 10pt;" |
---|
| 263 | size="2">…/abcde_d3d.2</font></tt><font style="font-size: 10pt;" |
---|
| 264 | size="2"> |
---|
| 265 | </font>etc. After completion of the job chain the user can still |
---|
| 266 | access all files created by the jobs. This makes it possible for the |
---|
| 267 | user for example to restart the model run of a certain job of the job |
---|
| 268 | chain again. </p> |
---|
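<p style="line-height: 100%;">The cycle numbering of permanent files can be sketched as follows. This Python example only mirrors the rule described above and is not the code used by <b>mrun</b>: the function names are hypothetical, the set <tt>existing</tt> stands in for the directory contents, and it is assumed that cycle numbers are contiguous.</p>

```python
def next_output_name(base, existing):
    """Return the permanent name under which an output file is stored.

    The plain name is used if it is still free; otherwise the first free
    cycle number (.1, .2, ...) is appended.
    """
    if base not in existing:
        return base
    cycle = 1
    while f"{base}.{cycle}" in existing:
        cycle += 1
    return f"{base}.{cycle}"


def latest_input_name(base, existing):
    """Return the highest existing cycle of base, or None if none exists."""
    best = base if base in existing else None
    cycle = 1
    while f"{base}.{cycle}" in existing:
        best = f"{base}.{cycle}"
        cycle += 1
    return best


# Initial run plus two restart runs, each storing its BINOUT file.
files = set()
for _ in range(3):
    files.add(next_output_name("abcde_d3d", files))

print(sorted(files))                          # ['abcde_d3d', 'abcde_d3d.1', 'abcde_d3d.2']
print(latest_input_name("abcde_d3d", files))  # abcde_d3d.2
```

<p style="line-height: 100%;">A further restart run would therefore read <tt>abcde_d3d.2</tt> as BININ and store its own BINOUT as <tt>abcde_d3d.3</tt>.</p>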
| 269 | <p style="line-height: 100%;">Therefore restart jobs can not only be |
---|
| 270 | started automatically through <b>mrun</b>, but also manually by the |
---|
| 271 | user. This is necessary e.g. whenever after the end of a job chain |
---|
| 272 | it is decided that the simulation must be continued further, because |
---|
| 273 | the phenomenon which should be examined did not reach the desired |
---|
| 274 | state yet. In such cases the <b>mrun</b> options completely |
---|
| 275 | correspond to those of the initial call; simply the <tt><font |
---|
| 276 | style="font-size: 10pt;" size="2">"#"</font></tt> characters in the |
---|
| 277 | arguments of options <tt><font style="font-size: 10pt;" size="2"><font |
---|
| 278 | face="Cumberland, monospace">-r</font></font></tt><font |
---|
| 279 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">, |
---|
| 280 | </font></font><tt><font style="font-size: 10pt;" size="2"><font |
---|
| 281 | face="Cumberland, monospace">-i</font></font></tt> |
---|
| 282 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> must be |
---|
| 283 | replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>. |
---|
| 284 | </p> |
---|
| 285 | <hr> |
---|
| 286 | <p style="line-height: 100%;"><br> |
---|
| 287 | <font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font |
---|
| 288 | color="#000080"><img src="left.gif" name="Grafik1" align="bottom" |
---|
| 289 | border="2" height="32" width="32"></font></a><a href="index.html"><font |
---|
| 290 | color="#000080"><img src="up.gif" name="Grafik2" align="bottom" |
---|
| 291 | border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font |
---|
| 292 | color="#000080"><img src="right.gif" name="Grafik3" align="bottom" |
---|
| 293 | border="2" height="32" width="32"></font></a></font></font></p> |
---|
| 294 | <p style="line-height: 100%;"><i>Last change: </i> 14/04/05 (SR)</p> |
---|
| 295 | </body> |
---|
| 296 | </html> |
---|