1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
---|
2 | <html> |
---|
3 | <head> |
---|
4 | <meta http-equiv="CONTENT-TYPE" |
---|
5 | content="text/html; charset=windows-1252"> |
---|
6 | <title>PALM chapter 3.3</title> |
---|
7 | <meta name="GENERATOR" content="StarOffice 7 (Win32)"> |
---|
8 | <meta name="AUTHOR" content="Marcus Oliver Letzel"> |
---|
9 | <meta name="CREATED" content="20040728;14053490"> |
---|
10 | <meta name="CHANGED" content="20041112;14150257"> |
---|
11 | <meta name="KEYWORDS" content="parallel LES model"> |
---|
12 | <style> |
---|
13 | <!-- |
---|
14 | @page { size: 21cm 29.7cm } |
---|
15 | --> |
---|
16 | </style> |
---|
17 | </head> |
---|
18 | <body dir="ltr" lang="en-US"> |
---|
19 | <h3 style="line-height: 100%;">3.3 Initialization and restart runs</h3> |
---|
20 | <p style="line-height: 100%;">A job started by <b>mrun</b> will |
---|
21 | - according to its requested computing time, its memory size |
---|
22 | requirement and |
---|
23 | the number of necessary processing elements (on parallel computers) - |
---|
24 | be queued by the queuing-system of the remote computer into a suitable |
---|
25 | job |
---|
26 | class which fulfills these requirements. Each job class permits only |
---|
27 | jobs with certain maximum requirements (e.g. |
---|
28 | the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt> |
---|
29 | on the IBM Regatta "hanni" of the HLRN permits only jobs that request no more |
---|
30 | than 7200 seconds of computing time and use no more than |
---|
31 | 32 |
---|
32 | processing elements). The job classes are important for the scheduling |
---|
33 | process of the computer. Jobs with small requirements usually |
---|
34 | start |
---|
35 | very quickly, whereas jobs with higher requirements must wait longer (sometimes |
---|
36 | several days). </p> |
---|
37 | <p style="line-height: 100%;">Before the start of a model run the user |
---|
38 | must estimate how much CPU time the model will need for the simulation. |
---|
39 | The necessary time in seconds has to be indicated with the mrun |
---|
40 | <b>option</b> <tt><a |
---|
41 | href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt> |
---|
42 | and has an influence on the job class into which the job is queued. Because |
---|
43 | the model usually uses a variable |
---|
44 | time step, the number of time steps to be executed and |
---|
45 | consequently the time needed by the model are not |
---|
46 | known at the beginning, so the required time can be estimated only very roughly in |
---|
47 | many cases. It may therefore happen that the model needs more time than |
---|
48 | indicated for the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u>,</tt> |
---|
49 | which normally leads to an abort of the job as soon as the available |
---|
50 | CPU time is consumed. In principle one could solve this problem by |
---|
51 | setting a very generously estimated value for <u><font |
---|
52 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>, |
---|
53 | but this will possibly lead to the disadvantage that the queued job has |
---|
54 | to wait longer for execution.<br> |
---|
55 | </p> |
---|
56 | <p style="line-height: 100%;">To avoid this problem <b>mrun </b>offers |
---|
57 | the possibility of so-called <b>restart runs</b>. During the model |
---|
58 | run PALM continuously examines how much time is left for the |
---|
59 | execution of the job. If the run has not been completed shortly |
---|
60 | before |
---|
61 | this time expires, the model stops and writes the values |
---|
62 | of (nearly) all model variables in binary form to a file (local name |
---|
63 | <a href="chapter_3.4.html#BINOUT">BINOUT</a>). |
---|
64 | After copying the output files required by the user, <b>mrun</b> |
---|
65 | automatically starts a restart run. For this purpose a new <b>mrun</b> |
---|
66 | call is set off automatically on the local computer of the user; <b>mrun</b> |
---|
67 | thus calls itself. The options of this call largely correspond |
---|
68 | to those which the user selected in the initial call of <b>mrun</b>. |
---|
69 | The model restarts, and this time it begins by reading in the |
---|
70 | binary data written before and continues the run with them. If |
---|
71 | the CPU time of this job is not sufficient to complete the run either, |
---|
72 | another restart run is started at the end of the job, and so on, |
---|
73 | until the time to be simulated by the model is reached. |
---|
74 | Thus a set of restart runs can develop, a so-called job chain. The |
---|
75 | first run of this chain (model start at t=0) is called the |
---|
76 | <b>initial run</b>. </p> |
---|
77 | <p style="line-height: 100%;">Working with restart runs and their |
---|
78 | generation through <b>mrun</b> requires certain entries in the |
---|
79 | mrun-configuration file and in the parameter file, which are |
---|
80 | described and explained in the following. The configuration file must |
---|
81 | contain the following entries (example for the IBM Regatta of the |
---|
82 | HLRN): </p> |
---|
83 | <ul> |
---|
84 | <pre style="line-height: 100%;"><font style="font-size: 10pt;" |
---|
85 | size="2">%write_binary true restart</font><br><font |
---|
86 | style="font-size: 10pt;" size="2">#</font><br><a |
---|
87 | href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font |
---|
88 | style="font-size: 10pt;" size="2"> in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d</font><br><font |
---|
89 | style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font><br><a |
---|
90 | href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font |
---|
91 | style="font-size: 10pt;" size="2"> in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font><br><font |
---|
92 | style="font-size: 10pt;" size="2">#</font><br><a |
---|
93 | href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font |
---|
94 | style="font-size: 10pt;" size="2"> out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font></pre> |
---|
95 | </ul> |
---|
96 | <p style="line-height: 100%;">The <b>mrun</b> call for the |
---|
97 | initial run of the job chain must look as follows: </p> |
---|
98 | <ul> |
---|
99 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
100 | style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre> |
---|
101 | </ul> |
---|
102 | <p style="line-height: 100%;">The specification of the environment |
---|
103 | variable |
---|
104 | <tt><font style="font-size: 10pt;" size="2">write_binary</font></tt>, |
---|
105 | which must be |
---|
106 | assigned the value <tt><font style="font-size: 10pt;" size="2">true</font></tt>, |
---|
107 | is essential. Only in this case does the model write |
---|
108 | binary-coded data for a possible restart run to the local file <tt><tt><a |
---|
109 | href="chapter_3.4.html#BINOUT">BINOUT</a></tt></tt> |
---|
110 | at the end of the run. Then of course this output file must be stored |
---|
111 | on a permanent file with an appropriate file connection statement |
---|
112 | (last line of the example above). As you can see, both instructions |
---|
113 | (variable declaration and connection statements) are only carried out |
---|
114 | by <b>mrun</b>, if the character string <tt><tt><font |
---|
115 | style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
116 | is given for the option <tt><font style="font-size: 10pt;" size="2">-r</font> |
---|
117 | </tt>in the <span style="font-weight: bold;">mrun</span> call. Thus |
---|
118 | the example above can also be used |
---|
119 | if no restart runs are intended. In such cases the character string |
---|
120 | <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
121 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
122 | can simply be omitted. </p> |
---|
123 | <p style="line-height: 100%;">Only if |
---|
124 | <tt><font style="font-size: 10pt;" size="2">write_binary=true</font></tt> |
---|
125 | is specified |
---|
126 | is the model |
---|
127 | instructed to compute the remaining CPU time after each time step and |
---|
128 | to stop if the run cannot be completed |
---|
129 | before |
---|
130 | this time expires. More precisely, the stop takes place when the |
---|
131 | difference between the available job time (determined by the <b>mrun</b> |
---|
132 | option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>) and |
---|
133 | the time used so far by the job becomes smaller than the time given |
---|
134 | by the model variable <a |
---|
135 | href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>. |
---|
136 | With the variable <b>termination_time_needed </b>the user determines, |
---|
137 | how much time is needed for binary copying of the data for restart |
---|
138 | runs, as |
---|
139 | well as for the following data archiving and transfer of result data |
---|
140 | etc. (as long as this is part of the job). Thus, as soon as the |
---|
141 | remaining job time is less than <b>termination_time_needed</b>, the |
---|
142 | model stops |
---|
143 | the time step procedure and copies the data for a restart run to the |
---|
144 | local binary file BINOUT. The so-called initialization parameters are |
---|
145 | also written to this file (see <a href="chapter_4.0.html">chapter |
---|
146 | 4.0</a>). In a last step the model produces another file with the |
---|
147 | local name CONTINUE_RUN. The presence of this file signals to <b>mrun</b> |
---|
148 | that a restart run must be started and leads to the |
---|
149 | start of an appropriate job. </p> |
---|
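<p style="line-height: 100%;">The stop criterion described above can be
sketched as follows. This is only an illustrative Python sketch under stated
assumptions, not PALM's actual code: the function name
<tt>must_stop_for_restart</tt>, its argument names and the numbers used are
hypothetical; only <tt>termination_time_needed</tt> and the <b>mrun</b>
option <tt>-t</tt> are taken from the text.</p>

```python
def must_stop_for_restart(job_time_limit, time_used, termination_time_needed):
    """Return True if the time stepping should stop so restart data can be written.

    job_time_limit: job time requested with the mrun option -t (seconds)
    time_used: time consumed by the job so far (seconds)
    termination_time_needed: time reserved for writing BINOUT, archiving,
        file transfer etc. (seconds)
    """
    remaining = job_time_limit - time_used
    # Stop as soon as the remaining time drops below the reserved time.
    return termination_time_needed > remaining


# Hypothetical numbers: 900 s requested, 35 s reserved for termination.
print(must_stop_for_restart(900.0, 850.0, 35.0))  # 50 s left
print(must_stop_for_restart(900.0, 870.0, 35.0))  # 30 s left
```

<p style="line-height: 100%;">In the model this check would be evaluated
after each time step; the sketch only shows the comparison itself.</p>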
150 | <p style="line-height: 100%;">During the initial |
---|
151 | phase of a restart run, different actions are necessary than during the initial |
---|
152 | phase of an initial run of the model. In this |
---|
153 | case the model must read in the binary data written by the preceding |
---|
154 | run at the beginning of the run. In addition, it reads the |
---|
155 | initialization parameters from this file. Therefore these do not need |
---|
156 | to be given in the parameter file (local name <a |
---|
157 | href="chapter_3.4.html#PARIN">PARIN</a>). |
---|
158 | If they are given nevertheless and their values differ from |
---|
159 | those of the initial run, they are ignored. There is |
---|
160 | exactly one exception to this rule: with the help of the |
---|
161 | initialization parameter <a |
---|
162 | href="chapter_4.1.html#initializing_actions">initializing_actions</a> |
---|
163 | it is determined whether the job is a restart run or an |
---|
164 | initial run. If <b>initializing_actions</b> = |
---|
165 | <i>read_restart_data</i>, then it is a restart |
---|
166 | run, otherwise an initial run. The previous remarks make it |
---|
167 | clear that the model obviously needs two different parameter files |
---|
168 | (local name PARIN) for the case of job chains. One is needed for the |
---|
169 | initial run and contains all initialization parameters set by |
---|
170 | the user, and the other one is needed for restart runs. The |
---|
171 | latter contains only the initialization parameter |
---|
172 | <b>initializing_actions</b> (also, initialization |
---|
173 | parameters with values different from the initial run may appear in |
---|
174 | this file, but they will be ignored), which |
---|
175 | must have the value <i>read_restart_data</i>. |
---|
176 | Therefore the user must produce two different parameter files if he |
---|
177 | wants to operate job chains. Since the model always expects the |
---|
178 | parameter file on the local file <tt>PARIN</tt>, two different file |
---|
179 | connection statements must be given for this file in the |
---|
180 | configuration file. One may be active only at the initial run, |
---|
181 | the other one only at restart runs. The <b>mrun </b>call for the |
---|
182 | initial run shown above activates the first of the two |
---|
183 | specified connection statements, because the character string <tt><font |
---|
184 | style="font-size: 10pt;" size="2">d3#</font></tt> |
---|
185 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
186 | coincides with the character |
---|
187 | string in the third column of the connection statement. Obviously |
---|
188 | the next statement must be active</p> |
---|
189 | <ul> |
---|
190 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
191 | style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre> |
---|
192 | </ul> |
---|
193 | <p style="line-height: 100%;">with the restart runs. Given that t<font |
---|
194 | color="#000000">his statement only gets</font> active if the option <tt><font |
---|
195 | style="font-size: 10pt;" size="2">-r</font></tt> is given the value |
---|
196 | <tt><font style="font-size: 11pt;" size="2">d3f</font></tt> and that |
---|
197 | the <b>mrun</b> call for this restart run is produced |
---|
198 | automatically (thus not by the user), <b>mrun</b> obviously has to |
---|
199 | replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt> |
---|
200 | of the initial run with <tt><tt><font style="font-size: 10pt;" size="2">"d3f"</font> |
---|
201 | </tt></tt>within the call of this restart run. Actually, with restart |
---|
202 | runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
203 | characters within the strings given for the options <tt><font |
---|
204 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font |
---|
205 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace"> |
---|
206 | , </font></font><tt><font style="font-size: 10pt;" size="2"><font |
---|
207 | face="Cumberland, monospace">-i</font></font></tt> |
---|
208 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> are |
---|
209 | replaced by <tt><font style="font-size: 10pt;" size="2">f</font></tt>. |
---|
210 | </p> |
---|
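<p style="line-height: 100%;">The activation of file connection statements
and the replacement of <tt>"#"</tt> by <tt>"f"</tt> can be illustrated with
a short Python sketch. It is not the implementation used by <b>mrun</b>: the
function names and the tuple layout are hypothetical, and it assumes that the
string given with <tt>-r</tt> is split at blanks and compared with the third
column of each statement. The statements themselves are those of the example
configuration file above.</p>

```python
def restart_activation_string(activation_string):
    """Replace every '#' by 'f', as described for restart runs."""
    return activation_string.replace("#", "f")


def active_statements(statements, activation_string):
    """Select the statements whose third column matches a string given with -r."""
    keys = activation_string.split()
    return [s for s in statements if s[2] in keys]


# (local name, attributes, activation string, path, suffix)
STATEMENTS = [
    ("PARIN", "in:job:npe", "d3#", "~/palm/current_version/JOBS/$fname/INPUT", "_p3d"),
    ("PARIN", "in:job:npe", "d3f", "~/palm/current_version/JOBS/$fname/INPUT", "_p3df"),
    ("BININ", "in:loc", "d3f", "~/palm/current_version/JOBS/$fname/OUTPUT", "_d3d"),
    ("BINOUT", "out:loc", "restart", "~/palm/current_version/JOBS/$fname/OUTPUT", "_d3d"),
]

initial = "d3# restart"                       # value of -r in the initial call
restart = restart_activation_string(initial)  # value of -r in the restart call

print(restart)
print([(s[0], s[4]) for s in active_statements(STATEMENTS, initial)])
print([(s[0], s[4]) for s in active_statements(STATEMENTS, restart)])
```

<p style="line-height: 100%;">With the initial string only the first
<tt>PARIN</tt> statement and the <tt>BINOUT</tt> statement are selected; with
the restart string the second <tt>PARIN</tt> statement, <tt>BININ</tt> and
<tt>BINOUT</tt> are selected.</p>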
211 | <p style="line-height: 100%;">For example, for the initial run |
---|
212 | the permanent file </p> |
---|
213 | <ul> |
---|
214 | <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre> |
---|
215 | </ul> |
---|
216 | <p style="line-height: 100%;">and for restart runs the permanent file<span |
---|
217 | style="font-family: monospace;"> </span></p> |
---|
218 | <ul style="font-family: monospace;"> |
---|
219 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
220 | style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre> |
---|
221 | </ul> |
---|
222 | <p style="line-height: 100%;">is used. Only with restart runs the |
---|
223 | local file <tt>BININ</tt> is made available as input file, because |
---|
224 | the appropriate file connection statement also contains the |
---|
225 | character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt> |
---|
226 | in the third column. This is logical and necessary, since BININ is expected to contain the |
---|
227 | binary data produced by the model in the preceding job of the chain, |
---|
228 | and the initial run does not need these |
---|
229 | data. The permanent names of this input file (local name BININ) and |
---|
230 | the corresponding output file (local name BINOUT) are identical and |
---|
231 | read </p> |
---|
232 | <ul> |
---|
233 | <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font |
---|
234 | style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d</font></pre> |
---|
235 | </ul> |
---|
236 | <p style="line-height: 100%;">However, after the file produced by the |
---|
237 | previous job has been read in by the model and the local file |
---|
238 | <tt>BINOUT</tt> has been produced at the end of the job, the |
---|
239 | restart job does not overwrite this permanent file (<tt>&hellip;/<font |
---|
240 | style="font-size: 10pt;" size="2">abcde_d3d</font></tt>) |
---|
241 | with the new data. Instead, when <b>mrun</b> copies the |
---|
242 | output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>), |
---|
243 | it checks whether a permanent file with the name |
---|
244 | <tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d</font></tt> |
---|
245 | already exists. |
---|
246 | If this is the case, <tt><font |
---|
247 | style="font-size: 10pt;" size="2">BINOUT</font></tt> |
---|
248 | is copied to the file |
---|
249 | <tt><font style="font-size: 10pt;" size="2"><font |
---|
250 | face="Cumberland, monospace">&hellip;/abcde_d3d.1</font></font></tt>. |
---|
251 | If this file |
---|
252 | also exists already, |
---|
253 | <tt><font |
---|
254 | style="font-size: 10pt;" size="2">&hellip;/abcde_d3d.2</font></tt> |
---|
255 | is tried, etc. For an input file, the highest existing cycle |
---|
256 | of the respective permanent file is copied. In the example above this |
---|
257 | means: the initial run creates the permanent file |
---|
258 | <tt><font style="font-size: 10pt;" size="2">&hellip;/abcde_d3d</font></tt>, |
---|
259 | the first restart run uses this file and creates |
---|
260 | <tt>&hellip;/<font |
---|
261 | style="font-size: 10pt;" size="2">abcde_d3d.1</font></tt>, |
---|
262 | the second restart run creates |
---|
263 | <tt><font style="font-size: 10pt;" |
---|
264 | size="2">&hellip;/abcde_d3d.2</font></tt>, |
---|
265 | etc. After completion of the job chain the user can still |
---|
266 | access all files created by the jobs. This makes it possible for the |
---|
267 | user for example to restart the model run of a certain job of the job |
---|
268 | chain again. </p> |
---|
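<p style="line-height: 100%;">The cycle numbering of permanent files can be
sketched in Python as follows. This is a simplified illustration, not the
code of <b>mrun</b>: the function names are hypothetical, and the set
<tt>existing</tt> stands in for the contents of the permanent directory so
that the sketch does not touch the file system.</p>

```python
import re


def cycle_of(name, base):
    """Return the cycle number of name relative to base, or None if unrelated."""
    if name == base:
        return 0
    match = re.fullmatch(re.escape(base) + r"\.(\d+)", name)
    return int(match.group(1)) if match else None


def input_file(existing, base):
    """Highest existing cycle of the permanent file (copied to BININ)."""
    cycles = {}
    for name in existing:
        cycle = cycle_of(name, base)
        if cycle is not None:
            cycles[cycle] = name
    return cycles[max(cycles)] if cycles else None


def output_file(existing, base):
    """First name among base, base.1, base.2, ... not yet in use (for BINOUT)."""
    if base not in existing:
        return base
    cycle = 1
    while f"{base}.{cycle}" in existing:
        cycle += 1
    return f"{base}.{cycle}"


# Simulate an initial run followed by two restart runs.
existing = set()
base = "abcde_d3d"
for run in ("initial run", "restart run 1", "restart run 2"):
    read = input_file(existing, base)    # None for the initial run
    written = output_file(existing, base)
    existing.add(written)
    print(run, "reads", read, "writes", written)
```

<p style="line-height: 100%;">The sketch follows the rule stated above that
the highest existing cycle is used as input, so in it the second restart run
reads the file with cycle number 1.</p>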
269 | <p style="line-height: 100%;">Therefore restart jobs can not only be |
---|
270 | started automatically through <b>mrun</b>, but also manually by the |
---|
271 | user. This is necessary e.g. whenever after the end of a job chain |
---|
272 | it is decided that the simulation must be continued further, because |
---|
273 | the phenomenon which should be examined did not reach the desired |
---|
274 | state yet. In such cases the <b>mrun</b> options completely |
---|
275 | correspond to those of the initial call; simply the <tt><font |
---|
276 | style="font-size: 10pt;" size="2">"#"</font></tt> characters in the |
---|
277 | arguments of options <tt><font style="font-size: 10pt;" size="2"><font |
---|
278 | face="Cumberland, monospace">-r</font></font></tt><font |
---|
279 | style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">, |
---|
280 | </font></font><tt><font style="font-size: 10pt;" size="2"><font |
---|
281 | face="Cumberland, monospace">-i</font></font></tt> |
---|
282 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> must be |
---|
283 | replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>. |
---|
284 | </p> |
---|
285 | <hr> |
---|
286 | <p style="line-height: 100%;"><br> |
---|
287 | <font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font |
---|
288 | color="#000080"><img src="left.gif" name="Grafik1" align="bottom" |
---|
289 | border="2" height="32" width="32"></font></a><a href="index.html"><font |
---|
290 | color="#000080"><img src="up.gif" name="Grafik2" align="bottom" |
---|
291 | border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font |
---|
292 | color="#000080"><img src="right.gif" name="Grafik3" align="bottom" |
---|
293 | border="2" height="32" width="32"></font></a></font></font></p> |
---|
294 | <p style="line-height: 100%;"><i>Last change: </i> 14/04/05 (SR)</p> |
---|
295 | </body> |
---|
296 | </html> |
---|