1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
---|
2 | <html><head> |
---|
3 | <meta http-equiv="CONTENT-TYPE" content="text/html; charset=windows-1252"><title>PALM |
---|
4 | chapter 3.3</title> <meta name="GENERATOR" content="StarOffice 7 (Win32)"> <meta name="AUTHOR" content="Marcus Oliver Letzel"> <meta name="CREATED" content="20040728;14053490"> <meta name="CHANGED" content="20041112;14150257"> <meta name="KEYWORDS" content="parallel LES model"> <style> |
---|
5 | <!-- |
---|
6 | @page { size: 21cm 29.7cm } |
---|
7 | --> |
---|
8 | </style></head> |
---|
9 | |
---|
10 | <body style="direction: ltr;" lang="en-US"><h3 style="line-height: 100%;">3.3 Initialization and restart |
---|
11 | runs</h3> |
---|
12 | <p style="line-height: 100%;">A job started by <b>mrun</b> |
---|
13 | will |
---|
14 | - according to its requested computing time, its memory size |
---|
15 | requirement and |
---|
16 | the number of necessary processing elements (on parallel computers) - |
---|
17 | be queued by the queuing system of the remote computer into a suitable |
---|
18 | job |
---|
19 | class which fulfills these requirements. Each job class permits only |
---|
20 | jobs with certain maximum requirements (e.g. |
---|
21 | the job class <tt><font style="font-size: 11pt;" size="2">cdev</font></tt> |
---|
22 | on the IBM Regatta "hanni" of the HLRN permits only jobs that request no more |
---|
23 | than 7200 seconds of computing time and use no more than |
---|
24 | 32 |
---|
25 | processing elements). The job classes are important for the scheduling |
---|
26 | process of the computer. Jobs with small requirements usually |
---|
27 | start |
---|
28 | very quickly, whereas jobs with higher requirements must wait longer (sometimes |
---|
29 | several days). </p> |
---|
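<p style="line-height: 100%;">The mapping from job requirements to a job class can be pictured with the following minimal sketch. It is not the scheduler of any real machine: only the limits of <tt>cdev</tt> are taken from the text above, the class <tt>big</tt> and the function name <tt>select_job_class</tt> are hypothetical, and the memory requirement is left out because no limits are given for it.</p>

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class JobClass:
    name: str
    max_cpu_seconds: int
    max_processing_elements: int


# Only the limits of "cdev" come from the text; "big" is a hypothetical
# class that exists here purely to illustrate the selection.
JOB_CLASSES = (
    JobClass("cdev", 7200, 32),
    JobClass("big", 86400, 256),  # hypothetical
)


def select_job_class(
    cpu_seconds: int,
    processing_elements: int,
    classes: Sequence[JobClass] = JOB_CLASSES,
) -> Optional[JobClass]:
    """Return the first class (ordered from small to large) that fits."""
    for job_class in classes:
        if (cpu_seconds <= job_class.max_cpu_seconds
                and processing_elements <= job_class.max_processing_elements):
            return job_class
    return None  # no class permits a job with these requirements
```

<p style="line-height: 100%;">In this sketch a job requesting 900 seconds on 8 processing elements falls into <tt>cdev</tt>, while a request of 7201 seconds is already pushed into the larger (hypothetical) class.</p>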
30 | <p style="line-height: 100%;">Before the start of a model |
---|
31 | run the user |
---|
32 | must estimate how much CPU time the model will need for the simulation. |
---|
33 | The necessary time in seconds has to be indicated with the mrun |
---|
34 | <b>option</b> <tt><a href="http://www.muk.uni-hannover.de/institut/software/mrun_beschreibung.html#Opt-t">-t</a></tt> |
---|
35 | and influences the job class into which the job is queued. Because |
---|
36 | the model usually uses a variable |
---|
37 | time step, the number of time steps to be executed and |
---|
38 | consequently the time needed by the model are not |
---|
39 | known at the beginning, so the required time can be estimated only very roughly in |
---|
40 | many cases. It may therefore happen that the model needs more time than |
---|
41 | indicated for the option <tt><u><font style="font-size: 10pt;" size="2">-t</font></u>,</tt> |
---|
42 | which normally leads to an abort of the job as soon as the available |
---|
43 | CPU time is consumed. In principle one could solve this problem by |
---|
44 | setting a very generously estimated value for <u><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-t</font></font></u>, |
---|
45 | but this may have the disadvantage that the queued job has |
---|
46 | to wait longer for execution.<br> |
---|
47 | </p><p style="line-height: 100%;">To avoid this |
---|
48 | problem <b>mrun </b>offers |
---|
49 | the possibility of so-called <b>restart runs</b>. During |
---|
50 | the model |
---|
51 | run PALM continuously examines how much time is left for the |
---|
52 | execution of the job. If the run cannot be completed |
---|
53 | before |
---|
54 | this time expires, the model stops shortly beforehand and writes the values |
---|
55 | of (nearly) all model variables in binary form to a file (local name |
---|
56 | <a href="chapter_3.4.html#BINOUT">BINOUT</a>). |
---|
57 | After copying the output files required by the user, <b>mrun</b> |
---|
58 | automatically starts a restart run. For this purpose a new <b>mrun</b> |
---|
59 | call is issued automatically on the user's local computer; <b>mrun</b> |
---|
60 | thus calls itself. The options of this call largely correspond |
---|
61 | to those which the user selected in the initial call of <b>mrun</b>. |
---|
62 | The model restarts, this time reading in the |
---|
63 | previously written binary data at the beginning and continuing the run with them. If |
---|
64 | the CPU time of this job is not sufficient to complete |
---|
65 | the run either, another restart run is started at the end of the job, and so on, |
---|
66 | until the time to be simulated by the model is reached. |
---|
67 | Thus a set of restart runs can develop, a so-called job chain. The |
---|
68 | first run of this chain (model start at t=0) is called the |
---|
69 | <b>initial run</b>. </p> |
---|
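<p style="line-height: 100%;">The development of a job chain can be illustrated with a small sketch. It is a toy model, not part of <b>mrun</b>: the function name <tt>run_job_chain</tt> and both of its parameters are assumptions made for illustration, and in practice the total CPU time needed is not known in advance.</p>

```python
from typing import List


def run_job_chain(total_cpu_seconds: int, usable_seconds_per_job: int) -> List[str]:
    """Return the sequence of runs needed to finish the simulation.

    total_cpu_seconds:      CPU time the whole simulation needs
                            (unknown in practice, assumed here).
    usable_seconds_per_job: CPU time one job can spend on time stepping
                            before it has to stop and write restart data.
    """
    if usable_seconds_per_job <= 0:
        raise ValueError("each job must be able to do some work")

    runs: List[str] = []
    remaining = total_cpu_seconds
    while True:
        # The first job of the chain is the initial run, all others are restarts.
        runs.append("initial run" if not runs else "restart run")
        remaining -= usable_seconds_per_job
        if remaining <= 0:
            return runs  # simulated end time reached, chain is complete
```

<p style="line-height: 100%;">For example, a simulation needing 2500 CPU seconds with 900 usable seconds per job results in an initial run followed by two restart runs.</p>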
70 | <p style="line-height: 100%;">Working with restart runs |
---|
71 | and their |
---|
72 | generation through <b>mrun</b> requires certain entries in |
---|
73 | the |
---|
74 | mrun-configuration file and in the parameter file, which are |
---|
75 | described and explained in the following. The configuration file must |
---|
76 | contain the following entries (example for the IBM Regatta of the |
---|
77 | HLRN): </p> |
---|
78 | <ul> <pre style="line-height: 100%;"><font style="font-size: 10pt;" size="2">%write_binary true restart</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#PARIN"><font style="font-size: 10pt;" size="2">PARIN</font></a><font style="font-size: 10pt;" size="2"> in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d</font><br><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font><br><a href="chapter_3.4.html#BININ"><font style="font-size: 10pt;" size="2">BININ</font></a><font style="font-size: 10pt;" size="2"> in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font><br><font style="font-size: 10pt;" size="2">#</font><br><a href="chapter_3.4.html#BINOUT"><font style="font-size: 10pt;" size="2">BINOUT</font></a><font style="font-size: 10pt;" size="2"> out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d</font></pre></ul> |
---|
79 | <p style="line-height: 100%;">The <b>mrun</b> |
---|
80 | call for the |
---|
81 | initial run of the job chain must look as follows: </p> |
---|
82 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">mrun -h ibmh -d abcde -t 900 -r "d3# restart"</font></pre></ul> |
---|
83 | <p style="line-height: 100%;">The specification of the |
---|
84 | environment |
---|
85 | variable <tt><font style="font-size: 10pt;" size="2">write_binary</font></tt>, which |
---|
86 | must be |
---|
87 | assigned the value <tt><tt><font style="font-size: 10pt;" size="2">true</font></tt></tt>, |
---|
88 | is essential. Only in this case does the model write |
---|
89 | binary-coded data for a possible restart run to the local file <tt><tt><a href="chapter_3.4.html#BINOUT">BINOUT</a></tt></tt> |
---|
90 | at the end of the run. This output file must then be saved |
---|
91 | as a permanent file by means of an appropriate file connection statement |
---|
92 | (last line of the example above). As you can see, both instructions |
---|
93 | (variable declaration and connection statements) are only carried out |
---|
94 | by <b>mrun</b>, if the character string <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
95 | is given for the option <tt><font style="font-size: 10pt;" size="2">-r</font> |
---|
96 | </tt>in the <span style="font-weight: bold;">mrun</span> |
---|
97 | call. Thus |
---|
98 | the example above can also be used |
---|
99 | if no restart runs are intended. In such cases the character string |
---|
100 | <tt><tt><font style="font-size: 10pt;" size="2">restart</font></tt></tt> |
---|
101 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
102 | can simply be omitted. </p> |
---|
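<p style="line-height: 100%;">How the strings given with <tt>-r</tt> switch entries of the configuration file on and off can be sketched as follows. This is a simplified reading of the example above and not the actual <b>mrun</b> parser: the function name <tt>active_entries</tt> is hypothetical, columns are assumed to be separated by whitespace, and a variable declaration without a third column is assumed to be always active.</p>

```python
from typing import Dict, List, Tuple

CONFIG = """\
%write_binary true restart
#
PARIN in:job:npe d3# ~/palm/current_version/JOBS/$fname/INPUT _p3d
PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df
BININ in:loc d3f ~/palm/current_version/JOBS/$fname/OUTPUT _d3d
#
BINOUT out:loc restart ~/palm/current_version/JOBS/$fname/OUTPUT _d3d
"""


def active_entries(
    config_text: str, r_option: str
) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Return (variables, file connections) activated by the -r argument."""
    keys = set(r_option.split())
    variables: Dict[str, str] = {}
    connections: List[Tuple[str, str, str]] = []

    for line in config_text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # empty line or comment line
        if fields[0].startswith("%"):
            # Variable declaration: %name value [activating string]
            condition = fields[2] if len(fields) > 2 else None
            if condition is None or condition in keys:
                variables[fields[0][1:]] = fields[1]
        else:
            # File connection statement with five columns; the third
            # column is compared with the strings given for -r.
            local_name, _attributes, condition, path, extension = fields
            if condition in keys:
                connections.append((local_name, path, extension))
    return variables, connections
```

<p style="line-height: 100%;">Called with <tt>"d3# restart"</tt>, the sketch sets <tt>write_binary</tt> and activates the statement for <tt>BINOUT</tt>; called with <tt>"d3#"</tt> alone, neither of the two takes effect.</p>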
103 | <p style="line-height: 100%;">Only if |
---|
104 | <tt><font style="font-size: 10pt;" size="2">write_binary=true</font></tt> |
---|
105 | is specified is the |
---|
106 | model |
---|
107 | instructed to compute the remaining CPU time after each time step and |
---|
108 | to stop shortly before this time expires if the run cannot be |
---|
109 | completed within |
---|
110 | it. More precisely, the stop takes place when the |
---|
111 | difference between the available job time (determined by the <b>mrun</b> |
---|
112 | option <tt><font style="font-size: 10pt;" size="2">-t</font></tt>) |
---|
113 | and |
---|
114 | the time used so far by the job becomes smaller than the time given |
---|
115 | by the model variable <a href="chapter_4.2.html#termination_time_needed">termination_time_needed</a>. |
---|
116 | With the variable <b>termination_time_needed </b>the user |
---|
117 | specifies |
---|
118 | how much time is needed for writing the binary data for restart |
---|
119 | runs, as |
---|
120 | well as for the subsequent data archiving and transfer of result data |
---|
121 | etc. (as long as this is part of the job). Thus, as soon as the |
---|
122 | remaining job time is less than <b>termination_time_needed</b>, |
---|
123 | the |
---|
124 | model stops |
---|
125 | the time step procedure and copies the data for a restart run to the |
---|
126 | local binary file BINOUT. The so-called initialization parameters are |
---|
127 | also written to this file (see <a href="chapter_4.0.html">chapter |
---|
128 | 4.0</a>). In a last step the model produces another file with the |
---|
129 | local name CONTINUE_RUN. The presence of this file signals to <b>mrun</b> |
---|
130 | that a restart run must be started, whereupon <b>mrun</b> starts |
---|
131 | an appropriate job. </p> |
---|
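<p style="line-height: 100%;">The stop criterion described above can be written down as a short sketch. It is not PALM source code: the function names and the list of per-step CPU costs are assumptions made for illustration, and only the comparison of the remaining job time with <b>termination_time_needed</b> is taken from the text.</p>

```python
from typing import Sequence, Tuple


def must_stop(job_time_limit: int, time_used: int,
              termination_time_needed: int) -> bool:
    """True if the remaining job time is less than termination_time_needed."""
    return (job_time_limit - time_used) < termination_time_needed


def time_step_loop(job_time_limit: int, termination_time_needed: int,
                   step_costs: Sequence[int]) -> Tuple[int, bool]:
    """Run time steps until the run is finished or the job must stop.

    step_costs holds the (assumed) CPU seconds consumed by each time step.
    Returns (number of steps done, restart needed).
    """
    time_used = 0
    steps_done = 0
    for cost in step_costs:
        time_used += cost
        steps_done += 1
        finished = steps_done == len(step_costs)
        # The check is made after each time step, as described in the text.
        if not finished and must_stop(job_time_limit, time_used,
                                      termination_time_needed):
            # Here the model would write BINOUT and create CONTINUE_RUN.
            return steps_done, True
    return steps_done, False
```

<p style="line-height: 100%;">With a job time of 900 seconds, <b>termination_time_needed</b> = 150 and time steps costing 100 seconds each, the loop stops after the eighth step, because only 100 seconds remain at that point.</p>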
132 | <p style="line-height: 100%;"><font color="#000000">During |
---|
133 | the initial |
---|
134 | phase of a restart run</font>, different actions are necessary than during the |
---|
135 | initial |
---|
136 | phase of an initial run of the model. In this |
---|
137 | case the model must read in the binary data written by the preceding |
---|
138 | run at the beginning of the run. In addition, it also reads the |
---|
139 | initialization parameters from this file. Therefore these do not need |
---|
140 | to be indicated in the parameter file (local name <a href="chapter_3.4.html#PARIN">PARIN</a>). |
---|
141 | If they are specified nevertheless and their values deviate from |
---|
142 | those of the initial run, the deviating values are ignored. There is |
---|
143 | exactly one exception to this rule: with the help of the |
---|
144 | initialization parameter <a href="chapter_4.1.html#initializing_actions">initializing_actions</a> |
---|
145 | it is determined whether the job is a restart run or an |
---|
146 | initial run. If <b>initializing_actions</b> = |
---|
147 | “<i>read_restart_data</i>”, then it is |
---|
148 | a restart |
---|
149 | run, otherwise an initial run. The previous remarks make it |
---|
150 | clear that the model obviously needs two different parameter files |
---|
151 | (local name PARIN) for the case of job chains. One is needed for the |
---|
152 | initial run and contains all initialization parameters set by |
---|
153 | the user, and the other one is needed for restart runs. The |
---|
154 | latter only needs to contain the initialization parameter |
---|
155 | <b>initializing_actions</b>, which |
---|
156 | must have the value “<i>read_restart_data</i>” |
---|
157 | (initialization parameters with values different from the initial run may also appear in |
---|
158 | this file, but they will be ignored). |
---|
159 | Therefore the user must provide two different parameter files in order |
---|
160 | to run job chains. Since the model always expects the |
---|
161 | parameter file in the local file <tt>PARIN</tt>, two |
---|
162 | different file |
---|
163 | connection statements must be given for this file in the |
---|
164 | configuration file. One must be active only for the initial run, |
---|
165 | the other one only for restart runs. The <b>mrun </b>call |
---|
166 | for the |
---|
167 | initial run shown above activates the first of the two |
---|
168 | specified connection statements, because the character string <tt><font style="font-size: 10pt;" size="2">d3#</font></tt> |
---|
169 | with the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> |
---|
170 | coincides with the character |
---|
171 | string in the third column of the connection statement. Obviously |
---|
172 | the next statement must be active</p> |
---|
173 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">PARIN in:job:npe d3f ~/palm/current_version/JOBS/$fname/INPUT _p3df</font></pre></ul> |
---|
174 | <p style="line-height: 100%;">with the restart runs. Given |
---|
175 | that <font color="#000000">this statement only becomes</font> |
---|
176 | active if the option <tt><font style="font-size: 10pt;" size="2">-r</font></tt> is given the value |
---|
177 | <tt><font style="font-size: 11pt;" size="2">d3f</font></tt> |
---|
178 | and that |
---|
179 | the <b>mrun</b> call for this restart run is produced |
---|
180 | automatically (thus not by the user), <b>mrun</b> |
---|
181 | obviously has to |
---|
182 | replace <tt><font style="font-size: 10pt;" size="2">"d3#"</font></tt> |
---|
183 | of the initial run with <tt><tt><font style="font-size: 10pt;" size="2">"d3f"</font> |
---|
184 | </tt></tt>within the call of this restart run. Actually, |
---|
185 | with restart |
---|
186 | runs all <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
187 | characters within the strings given for the options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace"> |
---|
188 | , </font></font><tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt> |
---|
189 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> |
---|
190 | are |
---|
191 | replaced by <tt><font style="font-size: 10pt;" size="2">“f”</font></tt>. |
---|
192 | </p> |
---|
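<p style="line-height: 100%;">The replacement rule can be sketched in a few lines. This is an illustration only, not the code used by <b>mrun</b>: the function name <tt>restart_options</tt> and the representation of the call as a dictionary of option names and arguments are assumptions.</p>

```python
from typing import Dict


def restart_options(options: Dict[str, str]) -> Dict[str, str]:
    """Return the options of a restart call derived from an initial call.

    All "#" characters in the arguments of -r, -i and -o are replaced
    by "f"; every other option is passed on unchanged.
    """
    replaced = dict(options)
    for name in ("-r", "-i", "-o"):
        if name in replaced:
            replaced[name] = replaced[name].replace("#", "f")
    return replaced


initial_call = {"-h": "ibmh", "-d": "abcde", "-t": "900", "-r": "d3# restart"}
restart_call = restart_options(initial_call)
```

<p style="line-height: 100%;">Applied to the initial call shown earlier, the argument of <tt>-r</tt> becomes <tt>"d3f restart"</tt>, while <tt>-h</tt>, <tt>-d</tt> and <tt>-t</tt> keep their values.</p>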
193 | <p style="line-height: 100%;">For example, for the initial |
---|
194 | run |
---|
195 | the permanent file </p> |
---|
196 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d</pre></ul> |
---|
197 | <p style="line-height: 100%;">and for restart runs the |
---|
198 | permanent file<span style="font-family: monospace;"> </span></p> |
---|
199 | <ul style="font-family: monospace;"> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/INPUT/abcde_p3df</font></pre></ul> |
---|
200 | <p style="line-height: 100%;">is used. Only for restart |
---|
201 | runs is the |
---|
202 | local file <tt>BININ</tt> made available as an input file, |
---|
203 | because |
---|
204 | the appropriate file connection statement also contains the |
---|
205 | character string <tt><font style="font-size: 10pt;" size="2">"d3f"</font></tt> |
---|
206 | in the third column. This is logical and necessary, since BININ is expected to contain the |
---|
207 | binary data produced by the model in the preceding job of the chain, |
---|
208 | and the initial run does not need these |
---|
209 | data. The permanent names of this input file (local name BININ) and |
---|
210 | the corresponding output file (local name BINOUT) are identical and |
---|
211 | read </p> |
---|
212 | <ul> <pre style="margin-bottom: 0.5cm; line-height: 100%;"><font style="font-size: 10pt;" size="2">~/palm/current_version/JOBS/abcde/OUTPUT/abcde_d3d</font></pre></ul> |
---|
213 | <p style="line-height: 100%;">However, after the file |
---|
214 | produced by the |
---|
215 | previous job has been read in by the model and the local file |
---|
216 | <tt>BINOUT </tt>has been produced at the end of the job, the |
---|
217 | restart job does not overwrite this permanent file (<tt>…/<font style="font-size: 10pt;" size="2">abcde_d3d</font></tt>) |
---|
218 | with the new data. Instead, when <b>mrun</b> copies the |
---|
219 | output file (<tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt>), |
---|
220 | it checks whether |
---|
221 | a permanent file with the name <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt> |
---|
222 | already exists. |
---|
223 | If this is the case, <tt><font style="font-size: 10pt;" size="2">BINOUT</font></tt> |
---|
224 | is copied to the file |
---|
225 | <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">…/abcde_d3d.1</font></font></tt>. |
---|
226 | If this file is also already present, <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt> |
---|
227 | is tried, and so on. For an input file the highest existing cycle |
---|
228 | of the respective permanent file is copied. In the example above this |
---|
229 | means: the initial run creates the permanent file |
---|
230 | <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d</font></tt><font style="font-size: 11pt;" size="2">,</font> |
---|
231 | the first restart run uses this file and creates <tt>…/<font style="font-size: 10pt;" size="2">abcde_d3d.1</font></tt>, |
---|
232 | the second restart run creates <tt><font style="font-size: 10pt;" size="2">…/abcde_d3d.2</font></tt><font style="font-size: 10pt;" size="2"> |
---|
233 | </font>etc. After completion of the job chain the user can still |
---|
234 | access all files created by the jobs. This allows the |
---|
235 | user, for example, to rerun a particular job of the job |
---|
236 | chain. </p> |
---|
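<p style="line-height: 100%;">The handling of file cycles described above can be sketched as follows. The function names are hypothetical and the set of existing permanent files is passed in as a plain list, so this is an illustration of the naming rule and not the implementation in <b>mrun</b>.</p>

```python
import re
from typing import Iterable, Optional


def cycle_of(name: str, base: str) -> Optional[int]:
    """Cycle number of a file: 0 for base, n for base.n, None otherwise."""
    if name == base:
        return 0
    match = re.fullmatch(re.escape(base) + r"\.(\d+)", name)
    return int(match.group(1)) if match else None


def next_output_name(base: str, existing: Iterable[str]) -> str:
    """Name under which a new output file is stored without overwriting."""
    present = set(existing)
    if base not in present:
        return base
    cycle = 1
    while f"{base}.{cycle}" in present:  # try .1, then .2, and so on
        cycle += 1
    return f"{base}.{cycle}"


def input_name(base: str, existing: Iterable[str]) -> Optional[str]:
    """Highest existing cycle of the permanent file, used as input."""
    best_name, best_cycle = None, -1
    for name in existing:
        cycle = cycle_of(name, base)
        if cycle is not None and cycle > best_cycle:
            best_name, best_cycle = name, cycle
    return best_name
```

<p style="line-height: 100%;">If <tt>abcde_d3d</tt> and <tt>abcde_d3d.1</tt> already exist, the sketch stores the next output as <tt>abcde_d3d.2</tt> and reads <tt>abcde_d3d.1</tt> as input.</p>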
237 | <p style="line-height: 100%;">Therefore restart jobs can |
---|
238 | not only be |
---|
239 | started automatically through <b>mrun</b>, but also |
---|
240 | manually by the |
---|
241 | user. This is necessary, e.g., when after the end of a job chain |
---|
242 | it is decided that the simulation must be continued, because |
---|
243 | the phenomenon under investigation has not yet reached the desired |
---|
244 | state. In such cases the <b>mrun</b> options |
---|
245 | completely |
---|
246 | correspond to those of the initial call; simply the <tt><font style="font-size: 10pt;" size="2">"#"</font></tt> |
---|
247 | characters in the |
---|
248 | arguments of options <tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-r</font></font></tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">, |
---|
249 | </font></font><tt><font style="font-size: 10pt;" size="2"><font face="Cumberland, monospace">-i</font></font></tt> |
---|
250 | and <tt><font style="font-size: 10pt;" size="2">-o</font></tt> |
---|
251 | must be |
---|
252 | replaced by <tt><font style="font-size: 10pt;" size="2">"f"</font></tt>. |
---|
253 | </p> |
---|
254 | <hr><p style="line-height: 100%;"><br> |
---|
255 | <font color="#000080"><font color="#000080"><a href="chapter_3.2.html"><font color="#000080"><img src="left.gif" name="Grafik1" align="bottom" border="2" height="32" width="32"></font></a><a href="index.html"><font color="#000080"><img src="up.gif" name="Grafik2" align="bottom" border="2" height="32" width="32"></font></a><a href="chapter_3.4.html"><font color="#000080"><img src="right.gif" name="Grafik3" align="bottom" border="2" height="32" width="32"></font></a></font></font></p><p style="line-height: 100%;"><i>Last change: </i> |
---|
256 | $Id: chapter_3.3.html 62 2007-03-13 02:52:40Z basit $</p> |
---|
257 | </body></html> |
---|