Version 13 (modified by maronga, 10 years ago)
Monitoring batch jobs: PALM watchdog
Since revision r1611, a batch job monitoring tool (watchdog) called palm_wd is available. It is based on Python 2.7 and Qt4. To use the watchdog, Qt and Python version 2.7 or higher must be installed on the local host as well as on each remote host to be monitored.
Configuration of the watchdog
The watchdog consists of two scripts, palm_wd (watchdog client to be run on the local host) and palm_wdd (server to be located on each remote host to be monitored). Before running the watchdog, both client and server require system-specific configuration:
- In palm_wd, create one entry per remote host in each of the three lists hostname, username and description, e.g.
hostname = ["hlogin.hlrn.de", "blogin.hlrn.de"]
username = ["nikname" , "nikname" ]
description = ["Hannover" , "Berlin" ]
Here, hostname is the host name or IP address of the remote host (a passwordless login via SSH key is assumed to be available), username is the user name on the remote host, and description is an arbitrary string to identify the host.
Additionally, the update_frequency can be adjusted:
update_frequency = 600*1000
- In palm_wdd, system-specific settings must be made. The default configuration is intended for the Cray-XC40 at HLRN-III and reads
cmd_readqueue = "showq | egrep "
cmd_tmpdir = "/gfs1/tmp/"
cmd_canceljob = "canceljob"
cmd_checkjob = "checkjob"
cmd_realname_grep = "AName"
cmd_starttime = "showstart"
cmd_starttime_grep = "start in"
For other hosts, the parameters above must be adjusted appropriately.
- Copy palm_wdd into the $HOME directory of each remote host, e.g. for HLRN-III:
scp palm_wdd nikname@hlogin.hlrn.de:
scp palm_wdd nikname@blogin.hlrn.de:
- Create database files for the watchdog in your working directory:
cp $PALM_BIN/palm_wd_files/.wd.olddata $HOME/palm/current_version
cp $PALM_BIN/palm_wd_files/.wd.newdata $HOME/palm/current_version
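The host configuration above can be sanity-checked with a small script before the watchdog is started. The following is only a sketch and not part of palm_wd: the helper names build_hosts and scp_command are hypothetical, and it assumes that the three lists are index-aligned and that palm_wdd is copied into the remote home directory.

```python
# Hypothetical helpers, not part of palm_wd: check the host lists and
# print the scp commands needed to deploy palm_wdd.

hostname = ["hlogin.hlrn.de", "blogin.hlrn.de"]
username = ["nikname", "nikname"]
description = ["Hannover", "Berlin"]


def build_hosts(hostname, username, description):
    """Combine the three index-aligned lists into one record per host."""
    if not (len(hostname) == len(username) == len(description)):
        raise ValueError(
            "hostname, username and description must have the same length")
    return [
        {"hostname": h, "username": u, "description": d}
        for h, u, d in zip(hostname, username, description)
    ]


def scp_command(host, script="palm_wdd"):
    """Return the argument list that copies the server script to the remote host."""
    # A trailing colon without a path makes scp copy into the remote home directory.
    return ["scp", script, "{}@{}:".format(host["username"], host["hostname"])]


if __name__ == "__main__":
    for host in build_hosts(hostname, username, description):
        print("{} -> {}".format(host["description"], " ".join(scp_command(host))))
```

The script only prints the commands; it does not run them, so it can be used without network access to check the lists for mismatched lengths.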
Running the watchdog
The watchdog can be started either by typing
palm_wd
into the shell or via mrungui (Start -> Start watchdog). A window (see screenshot) should then appear on the screen.
Documentation
The watchdog is largely self-explanatory. The following features, however, require a short description.
Progress bar
The progress bar displays the progress of each job that is currently running (status "Running"). The progress is calculated from the current simulation time and the end_time/restart_time of the job. PALM writes this information to the file PROGRESS, which resides in the temporary directory of the job.
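As an illustration of this calculation, the following sketch computes a progress fraction. It is an assumption and not the palm_wd code: the function name job_progress is hypothetical, the target time is taken to be the smaller of end_time and restart_time, and the layout of the PROGRESS file is not reproduced here.

```python
def job_progress(simulated_time, end_time, restart_time=None):
    """Return the job progress as a fraction between 0.0 and 1.0.

    Assumption: the run stops at restart_time if one is given and it is
    earlier than end_time; otherwise it stops at end_time.
    """
    target = end_time
    if restart_time is not None and restart_time < end_time:
        target = restart_time
    if target <= 0:
        raise ValueError("target time must be positive")
    # float() avoids integer division under Python 2; the result is
    # clamped so that an overshoot never leaves the range 0..1.
    return max(0.0, min(1.0, simulated_time / float(target)))


if __name__ == "__main__":
    # 1800 s simulated out of a 3600 s run
    print("{:.0%}".format(job_progress(1800.0, 3600.0)))
```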
Right mouse click: Show details
The information on the job provided by the queuing system is displayed.
Right mouse click: Cancel job
The job will be canceled immediately, without a confirmation prompt.
Right mouse click: Force stop
Unlike "Cancel job", this action makes the job terminate the run properly, i.e. data will be processed and, if applicable, restart binary data will be written.
Right mouse click: Force restart
The job will initiate a proper termination of the run, followed by an automatic restart of the job chain. This only works properly if the job has been configured for restart runs; otherwise this action is identical to "Force stop".
Status: "Completed."
A job with status "Completed." has not necessarily finished successfully. All jobs that have been removed from the queuing system (whether by cancelation, finishing, or crashing) are labeled as "Completed.".
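The labeling rule can be sketched as follows. This is an assumption about the logic and not the palm_wd source: the function name label_jobs and the data layout are hypothetical.

```python
def label_jobs(known_jobs, queued_jobs):
    """Return a status string for every known job.

    known_jobs:  iterable of job ids seen in earlier queue readings
    queued_jobs: dict mapping the job ids currently in the queue to the
                 status reported by the queuing system
    """
    statuses = {}
    for job_id in known_jobs:
        if job_id in queued_jobs:
            statuses[job_id] = queued_jobs[job_id]
        else:
            # No longer in the queue: canceled, finished, or crashed.
            # The label alone cannot tell these cases apart.
            statuses[job_id] = "Completed."
    return statuses


if __name__ == "__main__":
    print(label_jobs(["101", "102"], {"102": "Running"}))
```

Because the label is derived only from the job's absence from the queue, the outcome of a "Completed." job has to be checked separately.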