Monitoring batch jobs: PALM watchdog

Starting with revision r1611, a batch job monitoring tool (watchdog) called palm_wd is available. It is based on Python 2.7 and Qt4. To use the watchdog, Qt and Python 2.7 or higher must be installed on the local host as well as on each remote host to be monitored. Please note that the watchdog shows only those jobs that it has already captured, i.e. very short jobs and job protocol transfer jobs might not be shown.


Configuration of the watchdog

The watchdog consists of two scripts: palm_wd (the watchdog client, run on the local host) and palm_wdd (the server, located on each remote host to be monitored). Before running the watchdog, both client and server require system-specific configuration, which has to be provided in the configuration files .wd.config and .wdd.config:

  1. Make a copy of trunk/SCRIPTS/palm_wd_files/.wd.config.default, rename it to .wd.config, and move it to your local palm directory (e.g. ~/palm/current_version/). Then edit the file (e.g. here for HLRN-III):
    [Hannover]
    hostname=hlogin.hlrn.de
    username=<replace_by_your_remote_username>
    
    [Berlin]
    hostname=blogin.hlrn.de
    username=<replace_by_your_remote_username>
    
    [Settings]
    update_frequency=10
    

For each remote host to be monitored, create a separate section with a description of your choice (here "Hannover" and "Berlin"). hostname is the IP address or name of the remote host (assuming that a passwordless login via ssh key is available), and username is the user name on the remote host. update_frequency in the [Settings] section is the interval between automatic updates and must be given in minutes.
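
The layout of .wd.config can be checked with a few lines of Python. The following is only a minimal sketch, not the parser used by palm_wd: it assumes Python 3 and the standard configparser module, the function name read_wd_config is hypothetical, and "alice" is a placeholder user name.

```python
import configparser

# Sample contents of .wd.config, following the HLRN-III example.
# "alice" is a placeholder for the remote user name.
SAMPLE = """\
[Hannover]
hostname=hlogin.hlrn.de
username=alice

[Berlin]
hostname=blogin.hlrn.de
username=alice

[Settings]
update_frequency=10
"""


def read_wd_config(text):
    """Return (hosts, update_interval_in_seconds) from .wd.config contents."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    hosts = {}
    for section in parser.sections():
        if section == "Settings":
            continue  # every other section describes one remote host
        hosts[section] = (parser[section]["hostname"],
                          parser[section]["username"])
    # update_frequency is given in minutes; convert it to seconds.
    minutes = parser.getint("Settings", "update_frequency")
    return hosts, minutes * 60


hosts, interval = read_wd_config(SAMPLE)
for description, (hostname, username) in hosts.items():
    print(f"{description}: {username}@{hostname}")
print(f"update every {interval} s")
```

A missing hostname or username entry raises a KeyError here, which makes typos in a section easy to spot before starting the watchdog.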

  2. palm_wdd also requires system-specific configuration. Make a copy of trunk/SCRIPTS/palm_wd_files/.wdd.config.default for each host to be monitored, rename it to .wdd.config, and edit the files appropriately. For HLRN-III, .wdd.config reads:
    [Settings]
    readqueue="showq | egrep"
    tmpdir="/gfs1/tmp/"
    canceljob="canceljob"
    checkjob="checkjob"
    realname_grep="AName"
    starttime="showstart"
    starttime_grep="start in"
    

Because queuing systems differ between computing systems, it is not possible to provide detailed instructions on how to set this configuration. If you are struggling with the configuration, please feel free to create a new ticket.
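
When adapting .wdd.config to another system, it can help to check that no entry was dropped by accident. The sketch below is an illustration only. It assumes Python 3, and it assumes that the seven keys of the HLRN-III example are the complete set, which this page does not state. Whether palm_wdd itself strips the double quotes is likewise not documented here. The function name check_wdd_config is hypothetical.

```python
import configparser

# Keys that appear in the HLRN-III example; assumed to be the full set.
EXPECTED_KEYS = (
    "readqueue", "tmpdir", "canceljob", "checkjob",
    "realname_grep", "starttime", "starttime_grep",
)


def check_wdd_config(text):
    """Return (settings, missing_keys) for the contents of a .wdd.config."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Remove the surrounding double quotes used in the example file.
    settings = {key: value.strip('"')
                for key, value in parser["Settings"].items()}
    missing = [key for key in EXPECTED_KEYS if key not in settings]
    return settings, missing


# Deliberately incomplete sample: the two start-time entries are left out.
SAMPLE = '''\
[Settings]
readqueue="showq | egrep"
tmpdir="/gfs1/tmp/"
canceljob="canceljob"
checkjob="checkjob"
realname_grep="AName"
'''

settings, missing = check_wdd_config(SAMPLE)
print(settings["readqueue"])
print("missing keys:", missing)
```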

  3. Now copy palm_wdd and the configuration file into the $HOME directory of each remote host, e.g. for HLRN-III (here the local copy is named .wdd.config.hlrnIII and is stored as .wdd.config on the remote host):
    scp palm_wdd nikname@hlogin.hlrn.de:
    scp .wdd.config.hlrnIII nikname@hlogin.hlrn.de:.wdd.config
    scp palm_wdd nikname@blogin.hlrn.de:
    scp .wdd.config.hlrnIII nikname@blogin.hlrn.de:.wdd.config
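
With more than a couple of remote hosts, the scp command lines can be generated instead of typed. This is a minimal sketch, assuming Python 3.8 or newer (for shlex.join). The helper copy_commands is hypothetical and only builds the commands without running them.

```python
import shlex


def copy_commands(remote_user, hostnames, config_file=".wdd.config"):
    """Build scp commands that copy palm_wdd and its config to each host."""
    commands = []
    for host in hostnames:
        target = f"{remote_user}@{host}:"
        commands.append(["scp", "palm_wdd", target])
        # The configuration file must be named .wdd.config on the remote host.
        commands.append(["scp", config_file, target + ".wdd.config"])
    return commands


for argv in copy_commands("nikname",
                          ["hlogin.hlrn.de", "blogin.hlrn.de"],
                          ".wdd.config.hlrnIII"):
    print(shlex.join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) to perform the copy, provided the passwordless ssh login is set up.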
    

Running the watchdog

The watchdog can be started either by typing

palm_wd

into the shell, or via the palmrungui (Start -> Start watchdog). A window (see screenshot) should appear on the screen.

palm_wd screenshot

Documentation

The watchdog is to a large extent self-explanatory. The following features, however, might require a short description.

Progress bar

The progress bar displays the progress of each job that is currently running (status "Running"). The progress is calculated from the current simulation time and the end_time/restart_time of the job. PALM writes this information to the file PROGRESS, which resides in the temporary directory of the job.
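
The calculation amounts to a simple ratio. The sketch below is an assumption about how such a value could be computed, not the code used by palm_wd. It does not read the PROGRESS file, whose format is not described here, and it assumes that simulation time is counted from zero. The function name progress_percent is hypothetical.

```python
def progress_percent(current_time, target_time):
    """Percentage of the run completed, clamped to the range 0..100.

    target_time stands for end_time or restart_time, whichever the job
    will stop at. Both times are in seconds of simulated time.
    """
    if target_time <= 0:
        return 0.0
    fraction = current_time / target_time
    return 100.0 * min(max(fraction, 0.0), 1.0)


print(progress_percent(900.0, 3600.0))   # 25.0
print(progress_percent(4000.0, 3600.0))  # 100.0 (clamped)
```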

Right mouse click: Show details

Information on the job from the queuing system is displayed.

Right mouse click: Cancel job

The job will be canceled immediately, without asking for confirmation.

Right mouse click: Force stop

Unlike "Cancel job", this action makes the job initiate a proper termination of the run, i.e. data will be processed and, if applicable, restart binary data will be written.

Right mouse click: Force restart

The job will initiate a proper termination of the run, followed by an automatic restart of the job chain. This only works properly if the job has been configured for restart runs; otherwise this action is identical to "Force stop".

Status: "Completed."

A job with status "Completed." has not necessarily finished successfully. All jobs that have been removed from the queuing system (by cancellation, by finishing, or by crashing) are labeled "Completed.". The progress bar for completed jobs shows the last value obtained by the watchdog.

Tooltips on "Job" and "Remaining time"

When hovering over an item in the "Job" column, the queuing name is shown. When hovering over an item in the "Remaining time" column, the start time estimated by the queuing system is shown.

Last modified on Nov 20, 2018 5:19:37 PM
