Version 6 (modified by maronga, 10 years ago)


PALM watchdog

Since revision r1611, a batch job monitoring tool (watchdog) called palm_wd has been available. It is based on Python 2.7 and Qt4.


Configuration of the watchdog

The watchdog consists of two scripts: palm_wd, the watchdog client, which runs on the local host, and palm_wdd, the server, which must be placed on each remote host to be monitored. Before the watchdog can be used, both client and server require system-specific configuration:

  1. In palm_wd, create one entry for each remote host in each of the three lists hostname, username and description, e.g.
    hostname     = ["hlogin.hlrn.de", "blogin.hlrn.de"]
    username     = ["nikname"       , "nikname"       ]
    description  = ["Hannover"      , "Berlin"        ]
    
    Here, hostname is the address (host name or IP address) of the remote host, username is the user name on the remote host, and description is an arbitrary string that identifies the host. A passwordless ssh login (via ssh key) to each remote host is assumed.

    Additionally, the update interval can be set via update_frequency; the default is
    update_frequency = 600*1000
  2. In palm_wdd, the system-specific settings must be made. The defaults are configured for the Cray XC40 at HLRN-III and read:
    cmd_readqueue      = "showq | egrep "
    cmd_tmpdir         = "/gfs1/tmp/"
    cmd_canceljob      = "canceljob"
    cmd_checkjob       = "checkjob"
    cmd_realname_grep  = "AName"
    cmd_starttime      = "showstart"
    cmd_starttime_grep = "start in"
    
    For other hosts, these parameters must be adapted to the commands and output format of the local batch system.
  3. Copy palm_wdd into the $HOME directory on each remote host, e.g. for HLRN-III:
    scp palm_wdd nikname@hlogin.hlrn.de:
    scp palm_wdd nikname@blogin.hlrn.de:
    
  4. Create the watchdog database files in your working directory (here $HOME/palm/current_version):
    cp $PALM_BIN/palm_wd_files/.wd.olddata $HOME/palm/current_version
    cp $PALM_BIN/palm_wd_files/.wd.newdata $HOME/palm/current_version
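
The three lists in palm_wd are parallel: the i-th entries of hostname, username and description describe the same remote host, which is reached via ssh as username@hostname. The following sketch is not taken from palm_wd; host_targets, update_minutes, scp_commands and deploy are hypothetical helpers. It assumes that update_frequency is given in milliseconds (the unit Qt timers use), so the default 600*1000 would mean an update every 10 minutes. It also uses the scp convention that a target of the form user@host: (colon, empty path) refers to the remote home directory.

```python
import subprocess

# Settings as in palm_wd (example values from this page).
hostname         = ["hlogin.hlrn.de", "blogin.hlrn.de"]
username         = ["nikname"       , "nikname"       ]
description      = ["Hannover"      , "Berlin"        ]
update_frequency = 600*1000  # assumption: milliseconds


def host_targets(hostname, username, description):
    """Pair the parallel lists into (description, "user@host") tuples."""
    if not (len(hostname) == len(username) == len(description)):
        # A missing or extra entry would pair hosts with the wrong user
        # name or description, so mismatched lists are rejected.
        raise ValueError("hostname, username and description must have "
                         "the same number of entries")
    return [(desc, "{0}@{1}".format(user, host))
            for host, user, desc in zip(hostname, username, description)]


def update_minutes(frequency_ms):
    """Convert an update_frequency value (assumed milliseconds) to minutes."""
    return frequency_ms / 60000.0


def scp_commands(script, targets):
    """Build one scp call per host; the trailing ':' means the remote $HOME."""
    return [["scp", script, target + ":"] for _, target in targets]


def deploy(script="palm_wdd", dry_run=True):
    """Print (and, unless dry_run is set, run) the scp commands for all hosts."""
    targets = host_targets(hostname, username, description)
    for cmd in scp_commands(script, targets):
        print(" ".join(cmd))
        if not dry_run:
            # Relies on the passwordless ssh login that palm_wd also needs.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    print(update_minutes(update_frequency))  # prints 10.0
    deploy(dry_run=True)
```

With dry_run=True (the default), the scp commands are only printed, so they can be checked before anything is copied.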
    

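Because palm_wdd calls the configured batch commands on the remote host, it can help to check that they exist there before starting the watchdog. The sketch below is not part of PALM: commands_in, on_path and check_settings are hypothetical names, and it is meant to be run with Python on the remote host. It assumes that cmd_readqueue, cmd_canceljob, cmd_checkjob and cmd_starttime each hold a command or a shell pipeline. cmd_realname_grep and cmd_starttime_grep are search patterns rather than commands, so they are not checked.

```python
import os

# Default settings from palm_wdd (Cray XC40 at HLRN-III).
cmd_readqueue = "showq | egrep "
cmd_tmpdir    = "/gfs1/tmp/"
cmd_canceljob = "canceljob"
cmd_checkjob  = "checkjob"
cmd_starttime = "showstart"


def commands_in(cmd):
    """Return the program names used in a (possibly piped) command string."""
    return [part.split()[0] for part in cmd.split("|") if part.strip()]


def on_path(program):
    """Return True if an executable called `program` is found on $PATH."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries for simplicity
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def check_settings(commands, tmpdir):
    """Return a list of problems found with the palm_wdd settings."""
    problems = []
    for cmd in commands:
        for program in commands_in(cmd):
            if not on_path(program):
                problems.append("command not found: " + program)
    if not os.path.isdir(tmpdir):
        problems.append("directory not found: " + tmpdir)
    return problems


if __name__ == "__main__":
    for problem in check_settings(
            [cmd_readqueue, cmd_canceljob, cmd_checkjob, cmd_starttime],
            cmd_tmpdir):
        print(problem)
```

Empty output means that all commands and the temporary directory were found on that host.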
Running the watchdog

The watchdog can be started either by typing

palm_wd

in a shell or via mrungui (Start -> Start watchdog). A window (see screenshot) should then appear.

palm_wd screenshot
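
palm_wd needs the two database files created during configuration. As a sketch (ensure_db_files and start_watchdog are hypothetical helpers, not part of PALM), the two cp commands can be replaced by a copy that skips files which already exist, so that data the watchdog has already written is not overwritten. That palm_wd is on $PATH and reads its database files from the directory it is started in is an assumption.

```python
import os
import shutil
import subprocess

DB_FILES = [".wd.olddata", ".wd.newdata"]


def ensure_db_files(workdir, palm_bin):
    """Copy missing watchdog database files from $PALM_BIN/palm_wd_files.

    Existing files are left untouched. Returns the names of copied files.
    """
    copied = []
    for name in DB_FILES:
        target = os.path.join(workdir, name)
        if not os.path.exists(target):
            shutil.copy(os.path.join(palm_bin, "palm_wd_files", name), target)
            copied.append(name)
    return copied


def start_watchdog(workdir):
    """Launch palm_wd with workdir as its working directory (assumption:
    palm_wd is on $PATH and looks for its database files there)."""
    return subprocess.Popen(["palm_wd"], cwd=workdir)
```

For example, ensure_db_files(os.path.expanduser("~/palm/current_version"), os.environ["PALM_BIN"]) has the same effect as the two cp commands in the configuration steps, except that existing files are kept.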
