BEAR Service Group
16 September 2008

Present
Paul Hatton (IT Services) [PSH]
Alan Reed (IT Services) [AR]
Aslam Ghumra (IT Services) [AKG]
Jon Hunt (IT Services) [JBH]
Lawrie Lowe (Physics) [LL]

Apologies
None

Introduction

Notes from these meetings will be a concise summary of issues and actions. They will not detail the full discussions that took place in the meetings, nor will the order of issues and actions necessarily reflect the order of discussion.

Ongoing Actions and Matters Arising from previous meetings

  1. Action JBH: investigate Citrix to supply a Windows-based Matlab service
    JBH has discussed with PSH the options for a general release of Matlab on Windows with access to the Microsoft HPC Cluster. A user service based on Citrix may be feasible; JBH is talking to Nick Foley, who runs the Citrix service for Finance, about this.
  2. Action LL: see if a minimum memory requirement can be specified to MOAB rather than torque
    LL said that a minimum memory requirement can be specified in torque, which would prevent jobs that do not request the additional memory from running on the nodes that have it. AR would prefer to configure this in MOAB and said that it can also be controlled with a qsub filter script, although LL had concerns about side-effects of qsub filters and about the alternative of us providing a replacement qsub script.
    PSH has submitted a job asking for 1 node, 4 cores and pvmem=10gb (see the sketch below), which should run on one of the added-memory nodes, but it just appears to queue.
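    For reference, a submission of the kind PSH describes might look like the following sketch (the script name is illustrative; the resource syntax is standard torque):

        # request 1 node with 4 cores and 10 GB of virtual memory per process
        qsub -l nodes=1:ppn=4 -l pvmem=10gb myjob.sh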
  3. Action LL: look at poor graphics performance from the slave nodes.
    PSH noticed that the response from graphical applications run on a slave node via qsub -IX is noticeably worse than when run on a login node. LL has found that Matlab suffers from slow window redraws (around 10 seconds) when run in this way. This has been mentioned on a torque users' list in the past; LL re-posted the question but has had no reply yet. nedit did not show this behaviour, so it may be Matlab-specific. LL will see if Abaqus/CAE shows the same behaviour.
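    The interactive test can be reproduced along these lines (the resource request shown is an assumption):

        # request an interactive session with X11 forwarding on a slave node
        qsub -I -X -l nodes=1:ppn=1
        # then, in the resulting shell, start a graphical application, e.g.
        matlab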
  4. Action LL/AR:  refer any scheduling issues to ClusterResources
    AR has an experimental release of MOAB which addresses the issue with jobs not running when released. We will not implement this until it is a full release from ClusterResources.
  5. Action LL: collate and circulate requirements for grid-based access to the cluster
    AR has reserved 16 nodes whilst we are implementing this; this could grow depending on demand.
  6. Action AR: collate suggestions for handling shared project space and set up a dummy shared project so that these ideas can be tested
    There have been several suggestions as to how shared project quotas can be handled. The options will be explored before releasing a service. AR has set up a separate filesystem (/projects) that can be used for this, although at present it is used for tests on the backup system. PSH circulated a summary of notes he had made in discussion with AR some time ago, about which LL had concerns; this action is first to collate all such suggestions for discussion and action.
  7. Action AR: arrange for system-level housekeeping on /scratch
    LL has confirmed that /scratch does record last-accessed times and has suggested a cron job to run housekeeping on /scratch; AR will implement this (a sketch follows below). We will notify users prior to the first run.
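    A minimal sketch of such a cron job, assuming a 90-day retention period (no period has yet been agreed) and an /etc/cron.d entry:

        # /etc/cron.d/scratch-clean (path and schedule assumed)
        # weekly: delete files under /scratch not accessed for 90 days
        0 3 * * 0  root  find /scratch -xdev -type f -atime +90 -delete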
  8. Action AR: continue discussions with Cluster Resources about trapping jobs that ask for an invalid ppn
    LL suggested that any job that will never run should be rejected rather than queuing forever; AR will submit this as an enhancement request.
  9. Action AR: remove mshow from general user access
    The mshow and showq commands list every user's jobs and are generally available. It was agreed earlier in the service that users should only be able to view their own jobs, so these commands should not be generally available. AR questioned whether we should be using the torque-level commands such as qstat at all, or only the MOAB-level ones such as mshow. The problem with commands such as qdel not recognising a job number may be due to domain issues; JBH said that in his experience DNS issues can have many manifestations. LL said that any command can be hidden (a sketch follows below).
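    One way to hide a command, as LL notes, is to restrict its filesystem permissions; a sketch, with the install paths and administrative group both assumed:

        # restrict mshow and showq to members of an admin group
        chgrp hpcadmin /opt/moab/bin/mshow /opt/moab/bin/showq
        chmod 750 /opt/moab/bin/mshow /opt/moab/bin/showq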
  10. Action AR: add directory for utilities under appmaint's control to the default PATH
    AR has set up /usr/local/bham (which, despite the name, is available across the cluster), owned by appmaint, for utilities such as Xfe (the graphical file manager) that should not need a module load command. The bin subdirectory needs to be on all users' default PATH (see the sketch below).
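    This could be done with a profile fragment along these lines (the file name is an assumption):

        # /etc/profile.d/bham.sh: put the appmaint-managed utilities
        # on every user's default search path
        export PATH=$PATH:/usr/local/bham/bin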
  11. Action AR/AKG: consider mechanism for de-registering users
    We know of two users who have left; their accounts can be used to build experience of de-registering users.
  12. Standing Action AR/AKG: Report any hardware/software faults that directly impact the service
    1. AR has visited John Veitch in Physics, who is submitting many thousands of jobs. He now has a cron job that checks how many of his jobs are in the system and submits the next batch when most have completed (a sketch of this approach follows below). Some of these jobs also appear to be locked because they cannot see the .bashrc file, which is probably a GPFS issue.
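      A sketch of the throttling approach described (the username, threshold and batch script are all illustrative):

          # count this user's jobs still in the system (username assumed)
          njobs=$(qstat -u jveitch | grep -c jveitch)
          # submit the next batch once most have completed (threshold assumed)
          if [ "$njobs" -lt 50 ]; then
              ./submit_next_batch.sh
          fi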
    2. Action AR: prevent cron jobs on worker nodes
      AR has locked out cron jobs on the logon nodes; LL said that this should also be done on the worker nodes (a sketch follows below).
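      On the worker nodes this can be done with the standard cron access files; a sketch, assuming only root should retain access:

          # if /etc/cron.allow exists, only the users listed in it may use
          # crontab; an allow file containing only root locks everyone else out
          echo root > /etc/cron.allow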
    3. AKG said that offlined nodes still appear to accept batch jobs; we do not understand why this happens.
    4. A brief campus-wide power outage on Friday resulted in the APC power strips on the unprotected supply losing contact with the machine-room circuit breakers. These strips are designed to remain off when power is restored to them. The network switches are connected to these strips, so AR had to bring each of them up manually. We are expecting a visit from Estates on Thursday to look at this. PSH will raise with Clustervision the issue of why the cluster was set up with this failure mode.
      Action AR: discuss options to have the switches on the protected power supply
      Action PSH: discuss cluster setup with Clustervision
      Resilience would be improved if the switches were on the protected supply; AR will talk to Kul Gill about this
    5. One of the network switches keeps giving port errors; this doesn't prevent u4 from being used, but it should not be happening. Clustervision have found an error on this switch which is also present on other switches.
    6. Action AR: raise concern with Clustervision about DHCP dying
      We had DHCP issues: DHCP was not running on filer1 or filer2. If DHCP is under the control of HA (High Availability), why wasn't this trapped? This was resolved by restarting DHCP on filer001.
    7. AR said that a recent backup had not started because some processes that had been running on filer001 were now running on filer002; we are not sure why this happened.
  13. Action PSH: arrange conference call with ClusterResources as required
    ClusterResources have indicated their willingness to take part in a conference call to discuss any outstanding issues.
  14. Action PSH: contact potential Matlab user (Jihong Wang in Elec Eng)
    PSH has contacted and will visit Jihong to advise on using both his own local license server and the central one.
  15. Action PSH: finalise the allinea ddt configuration
    Action PSH: produce help page for allinea ddt, and opt when available
    The debugger allinea ddt has been installed and PSH/MM have been looking at the configuration. MM was working on the configuration files for the optimiser allinea opt and on integrating mpiexec with the optimiser, which expects ssh access. MM was in discussion with Allinea about this, including having a logon on their system, but there are some bugs that have not been fixed in the current release despite assurances from Allinea. AR said that Bristol are also pursuing this and we are probably best to await developments driven by Bristol.
    PSH and MM were working on the configuration of the debugger, which PSH hopes can soon be released.
  16. Action ALL: send PSH suggestions for parallel programming web resources
    Links to parallel programming resources, such as help pages, tutorials and courses, are available on the BEAR help site in the 'Parallel programming' section.
  17. Action ALL: suggest example programs and scripts to PSH
    Action PSH: make these available to users
    A set of simple example programs and scripts, for example for parallel programming, would be helpful to users.
  18. Standing Action ALL: discuss any user issues
    1. A helpdesk call has been logged about recovering accidentally-deleted files; AKG is working on this.
    2. AKG is also clarifying the procedure with the helpdesk for re-directing calls when he is on holiday - this didn't happen during his recent break.
  19. Standing action LL/AR: present and discuss user statistics
    None tabled this week

Any Other Business

  1. Action AR/AKG: send PSH details of outstanding calls
    PSH asked for a summary of outstanding calls with Clustervision and ClusterResources, since he was concerned that we have an excessive number of such calls.
  2. Action LL: advise on how to introduce a delay between job submissions per user and a limit on the number of queued jobs/user ...
    Action AR: ... and implement this
    We do not know whether it was the number of jobs, the rate of job submission or some other factor that caused the recent problems with job submission. AR asked if the number of queued jobs per user should be set to a high limit. LL said that we can specify a global qsub parameter to introduce a delay between job submissions per user, and that we can set a per-user limit on the number of queued jobs (a sketch of the latter follows below).
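    The per-user limit can be set at the torque level with qmgr; a sketch, with the queue name and limit value assumed (the submission-delay parameter LL mentioned is not shown here):

        # limit each user to at most 500 queued jobs in queue batch
        qmgr -c 'set queue batch max_user_queuable = 500'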

Completed actions from the previous meeting and Items of Information

  1. JBH has contacted Oxford about limiting the number of cores that an interactive Matlab job can get on the MS cluster. They have said that this cannot be done. PSH was talking to Mathworks at the e-Science All Hands meeting last week, who may be able to provide a not-very-satisfactory workaround.
  2. Hummingbird visited on 12 September and set up a demonstration system which has an X proxy running on PSH's Scientific Linux machine that allows X sessions to be preserved across logons. It also cuts down the traffic to the desktop.

Next meetings:

10.00 every second Tuesday in the Elms Road Demo Room unless notified of any changes.