BEAR Service Group
16 September 2008
Present
Paul Hatton (IT Services) [PSH]
Alan Reed (IT Services) [AR]
Aslam Ghumra (IT Services) [AKG]
Jon Hunt (IT Services) [JBH]
Lawrie Lowe (Physics) [LL]
Apologies
None
Introduction
Notes from these meetings will be a concise summary of issues and actions.
They will not detail the full discussions that took place in the meetings, nor
will the order of issues and actions necessarily reflect the order of
discussion.
Ongoing Actions and Matters Arising from previous meetings
- Action JBH: investigate Citrix to supply a Windows-based Matlab
service
JBH has discussed options for a general release of Matlab
on Windows, with access to the Microsoft HPC Cluster, with PSH. A user
service based on Citrix may be feasible; JBH is talking to Nick Foley, who
runs the Citrix service for Finance, about this.
- Action LL: see if a
minimum memory requirement can be specified to MOAB rather than torque
LL said that a minimum memory requirement can be specified in torque,
which would prevent jobs that do not request the additional memory from
running on the nodes that have it. AR would prefer to configure this in
MOAB, and said that it can also be controlled there with a qsub filter
script, although LL had concerns about side-effects of qsub filters and
about the alternative of us providing a replacement qsub script.
PSH has submitted a job asking for 1 node, 4 cores and pvmem=10gb, which
should run on one of the added-memory nodes, but it just appears to queue.
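For reference, PSH's test submission might be sketched as the following
torque/PBS script (the walltime and program name are illustrative
assumptions; only the nodes, ppn and pvmem requests are from the minutes):

```shell
#!/bin/bash
# Sketch of a torque job requesting per-process virtual memory.
# Only the resource requests below reflect PSH's test; the rest is
# an illustrative assumption.
#PBS -l nodes=1:ppn=4       # 1 node, 4 cores
#PBS -l pvmem=10gb          # per-process virtual memory; should route
                            # the job to an added-memory node
#PBS -l walltime=01:00:00   # assumed walltime
cd $PBS_O_WORKDIR
./my_program                # hypothetical executable
```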
- Action LL: look at
poor graphics performance from the slave nodes.
PSH noticed that the response from graphical applications run on a
slave node via qsub -IX is noticeably worse than when run on a login
node. LL has found that Matlab suffers from slow - around 10 seconds -
window redraws when run in this way. This has been mentioned on a torque
users' list in the past; LL re-posted the question but has had no reply yet.
nedit didn't show this, so it may be Matlab-specific. LL will see if
Abaqus/CAE shows the same behaviour.
-
Action LL/AR: refer any
scheduling issues to
ClusterResources
AR has an experimental release of MOAB which
addresses the issue with jobs not running when released. We will not
implement this until it is a full release from ClusterResources.
- Action LL: collate and circulate requirements for grid-based access to the cluster
AR has reserved 16 nodes
whilst we are implementing this; this could grow depending on demand.
- Action AR: collate
suggestions for handling shared project space and set up a dummy shared
project so that these ideas can be tested
There have been several suggestions as to how shared project quotas can be
handled. The options will be explored before releasing a service. AR has
set up a separate filesystem (/projects) that can be used for this,
although at present it is used for tests on the backup system. PSH
circulated a summary of notes he had made in discussion with AR some time
ago, about which LL had concerns; this action is first to collate all such
suggestions for discussion/action.
- Action AR: arrange
for system-level housekeeping on /scratch
LL has confirmed that /scratch does show the last accessed time
and has suggested a cron job to run housekeeping on /scratch. AR will
implement this. We will notify users prior to the first use of this.
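A minimal sketch of such a housekeeping job, assuming a simple age-based
policy (the 60-day threshold, script path and schedule are assumptions,
not an agreed policy):

```shell
# Sketch of the proposed /scratch housekeeping, assuming an age-based
# policy; the 60-day threshold is an assumption, not an agreed value.
# Intended to be run from root's crontab, e.g.:
#   30 2 * * * /usr/local/sbin/scratch-clean.sh
SCRATCH="${SCRATCH:-/scratch}"
DAYS="${DAYS:-60}"
# -atime uses the last-accessed time, which LL confirmed /scratch records
if [ -d "$SCRATCH" ]; then
  find "$SCRATCH" -type f -atime +"$DAYS" -delete
fi
```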
- Action AR: continue
discussions with Cluster Resources about trapping jobs that ask for an
invalid ppn
LL suggested that any job that will never run should
be trapped rather than queuing forever; AR will submit this as an
enhancement request.
- Action AR: remove mshow from general user access
The mshow and showq commands list every user's jobs and are generally
available. It was agreed earlier in the service that users should only be
able to view their own jobs, so these commands should not be generally
available.
AR questioned whether we should be using the torque-level commands such as
qstat at all or if we should only use the MOAB-level ones such as
mshow. The problem with commands such as qdel not recognising a job
number may be due to domain issues; JBH said that in his experience DNS
issues can have many manifestations. LL said that any command can be hidden.
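One minimal way to hide such commands, as LL suggests, is by file
permissions (the install paths and the admin group name here are
assumptions for illustration, not the cluster's actual layout):

```shell
# Sketch: withdraw mshow/showq from general users via permissions.
# Paths and the 'hpcadmin' group are hypothetical examples.
chgrp hpcadmin /usr/local/bin/mshow /usr/local/bin/showq
chmod 750 /usr/local/bin/mshow /usr/local/bin/showq
```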
- Action AR: add directory for utilities under appmaint's control to
the default PATH
AR has set up /usr/local/bham, which despite the name is available
across the cluster, owned by appmaint for utilities such as Xfe (the
graphical file manager) which shouldn't need a module load command.
The bin subdirectory needs to be available to all users by default.
- Action AR/AKG:
consider mechanism for de-registering users
We know of 2 users who have left, whose accounts can be used to build
experience of de-registering users.
- Standing Action
AR/AKG: Report any hardware/software faults that directly impact the
service
- AR has visited John Veitch in Physics who is submitting many
thousands of jobs. He now has a cron job that checks how many of his
jobs are in the system and submits the next batch when most have
completed. Some of these jobs also appear to be locked due to them not
being able to see the .bashrc file, which is probably a GPFS issue.
- Action AR: prevent cron jobs on worker nodes
AR has locked out cron jobs on the logon nodes; LL said
that this should also be done on the worker nodes.
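One standard mechanism for this lockout is the cron.allow file (see
crontab(1)): if it exists, only the users listed in it may install
crontabs. Whether this matches how AR locked down the logon nodes is an
assumption; a sketch for a worker node:

```shell
# Sketch: restrict crontab use to root on a worker node.
# When /etc/cron.allow exists, only listed users may install crontabs;
# all other users are refused by the crontab command.
echo root > /etc/cron.allow
chmod 644 /etc/cron.allow
```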
- AKG said that offlined nodes still appear to accept batch jobs; we
do not understand why this happens.
- A brief campus-wide power outage on Friday resulted in the APC
power strips on the unprotected supply losing contact with the machine
room circuit breakers. These strips are designed to remain off when
power is restored to them. The network switches are connected to these
strips, so AR had to bring each of them up manually. We
are expecting a visit from estates on Thursday to look at this.
PSH will raise with Clustervision the issue of why the cluster was set up
with this failure mode.
Action AR: discuss options to have the switches on the protected
power supply
Action PSH: discuss cluster setup with Clustervision
Resilience would be improved if the switches were on the protected supply;
AR will talk to Kul Gill about this.
- One of the network switches keeps giving port errors; this does not
prevent u4 being used but should not be happening. Clustervision have
found an error on this switch which is also present on other switches.
- Action AR: raise concern with Clustervision about DHCP dying
We had DHCP issues: it was not running on filer1 or filer2. If DHCP
is under the control of HA (High Availability), why wasn't this trapped?
The problem was resolved by restarting DHCP on filer001.
- AR said that a recent backup had not started due to some processes
that had been running on filer001 now running on filer002 - we are not
sure why this has happened.
- Action PSH: arrange
conference call with ClusterResources as required
ClusterResources have indicated their willingness to take part in
a conference call to discuss any outstanding issues.
- Action PSH: contact
potential Matlab user (Jihong Wang in Elec Eng)
PSH
has contacted and will visit Jihong to advise on using both his own local license server and
the central one.
- Action PSH: finalise
the allinea ddt configuration
Action PSH: produce help page for allinea ddt, and opt when available
The debugger allinea ddt has been installed and PSH/MM have been looking
at the configuration, which PSH hopes can soon be released. MM was working
on the configuration files for the optimiser allinea opt and on
integrating mpiexec with the optimiser - it expects ssh access. MM was in
discussion with Allinea about this, including having a logon on their
system, but there are some bugs that have not been fixed in the current
release despite assurances from Allinea. AR said that Bristol are also
pursuing this and we are probably best to await developments driven by
Bristol.
- Action ALL: send PSH
suggestions for parallel programming web resources
Links to parallel programming resources, such as help pages,
tutorials and courses, are available on the BEAR help site in the
'Parallel programming' section.
- Action ALL: suggest
example programs and scripts to PSH
Action PSH: make these available to users
A set of simple example programs and scripts, for example for parallel
programming, would be helpful to users.
- Standing Action ALL:
discuss any user issues
- A helpdesk call has been logged about recovering
accidentally-deleted files; AKG is working on this
- AKG is also clarifying the procedure with the helpdesk for
re-directing calls when he is on holiday - this didn't happen during his
recent break.
- Standing action
LL/AR: present and discuss user statistics
None tabled this week
Any Other Business
- Action AR/AKG: send PSH details of outstanding calls
PSH asked for a summary of outstanding calls with Clustervision and
ClusterResources, since he was concerned that we have an excessive number
of such calls.
- Action LL: advise on how to introduce a delay between job
submissions per user and a limit on the number of queued jobs/user ...
Action AR: ... and implement this
We do not know if it was the number of jobs, the rate of job
submission or some other factor that has given recent problems with the job
submission. AR asked if the number of queued jobs per user should be set to
a high limit. LL said that we can specify a global qsub parameter to
introduce a delay between job submissions per user, and that we can set a
per-user limit on the number of queued jobs.
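For the per-user limit, torque exposes a queue attribute that can be set
with qmgr; a sketch, assuming a queue named "batch" and an illustrative
limit (neither is an agreed value):

```shell
# Sketch: cap how many jobs each user may have queued at once.
# Queue name and limit are assumptions for illustration only.
qmgr -c "set queue batch max_user_queuable = 500"
```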
Completed actions from the previous meeting and Items of Information
- JBH has contacted Oxford about limiting the number of cores that an
interactive Matlab job can get on the MS cluster. They have also said that
this cannot be done. PSH was talking to Mathworks at the e-Science All
Hands meeting last week; they may be able to provide a
not-very-satisfactory workaround.
- Hummingbird visited on 12 September and set up a demonstration system
which has an X proxy running on PSH's Scientific Linux machine that allows X
sessions to be preserved across logons. It also cuts down the traffic to the
desktop.
Next meetings:
10.00 every second Tuesday in the Elms Road Demo Room unless notified of any
changes.