BEAR Technical Group
7 Aug 2007

Present
Aslam Ghumra (IS) [AKG]
Alan Reed (IS) [AR] Chair.
Lawrie Lowe (Physics) [LL]
Jonathan Hunt (IS) [JBH]

Apologies
Paul Hatton (IS) [PSH]
Marcin Mogielnicki (Clustervision) [MM]

 

Introduction

Notes from these meetings will be a concise summary of issues and actions. They will not detail the full discussions that took place in the meetings, nor will the order of issues and actions necessarily reflect the order of discussion.

Ongoing Actions and Matters Arising from previous meetings

  1. Quotes have been received for 1, 2, 4 and 5 additional racks. JBH had asked for a KVM in the rack that would house the Microsoft cluster; we are not sure if this has been included. PSH followed this up with Gerd Busker, who has asked for clarification of our requirements.
    Action JBH: clarify KVM requirements with Gerd Busker
  2. The Windows compute cluster is on a VLAN from one of the main cluster switches. When this cluster is re-racked the cables will be extended with a MUTOA (multi-user telecommunications outlet assembly) under the machine room floor rather than by purchasing a dedicated switch for the MS compute cluster.
  3. ISI came on site with the wrong type of memory to leave as spares; we are still awaiting the correct parts. There has been no further communication from ISI.
    Action AR: Bring this item up at the telephone meeting on the 8th August.
  4. GPFS appears to have recovered but MM is not sure why; he has a great deal of diagnostic information to pass on. Clustervision have opened a one-off support call with IBM for this (so the formal support does not start until after the acceptance tests, as agreed). MM has been running acceptance tests since last Friday; these are expected to take about a week, barring further problems. They have already shown some problems, for example MPI has problems when run across more than 64 nodes, which MM is addressing. Once MM has completed the acceptance tests they will be re-run by Daresbury as a sanity check.
  5. Registrations are carried out manually by Marcin at present. The facility for self-registration has been included in the BIIS system but is not being made available until we release the pilot service.
  6. A standard Windows application (that is, one that need not be MS-cluster aware) needs installing on the Microsoft compute cluster so JBH can work on authentication and access to this cluster when appropriate - the main priority for now is the release of the main cluster. After the meeting, PSH proposed Matlab since it's used fairly widely, has an unlimited-use site licence (so we're not taking licences out of a limited pool) and has perpetual per-version licences (so this work could continue even if, for some reason, we don't continue to centrally fund Matlab support). It is also MS-cluster aware, so it would be useful to follow this up.
    AnsysCFX is cluster-aware and, in the light of PSH's meeting with Ansys and subsequent purchase of parallel licences, may also be a suitable candidate for gaining experience of a cluster-aware application. A decision needs to be made whilst we still have the SRIF funds available.
    Action PSH: identify costs associated with the MS Compute Cluster aware versions of Matlab and AnsysCFX
  7. JBH has contacted a Senior Systems Architect at the Microsoft Institute for HPC at Southampton, who agreed to work with us on developing our MS Compute Cluster, including the requirements to make it available to desktop users via Excel - JBH is awaiting a response from his last contact. JBH will also re-establish contact with Matej Ciesko, who gave the MS Compute Cluster talk at Birmingham earlier this year.
    Action JBH: continue investigating the requirements to link the Microsoft cluster to Microsoft applications such as Excel.
    Action JBH: follow up contact at Southampton.
    JBH will also require full copies of Visual Studio for his work on the cluster.
    Action JBH: place an order with John Owen for copies of Visual Studio
  8. AR has set up a small round-robin cluster using the two unused logon nodes. Each node has its own name and address (bluebear3.bham.ac.uk and bluebear4.bham.ac.uk) but can also be reached via bluebear.bham.ac.uk; an illustrative check of the round-robin name is given at the end of this list. MM seemed unclear about what had been proposed by Gerd Busker and needs to clarify this. (AR has had no response from Gerd.)
    Action AR: oversee the connectivity of the logon nodes
    Action AR: contact Gerd/Alex to establish progress
    Action MM: discuss Gerd Busker's proposal for the logon node connectivity with him
  9. The front-end switch has been racked and has the power and cables connected. It will be brought into service at an appropriate point.
    Action MM/AR: bring this switch fully into service when appropriate
    Action MM: investigate why login3 and login4 cannot talk through the gateway
  10. The 4-way Itanium SMP server may be included in the cluster as a resource to be managed by Torque. It currently has 8 GB of memory for 4 single-core processors; we need the cost of an additional 8 GB. Filestore will be accessed through the export server. AR has provided PSH with details of this server, including the model number and current amount of installed memory.
    Action PSH: find out the cost of additional memory for the Itanium
    Pete Watkins asked via LL why we needed the Itanium at all and who uses it. AKG said that it was heavily used by people with high RAM requirements and those who cannot run work effectively on the CAPPS cluster. LL questioned the wisdom of buying more RAM for the Itanium when we had some fat nodes on the BlueBear cluster and could add RAM to those.
    Action AKG: Identify the current Itanium users and ask if their projects could be transferred to the BlueBear cluster.
    Action ALL: Discuss the use of the Itanium at the next meeting when PSH returns.
  11. There is a monitoring service on one of the management nodes which runs a web server and so makes management information available through a web interface on a 100 Mb connection. There has been discussion about how much, if any, of this information should be publicly accessible; it was decided that, initially at least, it would only be available to us. Clustervision should be involved in any future discussions of the effect of having it more widely available, especially with respect to the load on the web server. This is not urgent and can be addressed after acceptance.
    Action AR: discuss the monitoring service with Clustervision. [DEFERRED until later]
  12. Full installation of Scientific Linux: LL has sent MM a list of required packages. A full installation should be used even though we recommend using the module facility (a brief example of module usage is given at the end of this list). If any packages conflict with the use of the cluster, the package could be removed or its use discouraged.
    Action MM: respond to LL's package list
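
A minimal illustrative check of the round-robin logon name described in item 8, run from any client machine (the hostnames are those given above; the addresses returned are not reproduced here):

    host bluebear.bham.ac.uk     # should return the addresses of both logon nodes
    host bluebear3.bham.ac.uk    # each node also answers under its own name
    host bluebear4.bham.ac.uk

Repeated lookups of bluebear.bham.ac.uk should rotate the order of the two addresses if the round-robin records are in place.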
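
A minimal sketch of the module facility mentioned in item 12 (the package name shown is illustrative only; the actual module names on the cluster may differ):

    module avail            # list software made available through the module system
    module load openmpi     # example name only: load a package's environment settings
    module list             # show which modules are currently loaded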

Any Other Business

  1. LL looked at the partition layout on the login node. The root (/) partition was too small to hold the whole of Scientific Linux, which needs about 10 GB. Also /tmp was far too small, and an area called /local may be redundant. A quick check of the current layout is sketched at the end of this list.
    Action LL: Circulate a recommended partition table.
    Action LL: Check and circulate the partition layout on the worker nodes
  2. The meeting congratulated AKG on passing the Red Hat Certified Engineer examination.
  3. AR pointed out that the man page for showq does not give the Moab version; a sketch of the man page path ordering is given at the end of this list.
    Action MM: ensure the module man pages take precedence
  4. The default name of a user area (currently, for example, /bham/bb/home/ccc/reeda) should be as short as possible.
    Action LL: remind MM that we want a shorter name, e.g. /bham/ccc/reeda
  5. Amendments to AR's draft notes on the pilot service:
    512 nodes should read 256
    Inconsistent use of bluebear/BlueBear/Blue Bear/blue bear, etc.
    Add references to the BlueBear web site and to the standard installation and configuration web sites for PuTTY and Exceed
  6. Wednesday’s phone meeting was discussed; it will take place in Physics West, room 330.
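
A quick way of confirming the partition sizes discussed in AOB item 1 before a recommended layout is circulated (a sketch only; the mount points are those named above):

    df -h / /tmp /local     # current size and usage of the root, /tmp and /local partitions
    du -sh /usr /opt        # rough indication of where the installed software actually sits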
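
A sketch of the man page precedence issue in AOB item 3: the module file for Moab should place its own man directory ahead of the system one; the shell equivalent is roughly the following (the path is a placeholder, not the actual install location):

    export MANPATH=/apps/moab/man:$MANPATH    # placeholder path: put the Moab man pages first
    man showq                                 # should now show the Moab version of the page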

Next meetings:

  • 10.00 each Tuesday in the Elms Road Committee Room unless notified of any changes.

Alan Reed
Information Services