BEAR Technical Group
7 Aug 2007
Present
Aslam Ghumra (IS) [AKG]
Alan Reed (IS) [AR] Chair.
Lawrie Lowe (Physics) [LL]
Jonathan Hunt (IS) [JBH]
Apologies
Paul Hatton (IS) [PSH]
Marcin Mogielnicki (Clustervision) [MM]
Introduction
Notes from these meetings will be a concise summary of issues and actions.
They will not detail the full discussions that took place in the meetings,
nor will the order of issues and actions necessarily reflect the order of
discussion.
Ongoing Actions and Matters Arising from previous meetings
- Quotes have been
received for 1, 2, 4 and 5 additional racks. JBH had asked for a KVM in
the rack that would house the Microsoft cluster; we are not sure if this
has been included. PSH followed this up with Gerd
Busker who has asked for clarification of our requirements.
Action JBH: clarify KVM requirements with Gerd
Busker
- The Windows compute
cluster is on a VLAN from one of the main cluster switches. When this
cluster is re-racked the cables will be extended with a MUTOA under the
machine room floor rather than by purchasing a dedicated switch for the
MS compute cluster.
- ISI came on site
with the wrong type of memory to leave as spares; we are still awaiting
the correct parts. There has been no further communication from ISI.
Action AR: Bring this item up at the telephone meeting on the 8th
August.
- GPFS appears to have
recovered but MM is not sure why - he has a great deal of diagnostic
information for them. Clustervision have
opened a one-off support call with IBM for this (so the formal support
does not start until after the acceptance tests, as agreed). MM has been
running acceptance tests since last Friday which are expected to take about
a week barring further problems. These have already shown some problems,
for example MPI has problems when run across more than 64 nodes, which
MM is addressing. Once MM has complete the acceptance tests they will be
re-run by Daresbury as a sanity check.
- Registrations are
carried out manually by Marcin at present. The
facility for self-registration has been included in the BIIS system but
is not being made available until we release the pilot service.
- A standard Windows
application (that is, one that need not be MS-cluster aware) needs
installing on the Microsoft compute cluster so JBH can work on
authentication and access to this cluster when appropriate - the main
priority for now is the release of the main cluster. After the meeting,
PSH proposed Matlab since it's used fairly widely, has an unlimited-use site license (so we're
not taking licences out of a limited pool) and has perpetual per-version
licences (so this work could continue even if, for some reason, we don't
continue to centrally-fund Matlab support). It
is also MS-cluster aware so it would be useful to follow this up.
AnsysCFX is cluster-aware and, in the light of
PSH's meting with Ansys
and subsequent purchase of parallel licences, may
also be a suitable candidate for gaining experience of a cluster-aware
application. A decision needs to be made whilst we still have the SRIF
funds available.
Action PSH: identify costs associated with the MS Compute Cluster
aware versions of Matlab and AnsysCFX
- JBH has contacted a
Senior Systems Architect at the Microsoft Institute for HPC at
Southampton who agreed to work with us on developing our MS Compute
Cluster, including the requirements to make it available to desktop
users via. Excel - JBH is awaiting a response from his last contact. JBH
will also re-establish contact with Matej Ciesko, who gave the MS Compute Cluster talk at Birmingham earlier
on this year.
Action JBH: continue investigating the requirements to link the
Microsoft cluster to Microsoft applications such as Excel.
Action JBH: follow up contact at Southampton.
JBH will also require full copies of Visual Studio for his work
on the cluster
Action JBH: place an order with John Owen for copies of Visual
Studio
- AR has set up a
small round-robin cluster using the two unused logon nodes. Each node
has its own name and address - bluebear3.bham.ac.uk and
bluebear4.bham.ac.uk) but can also be reached via. bluebear.bham.ac.uk.
MM seemed unclear about what had been proposed by Gerd
Busker and needs to clarify this.(AR has had no response from Gerd)
Action AR: oversee the connectivity of the logon nodes
Action AR: contact Gerd/Alex to establish
progress
Action MM: discuss Gerd Buskers proposal for
the logon node connectivity with Gerd
- The front-end switch
is has been racked and has the power and cables connected. It will be
brought into service at an appropriate point.
Action MM/AR: bring this switch fully into service when appropriate
Action: MM to investigate why login3 and 4 cannot talk through the
gateway
- The 4-way Itanium
SMP server may be included in the cluster as a resource to be managed by
torque. It currently has 8GB of memory for 4 single-core processors; we
need the cost of an additional 8 GB. Filestore
will be accessed through the export server. AR has provided PSH with details
of this server, including the model number and current amount of
installed memory.
Action PSH: find out the cost of additional memory for the Itanium
Pete
Watkins asked via LL why we needed the Itanium at all and who uses it.
AKG said that it was heavily used by people with high ram requirements
and those that cannot run work effectively on the CAPPS cluster. LL
questioned the wisdom add buying more ram for the Itanium when we had
some fat nodes on the bluebear cluster and
could add ram to those.
Action AKG:
Identify the current Itanium users and ask if their projects could be
transferred to the blue bear cluster.
Action ALL: Discuss the use of the Itanium at the next meeting when PSH
returns.
- There is a
monitoring node on one of the management nodes which runs a web server
and so makes management information available through a web interface on
a 100 Mb connection. There has been discussion about how much, if any,
of this information should be publicly accessible; it was decided that,
initially at least, it would only be available to us. Clustervision should be involved in any future
discussions of the effect of having it more widely available, especially
with respect to the load on the web server. This is not urgent and can
be addressed after acceptance.
Action AR: discuss monitoring node with Clustervision.[DEFERRED
till later]
- Full installation of
scientific linux.: LL has sent MM a list of required packages. A full
installation should used even though we recommend using the module
facility. If any packages conflict with the use of the cluster the
package could be removed or its use discouraged.
ACTION:MM to respond
to LL’s package list
Any Other Business
- LL looked at the
partition layout on the login node. The root(/)
partition was too small to fit the whole of scientific linux
which needs about
10GBytes. Also /tmp was far too small and an
area called /local may be redundant.
Action LL: Circulate a recommended partition table.
Action LL: Check and circulate the partition layout on the worker nodes
- The meeting congratulated
AKG on passing the Redhat Certified Engineer
examination.
- AR pointed out that
the man page for showq
does not give the moab version
ACTION:MM to ensure the module man pages take precedence
- The default name of
a user area currently, for example /bham/bb/home/ccc/reeda
,should be as short as possible
Action LL: To remind MM that we want a shorter name e.g. /bham/ccc/reeda
- Amendments to draft
notes by AR on the Pilot service:-
512 nodes should read 256
Inconsistent use of bluebear/BlueBear/Blue Bear/ blue bear/ etc
Add reference to blue bear web site and standard installation and
configuration web sites for putty and exceed
- Wednesday’s
phone meeting was discussed and it will take place in Physics West room
330
Next meetings:
- 10.00 each Tuesday
in the Elms Road Committee Room unless notified of any changes.
Alan Reed
Information Services
|