Title: bear-tech-comm: meeting notes
BEAR Technical Group
26 February 2008
Present
Lawrie Lowe (Physics) [LL]
Alan Reed (IT Services) [AR], note taker
Aslam Ghumra (IT Services) [AKG]
Marcin Mogielnicki (Clustervision) [MM]
Apologies
Paul Hatton (IT Services) [PSH]
Jon Hunt (IT Services) [JBH]
Introduction
Notes from these meetings will be a concise summary of issues and actions.
They will not detail the full discussions that took place in the meetings, nor
will the order of issues and actions necessarily reflect the order of
discussion.
Ongoing Actions and Matters Arising from previous meetings
- Action AKG: link his pbsdsh help page from elsewhere in the BEAR web site.
AKG has a help page within the BEAR web site based on LL's pbsdsh web page. DONE (but there is a UCMS issue) - removed
- Action AKG: set up FAQ linked from BEAR web site
An FAQ on the BEAR web site would be useful for users to find solutions to commonly-encountered problems. This would usually link to other parts of the web site to avoid duplication of information.
DONE.
- Action AKG/AR: discuss user wiki with Roy Pearce
A user wiki could be useful for users to share experiences; this came up at a meeting with Computer Science on 18 Feb.
1. We are using the current wiki on sun19 to log events, record procedures, and share know-how.
2. After the meeting Chris Bayliss said that he did not want a wiki on bluebear; sun19 should be used instead. AR pointed out to Chris that we need at least three different wikis: one for the management of bluebear, one for bluebear users to share tips and tricks, and one for Computer Science and general use on campus. AR has discussed improving security on the current wiki with Roy Pearce.
- Action AR: interface the Abaqus documentation to the web server
running on the monitor node.
The documentation server needs configuring to serve the Abaqus documentation - PSH will discuss the details
with AR outside this meeting.
- Action AR: notify helpdesk about the withdrawal of the ssh1 protocol.
The ssh1 protocol has been withdrawn from the logon nodes; this has not led to any user queries. DONE
- current status of the Microsoft cluster:
- Action JBH: chase Miles Deegan (Microsoft) to specify what options
are available with various software configurations on the MS
cluster.
We do need SharePoint server to publish and calculate a spreadsheet, although other levels of functionality may be available without this, such as using User Defined Functions (UDFs) in Excel, written in a language such as C, to run calculations on the cluster. These options need clarifying.
- Action JBH: give PSH the IP address of the MS cluster head and
NAT-ed slave nodes
The Matlab licences for the exemplar on the
Microsoft cluster have been received; PSH will restrict them to the
appropriate IP addresses when JBH has passed these on.
- Action JBH: set up samba exports for Simon Fitch and
PSH
JBH clarified that it was integration with active directory,
rather than setting up any samba exports, that was problematic. He is able to
set up a small number of such exports, for named users. PSH asked that he and Simon Fitch be given such exports, primarily for use in the Visualisation Centre.
- Action LL: circulate an example of pbsdsh use
PSH was
unclear when use of pbsdsh was appropriate, as compared to other ways of
running parallel jobs. An example would help.
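As a starting point for such an example, the sketch below shows a minimal PBS job script using pbsdsh; the resource values and the task script name are invented for illustration, and this is not LL's actual example. pbsdsh suits embarrassingly parallel work where each task runs independently, without MPI communication.

```shell
#!/bin/bash
# Hypothetical PBS job script; node counts, walltime, and task.sh are
# illustrative only.
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR

# pbsdsh spawns the named command once per allocated processor slot
# (8 copies here), with no MPI involvement.
pbsdsh $PBS_O_WORKDIR/task.sh
```

Each spawned copy can inspect the PBS_VNODENUM environment variable to decide which piece of the work to do.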
- Action LL: contact Cathie Dingwall in IS to arrange helpdesk
training.
LL now has an e-helpdesk license. It was suggested that
LL should receive some training on the system before he was made visible on it
and hence could have calls referred to him.
- Action LL: run his scripts in warning mode to flag excessive CPU use on the
logon nodes to the list.
LL said that his script emails him to flag any user using more than 15 minutes of CPU time on the logon nodes. At present this doesn't take account of how long the user has been logged on.
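As a sketch of the kind of check such a script performs (this is not LL's actual code; the threshold and the input format are assumed), a filter that flags users over a CPU-time limit might look like:

```shell
# Sketch only - not LL's actual script. Flags any user whose accumulated
# CPU seconds on a logon node exceed a threshold (15 minutes here).
THRESHOLD_SECS=$((15 * 60))

flag_heavy_users() {
    # Expects "user cpu_seconds command" lines on stdin, as could be
    # derived from ps output; converting real ps time formats is omitted.
    awk -v limit="$THRESHOLD_SECS" \
        '$2 > limit { print "WARNING: " $1 " used " $2 "s CPU (" $3 ")" }'
}

# Demonstration with fabricated input (not real ps output):
printf '%s\n' "alice 1200 matlab" "bob 300 bash" | flag_heavy_users
# -> WARNING: alice used 1200s CPU (matlab)
```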
ONGOING: modifications to be made with regard to memory consumption.
- Action LL/AR: discuss usage reports generated by MOAB and other
scripts and refer any issues to ClusterResources
The usage reports
generated by MOAB and other scripts appear to be inconsistent in places. LL is
in contact with the author of one of the inconsistent scripts.
- Action LL/AR: bring forward proposals on controlling quota for projects by the next meeting.
A decision on whether to use filesets or some other mechanism to apply quota to shared project filespace is required urgently.
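If filesets are the chosen mechanism, and assuming the project filespace is GPFS (the device name and path below are invented), the per-project setup would look roughly like:

```shell
# Hypothetical GPFS commands; "gpfs0" and the path are made up.
mmcrfileset gpfs0 projA                          # create a fileset for the project
mmlinkfileset gpfs0 projA -J /bb/projects/projA  # attach it at the project directory
mmedquota -j gpfs0:projA                         # edit block/inode limits for the fileset
```

The attraction of filesets is that the quota then applies to the project directory as a whole, independent of which user wrote the files.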
- Action AKG: set up wiki on the cluster
A wiki on the cluster, to log changes to the system for example, would be useful. The monitor server would probably be the best home for this. The wiki server on sun19 is being used, but this has the drawback of being publicly viewable, although it is not advertised. SEE item 4 above.
- Action MM: arrange for delivery of on-site spares
On-site spares were awaited from ISI; these have now arrived. DONE - removed
- Action PSH: lead discussion of service profile at service
meetings.
- Action AR: continue discussions with ClusterResources on scheduling.
There are many ways of configuring the scheduling system depending on the type of service we offer, which is a policy issue rather than a technical one. There is an ongoing dialogue between AR and ClusterResources.
- Action MM/AR: discuss Phase2 delays with John Owen
MM
reported that the delivery of Phase 2 is being delayed until week beginning 3
March due to delays in cabling the racks. This will have implications for the
payment schedule, especially bearing in mind the SRIF spending requirements.
AR and MM will discuss this with John Owen.
- Action PSH: install the Exceed-on-demand software on his desktop Red
Hat PC and a Windows PC to see if it is appropriate for BLUEBear
Exceed (the site-license Windows X server) is a heavyweight X server
(which, in the X world, runs on the user's PC). There is also an application
called Exceed-on-demand which is a lightweight server, transferring most of
the work to the machine that is being connected to (for example, the logon
node). This claims to be easier for the end-user to install and configure than
the full X server, although we have no experience of this. PSH has spoken to a contact in Hull who uses this mainly for system administration from a Windows desktop; he has confirmed that a desktop session persists across a reboot of the desktop machine and, by implication, could be picked up from another desktop.
- Action all: feed back comments about PSH's proposed user guideline document. DONE
- Action PSH: update this document in the light of comments received and the current service.
PSH has put a proposed guideline for users
document on the project website, which will be made available on the BEAR web
site in due course.
- Standing Action MM: Report any hardware faults
(1 node down due to kernel panic)
(note that Unix messages for all nodes are directed to file001:/var/log/messages)
- Standing Action all: discuss any user issues
- Standing action LL/AR: present and discuss user
statistics
PSH said that usage by unique active users and
departments, to show the spread of users, and also by job size would be
useful.
Any Other Business
- Action AR: consider options for backup copies of the Tivoli
database
The Tivoli database is crucial to being able to restore
backups from Tivoli; there should be more than one copy of this file in
diverse locations.
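One option, assuming the TSM administrative client is available (the administrator ID and device class name below are invented), is to take regular full database backups and copy the resulting volume elsewhere:

```shell
# Hypothetical TSM admin command; admin credentials and the DBBACK
# device class are placeholders, not our actual configuration.
dsmadmc -id=admin -password=secret "backup db type=full devclass=DBBACK"
# The volume written by the backup would then be copied to a second,
# physically separate location (e.g. another server or off-site media).
```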
- A discussion took place about access control of projects. Multiple writers would be needed for some projects.
- Action AR: set up a dummy project so that access control ideas can be tested
The following are required prior to the release of the full
service:
- Action PSH/AR, in conjunction with MM: specify default module
set
There are currently many modules loaded by default, not all of
which will be needed by any one user. This leads to an unwieldy default PATH
and library search path. Any changes to the default user environment will need
to be in place prior to the full service.
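Trimming the default set can be sketched with the environment modules commands; the module names below are placeholders, not a proposal for the actual default set.

```shell
# Sketch of establishing a leaner default environment; module names
# are placeholders only.
module purge              # clear everything loaded by default
module load torque moab   # reload only what most users actually need
module list               # confirm the resulting environment
```

The effect is a shorter default PATH and library search path, with users loading application-specific modules themselves.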
- Action PSH: install applications and help
pages
Installation of applications and help pages that are currently available on capps or the e-science cluster is ongoing.
- Action MM: implement and validate failover of logon
nodes
Failover on the logon nodes is in place, as two groups of two, but not activated. One of the groups will be used by MM for developing/validating the failover. This is to be done in Phase 2.
- Action JBH: define and implement backup policy
This forms part of JBH's backup group.
- Action MM: finalise the allinea configuration
- Action PSH: produce help page for allinea
Allinea has been installed. MM needs to work on the configuration files. The debugger is OK; MM is working on integrating mpiexec with the optimiser - it expects ssh access. MM is receiving good support from Allinea on this, but there are some bugs that will not be fixed until the next release, due in April.
Items of Information and completed actions from the previous
meeting
- PSH asked about the priority on different queues in MOAB; at present they
are all the same.
- Discussion about how, or indeed if, short jobs should be given a high
priority for quick test runs. Of course, the fair share system would ensure
that this could not be abused.
- The User Forum will be held on 11 March. AKG has almost completed the
agenda and poster.
- The Tivoli license has been installed. /bb has been backed up; about 2.5TB
took about 3 days to back up.
- The action on PSH to chase Martyn Guest for pre-user-service benchmarks
will be taken off.
- We are still awaiting a replacement disk for u1n002.
- There has been a recent helpdesk query raised by an Abaqus user, who had misunderstood the error messages that he was getting.
- Project space has been set up for William Edmondson from Computer Science.
AR is arranging to transfer his data, requiring the use of his own Mac mini over FireWire.
- A user request for 200TB quota has been received and will be implemented.
Next meetings:
10.00 every Tuesday in the Elms Road Demo Room unless notified of any
changes.
Alan Reed
IT Services