Title: bear-tech-comm: meeting notes

BEAR Technical Group
26 February 2008

Present
Lawrie Lowe (Physics) [LL]
Alan Reed (IT Services) [AR] note taker
Aslam Ghumra (IT Services) [AKG]
Marcin Mogielnicki (Clustervision) [MM]

Apologies
Paul Hatton (IT Services) [PSH]
Jon Hunt (IT Services) [JBH]

Introduction

Notes from these meetings will be a concise summary of issues and actions. They will not detail the full discussions that took place in the meetings, nor will the order of issues and actions necessarily reflect the order of discussion.

Ongoing Actions and Matters Arising from previous meetings

  1. Action AKG: link his pbsdsh help page from elsewhere in the BEAR web site.
    AKG has a help page within the BEAR web site, based on LL's pbsdsh web page. DONE (but there is a UCMS issue)
  2. removed
  3. Action AKG: set up FAQ linked from Bear web site
    An FAQ on the BEAR web site would be useful for users to find solutions to commonly-encountered problems. This would usually link to other parts of the web site to avoid duplication of information. DONE.
  4. Action AKG/AR: discuss user wiki with Roy Pearce
    A user wiki could be useful for users to share experiences; this came up at a meeting with Computer Science on 18 Feb.
    .1 We are using the current wiki on sun19 to log events, record procedures, and share knowhow.
    .2 After the meeting Chris Bayliss said that he did not want a wiki on bluebear; sun19 should be used
    instead. AR pointed out to Chris that we need at least three different wikis: one for management of bluebear,
    one for bluebear users to share tips and tricks, and one for Computer Science and general use on campus. AR has discussed
    improving security on the current wiki with Roy Pearce.
     
  5. Action AR: interface the Abaqus documentation to the web server running on the monitor node.
    The documentation server needs configuring to serve the Abaqus documentation - PSH will discuss the details with AR outside this meeting.
  6. Action AR: notify helpdesk about the withdrawal of the ssh1 protocol.
    The ssh1 protocol has been withdrawn from the logon nodes; this has not led to any user queries. DONE
  7. current status of the Microsoft cluster:
    1. Action JBH: chase Miles Deegan (Microsoft) to specify what options are available with various software configurations on the MS cluster.
      We do need SharePoint server to publish and calculate a spreadsheet, although other levels of functionality may be available without this, such as using User Defined Functions (UDFs) in Excel, written in a language such as C, to run calculations on the cluster. These options need clarifying.
    2. Action JBH: give PSH the IP address of the MS cluster head and NAT-ed slave nodes
      The Matlab licences for the exemplar on the Microsoft cluster have been received; PSH will restrict them to the appropriate IP addresses when JBH has passed these on.
  8. Action JBH: set up samba exports for Simon Fitch and PSH
    JBH clarified that it was integration with active directory, rather than setting up any samba exports, that was problematic. He is able to set up a small number of such exports, for named users. PSH asked that he and Simon Fitch could have such exports, primarily for use in the Visualisation Centre.
  9. Action LL: circulate an example of pbsdsh use
    PSH was unclear when use of pbsdsh was appropriate, as compared to other ways of running parallel jobs. An example would help.
  10. Action LL: contact Cathie Dingwall in IS to arrange helpdesk training.
    LL now has an e-helpdesk license. It was suggested that LL should receive some training on the system before he was made visible on it and hence could have calls referred to him.
  11. Action LL: run his scripts in warning mode to flag excessive CPU use on the logon nodes to the list.
    LL said that his script emails him to flag any user using more than 15 minutes CPU on the logon nodes. At present this doesn't take account of how long the user has been logged on. ONGOING: modifications are to be made with regard to memory consumption.
  12. Action LL/AR: discuss usage reports generated by MOAB and other scripts and refer any issues to ClusterResources
    The usage reports generated by MOAB and other scripts appear to be inconsistent in places. LL is in contact with the author of one of the inconsistent scripts.
  13. Action LL/AR: bring forward proposals on controlling quota for projects by the next meeting.
    A decision on whether to use filesets or some other mechanism to apply quota to shared project filespace is required urgently.
  14. Action AKG: set up wiki on the cluster
    A wiki on the cluster to, for example, log changes to the system would be useful. The monitor server would probably be the best home for this. The wiki server on sun19 is being used, but this has the drawback of being publicly viewable, although it is not advertised. See item 4 above.
  15. Action MM: arrange for delivery of on-site spares
    On-site spares were awaited from ISI; these have now arrived. DONE
  16. removed
  17. Action PSH: lead discussion of service profile at service meetings.
    Action AR: continue discussions with ClusterResources on scheduling.

    There are many ways of configuring the scheduling system depending on the type of service we offer, which is a policy, rather than a technical, issue. There is an ongoing dialogue between AR and ClusterResources.
  18. Action MM/AR: discuss Phase2 delays with John Owen
    MM reported that the delivery of Phase 2 is being delayed until week beginning 3 March due to delays in cabling the racks. This will have implications for the payment schedule, especially bearing in mind the SRIF spending requirements. AR and MM will discuss this with John Owen.
  19. Action PSH: install the Exceed-on-demand software on his desktop Red Hat PC and a Windows PC to see if it is appropriate for BLUEBear
    Exceed (the site-license Windows X server) is a heavyweight X server (which, in the X world, runs on the user's PC). There is also an application called Exceed-on-demand which is a lightweight server, transferring most of the work to the machine that is being connected to (for example, the logon node). This claims to be easier for the end-user to install and configure than the full X server, although we have no experience of this. PSH has spoken to a contact in Hull who use this mainly for system administration through a Windows desktop; he has confirmed that a desktop persists across the desktop machine reboot and, by implication, could be picked up from another desktop.
  20. Action all: feed back comments about PSH's proposed user guideline document. DONE
    Action PSH: update this document in the light of comments received and the current service.

    PSH has put a proposed user guideline document on the project website, which will be made available on the BEAR web site in due course.
  21. Standing Action MM: Report any hardware faults (1 node down due to kernel panic)
    (note that unix messages for all nodes are directed to file001:/var/log/messages)
  22. Standing Action all: discuss any user issues
  23. Standing action LL/AR: present and discuss user statistics
    PSH said that usage by unique active users and departments, to show the spread of users, and also by job size would be useful.
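
The CPU check described in item 11 above could be sketched roughly as follows. This is not LL's script: it is a minimal Python illustration that assumes CPU time is summed over all of a user's processes, and that the input is two-column "user cputime" text in the [DD-]HH:MM:SS form printed by `ps -eo user,time`. The names parse_cpu_time and heavy_users are hypothetical.

```python
from collections import defaultdict

# Threshold taken from item 11: 15 minutes of CPU on a logon node.
CPU_LIMIT_SECONDS = 15 * 60


def parse_cpu_time(field):
    """Convert a ps-style [DD-]HH:MM:SS string to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(p) for p in field.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def heavy_users(ps_lines, limit=CPU_LIMIT_SECONDS):
    """Return {user: total_cpu_seconds} for users above the limit.

    Assumption: usage is summed across all of a user's processes.
    """
    totals = defaultdict(int)
    for line in ps_lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        user, cpu = parts
        try:
            totals[user] += parse_cpu_time(cpu)
        except ValueError:
            continue  # skips the header line and anything unparseable
    return {user: total for user, total in totals.items() if total > limit}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = ["USER TIME", "alice 00:10:00", "alice 00:06:30", "bob 00:01:00"]
    for user, seconds in heavy_users(sample).items():
        print(f"{user}: {seconds // 60} min CPU")
```

A version that also took account of how long the user has been logged on, or of memory consumption, would need further columns from ps and is not attempted here.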

Any Other Business

  1. Action AR: consider options for backup copies of the Tivoli database
    The Tivoli database is crucial to being able to restore backups from Tivoli; there should be more than one copy of this file in diverse locations.
  2. A discussion took place about access control of projects. Multiple writers of some projects would be needed.
    Action AR: set up a dummy project so that access control ideas can be tested

The following are required prior to the release of the full service:

  1. Action PSH/AR, in conjunction with MM: specify default module set
    There are currently many modules loaded by default, not all of which will be needed by any one user. This leads to an unwieldy default PATH and library search path. Any changes to the default user environment will need to be in place prior to the full service.
  2. Action PSH: install applications and help pages
    Installation of applications and help pages that are currently available on capps or the e-science cluster is ongoing.
  3. Action MM: implement and validate failover of logon nodes
    Failover on the logon nodes is in place, as two groups of two, but not activated. One of the groups will be used by MM for developing and validating the failover. This is to be done in Phase 2.
  4. Action JBH: define and implement backup policy
     - this forms part of JBH's backup group.
  5. Action MM: finalise the allinea configuration
    Action PSH: produce help page for allinea
    Allinea has been installed. MM needs to work on the configuration files. The debugger is OK; MM is working on integrating mpiexec with the optimiser (it expects ssh access). MM is receiving good support from Allinea on this but there are some bugs that will not be fixed until the next release, due in April.

Items of Information and completed actions from the previous meeting

  1. PSH asked about the priority on different queues in MOAB; at present they are all the same.
  2. Discussion about how, or indeed if, short jobs should be given a high priority for quick test runs. Of course, the fair share system would ensure that this could not be abused.
  3. The User Forum will be held on 11 March. AKG has almost completed the agenda and poster.
  4. The Tivoli license has been installed. /bb has been backed up; about 2.5TB took about 3 days to back up.
  5. The action on PSH to chase Martyn Guest for pre-user-service benchmarks will be taken off.
  6. We are still awaiting a replacement disk for u1n002
  7. There has been a recent helpdesk query raised by an Abaqus user, who has misunderstood the error messages that he was getting.
  8. Project space has been set up for William Edmondson from Computer Science. AR is arranging to transfer his data, requiring the use of his own Mac mini over firewire.
  9. A user request for 200TB quota has been received and will be implemented.
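
For reference, the backup figures in item 4 above imply an average rate of a little under 10 MB/s. The short Python sketch below shows the arithmetic; it assumes decimal units (1 TB = 10^12 bytes) and treats the "about 2.5TB" and "about 3 days" figures as exact, so the result is only indicative. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86400


def throughput_mb_per_s(size_tb, days):
    """Average transfer rate in MB/s, assuming 1 TB = 1e12 bytes."""
    total_bytes = size_tb * 1e12
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    rate = throughput_mb_per_s(2.5, 3)
    print(f"Average backup rate: {rate:.1f} MB/s")
```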

Next meetings:

10.00 every Tuesday in the Elms Road Demo Room unless notified of any changes.


Alan Reed
IT Services