HECToR Monthly Report, February 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 February 2009 to 08:00 1 March 2009
Number of hours: 672

1: Availability

Scheduled down time: 15.8 hours.

Incidents

The following incidents were recorded:

Severity  Number
1         6
2         0
3         25
4         1

Of the four severity levels, level 1 corresponds to a contractual failure.

Of the 25 severity level 3 incidents, 24 were attributed to single node failures.

Details of severity level 1 incidents

ID             Date        Description                             Length (hh:mm)  Attribution
Incident-856   03/02/2009  Suspected blower failure in cab 0-0     01:32           Cray
Incident-891   09/02/2009  System failure after SMW reboot         02:13           Cray
Incident-916   10/02/2009  Lustre problems cause service collapse  03:50           Cray
Incident-946   16/02/2009  Voltage fault on module takes out HSN   03:12           Cray
Incident-1001  25/02/2009  Link inactive error - HSN collapses     03:22           Cray
Incident-1011  26/02/2009  Plant problems                          09:41           Site

MTBF and Serviceability

Attribution  Failures  MTBF (hours)  UDT (hh:mm:ss)  Serviceability
Cray         5         146           14:09:00        97.8%
Site         1         732           09:41:00        98.5%
External     0                       00:00:00        100%
Other        0                       00:00:00        100%
Overall      6         122           23:50:00        96.4%
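  • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is the wall clock time (the number of hours covered by this report), SDT is the scheduled down time and UDT is the unscheduled down time.
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/number of failures.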
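
As a rough check of Notes 1 and 2, the sketch below recomputes the serviceability and MTBF figures from the UDT and failure counts in the table above. It assumes WCT is the 672 hours covered by this report and SDT is the 15.8 hours of scheduled down time. The function names (hms_to_hours, serviceability, mtbf) are illustrative only and are not part of any HECToR or SAFE tooling.

```python
from typing import Optional

HOURS_PER_MONTH = 732  # constant used in Note 2


def hms_to_hours(hms: str) -> float:
    """Convert an 'hh:mm:ss' duration string to decimal hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def serviceability(wct: float, sdt: float, udt: float) -> float:
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), as a percentage."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures: int) -> Optional[float]:
    """Note 2: 732/number of failures; undefined (None) for zero failures."""
    if failures == 0:
        return None
    return HOURS_PER_MONTH / failures


WCT = 672.0  # assumption: hours covered by this report
SDT = 15.8   # assumption: scheduled down time from section 1

# (failures, UDT) pairs copied from the MTBF and Serviceability table
rows = {
    "Cray": (5, "14:09:00"),
    "Site": (1, "09:41:00"),
    "Overall": (6, "23:50:00"),
}

results = {}
for name, (failures, udt) in rows.items():
    pct = serviceability(WCT, SDT, hms_to_hours(udt))
    results[name] = (round(pct, 1), mtbf(failures))
    print(f"{name}: serviceability {pct:.1f}%, MTBF {mtbf(failures):.1f} hours")
```

With these inputs the serviceability values round to 97.8%, 98.5% and 96.4%, in line with the table. The Cray MTBF comes out at 146.4 hours, which the table reports as 146.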

Details of XT single node failures

Error Type                                            Number
RX message header CRC Error                           7
MCA bk0/bk2 error (Internal Opteron)                  5
UME bk0/bk4 (DIMMs)                                   7
Heartbeat error                                       2
Coldstart error (node failing to start after reboot)  3

2: Courses

This information is supplied by NAG Ltd
Title of Course                                                 Dates                  Available places  Ordinary attendees  Paying attendees  CSE Staff  Total attending
Introduction to HECToR, NAG Oxford                              5 February 2009        12                1                   0                 1          2
Tools and Techniques for Optimising Parallel Codes, NAG Oxford  9 - 11 February 2009   12                3                   0                 3          6
Fortran 95, NAG Manchester                                      24 - 26 February 2009  30                15                  0                 0          15

3: Quality tokens

Date Token Comment
02-Feb-2009 10:52:56 ***** Excellent dedicated team. Very professional and expert service.
09-Feb-2009 10:01:48 • • • • • I have noticed that Hector has been down every 4-7 days since Christmas. This poor service is exacerbated by fortnightly maintenance sessions, planned to take place during peak user hours - who thought of that! The service has been detrimental to my work
14-Feb-2009 00:13:38 *****  
17-Feb-2009 22:36:10 ***** The support team of Hector is really good.
26-Feb-2009 10:12:12 • • • • •  Hector seems to be even less reliable than before! I noted Hector was down at 6pm yesterday, 8pm yesterday, still down at 8am this morning and now we are told it may be another 6 hours before service is restored. Could engineers be called in earlier?

4: Hours worked

Group  Days worked  FTEs
USL    71.5         4.0
OSG    73.9         4.2

5: Performance metrics

Technology Provision

Description                     TSL         FSL          Value
Technology reliability          85%         98.5%        97.8%
Technology throughput           7000 hours  8367 hours   8309 hours
Capability job completion rate  70%         90%          100%
Technology MTBF                 100 hours   126.4 hours  146 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures
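
The throughput formula in the note above can be sketched as follows. The report does not state which UDT figure feeds this metric, so the example assumes the overall UDT of 23:50:00 from the MTBF and Serviceability table, together with the 15.8 hours of scheduled down time. The function name technology_throughput is illustrative only.

```python
HOURS_PER_MONTH = 732  # annual average number of hours in a month, per the note


def technology_throughput(udt_hours: float, sdt_hours: float) -> float:
    """Annualised throughput in hours: 12*(732-UDT-SDT)."""
    return 12 * (HOURS_PER_MONTH - udt_hours - sdt_hours)


# Assumption: overall UDT (23:50:00) and SDT (15.8 hours) as published in section 1
udt = 23 + 50 / 60
sdt = 15.8

throughput = technology_throughput(udt, sdt)
print(f"Technology throughput: {throughput:.1f} hours")
```

Under these assumptions the result is about 8308.4 hours, close to the reported 8309 hours. The small gap may come from rounding in the published SDT figure.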

Service Provision

Description                                                 TSL  FSL  USL    Value
Percentage of non-in-depth queries resolved within one day  85%  97%  99%    100%
Number of SP FTEs                                           7.3  8.0  8.7    8.2
SP serviceability                                           80%  99%  99.5%  98.5%