The HECToR Service is now closed and has been superseded by ARCHER.

HECToR Monthly Report, February 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 February 2009 to 08:00 1 March 2009
Number of hours: 672

1: Availability

Scheduled down time: 15.8 hours.

Incidents

The following incidents were recorded:

Severity  Number
1         6
2         0
3         25
4         1

Of the four severity levels, level 1 corresponds to a contractual failure.

Out of the 25 SEV-3 Incidents, 24 were attributed to single node failures.

Details of severity level 1 incidents

ID Date Description Length Attribution
Incident-856 03/02/2009 Suspected blower failure in cab 0-0 01:32 Cray
Incident-891 09/02/2009 System failure after SMW reboot 02:13 Cray
Incident-916 10/02/2009 Lustre problems cause service collapse 03:50 Cray
Incident-946 16/02/2009 Voltage fault on module takes out HSN 03:12 Cray
Incident-1001 25/02/2009 Link inactive error - HSN collapses 03:22 Cray
Incident-1011 26/02/2009 Plant problems 09:41 Site

MTBF and Serviceability

Attribution  Failures  MTBF  UDT       Serviceability
Cray         5         146   14:09:00  97.8%
Site         1         732   09:41:00  98.5%
External     0         -     00:00:00  100%
Other        0         -     00:00:00  100%
Overall      6         122   23:50:00  96.4%
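  • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is the wall-clock time, SDT the scheduled down time and UDT the unscheduled down time, all in hours.
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures, where 732 is the annual average number of hours in a month.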
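
The two notes above can be checked against the table with a short calculation. The following is a minimal Python sketch, not part of the original report: the function names are illustrative, WCT is assumed to be the 672 hours covered by this report, and SDT is taken as the rounded 15.8 hours quoted under Availability.

```python
from typing import Optional

# Annual average number of hours in a month, as used in the report's notes.
HOURS_PER_MONTH = 732.0


def hms_to_hours(hms: str) -> float:
    """Convert an HH:MM:SS duration string to decimal hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def serviceability(wct: float, sdt: float, udt: float) -> float:
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), with all arguments in hours."""
    return 100.0 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures: int) -> Optional[float]:
    """Note 2: 732/number of failures; undefined (None) when there are none."""
    if failures == 0:
        return None
    return HOURS_PER_MONTH / failures


WCT = 672.0  # hours in the reporting period (assumed to be the WCT of Note 1)
SDT = 15.8   # scheduled down time in hours, as quoted (rounded) in the report

rows = [("Cray", 5, "14:09:00"), ("Site", 1, "09:41:00"), ("Overall", 6, "23:50:00")]
for name, failures, udt in rows:
    pct = serviceability(WCT, SDT, hms_to_hours(udt))
    print(f"{name}: serviceability {pct:.1f}%, MTBF {mtbf(failures):.0f} hours")
```

Under these assumptions the sketch gives 97.8%, 98.5% and 96.4% serviceability and MTBF values of 146, 732 and 122 hours, matching the Cray, Site and Overall rows.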

Details of XT single node failures

Error Type Number
RX message header CRC Error 7
MCA bk0/bk2 error (Internal Opteron) 5
UME bk0/bk4 (Dimms) 7
Heartbeat error 2
Coldstart error (node failing to start after reboot) 3

2: Courses

This information is supplied by NAG Ltd
Title of Course                                                 Dates                   Available places  Ordinary attendees  Paying attendees  CSE Staff  Total attending
Introduction to HECToR, NAG Oxford                              5 February 2009         12                1                   0                 1          2
Tools and Techniques for Optimising Parallel Codes, NAG Oxford  9 - 11 February 2009    12                3                   0                 3          6
Fortran 95, NAG Manchester                                      24 - 26 February 2009   30                15                  0                 0          15

3: Quality tokens

Date Token Comment
02-Feb-2009 10:52:56 ***** Excellent dedicated team. Very professional and expert service.
09-Feb-2009 10:01:48 • • • • •  I have noticed that Hector has been down every 4-7 days since Christmas. This poor service is exacerbated by fortnightly maintenance sessions, planned to take place during peak user hours - who thought of that! The service has been detrimental to my work
14-Feb-2009 00:13:38 *****  
17-Feb-2009 22:36:10 ***** The support team of Hector is really good.
26-Feb-2009 10:12:12 • • • • •  Hector seems to be even less reliable than before! I noted Hector was down at 6pm yesterday, 8pm yesterday, still down at 8am this morning and now we are told it may be another 6 hours before service is restored. Could engineers be called in earlier?

4: Hours worked

Group  Days worked  FTEs
USL    71.5         4.0
OSG    73.9         4.2

5: Performance metrics

Technology Provision

Description TSL FSL Value
Technology reliability 85% 98.5% 97.8%
Technology throughput 7000 hours 8367 hours 8309 hours
Capability job completion rate 70% 90% 100%
Technology MTBF 100 hours 126.4 hours 146 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures
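
As a worked check of the throughput note, here is a minimal Python sketch, not part of the original report. The function name is illustrative, UDT is assumed to be the overall unscheduled down time of 23:50:00, and SDT is the rounded 15.8 hours quoted under Availability.

```python
# Annual average number of hours in a month, as stated in the note.
HOURS_PER_MONTH = 732.0


def technology_throughput(udt_hours: float, sdt_hours: float) -> float:
    """Annualised throughput in hours: 12*(732-UDT-SDT)."""
    return 12 * (HOURS_PER_MONTH - udt_hours - sdt_hours)


overall_udt = 23 + 50 / 60  # 23:50:00 expressed in decimal hours
sdt = 15.8                  # scheduled down time in hours (rounded figure)

print(round(technology_throughput(overall_udt, sdt), 1))
```

With these inputs the result is about 8308 hours, close to the reported value of 8309 hours. The small difference may come from the scheduled down time being rounded to 15.8 hours in this report.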

Service Provision

Description TSL FSL USL Value
Percentage of non-in-depth queries resolved within one day 85% 97% 99% 100%
Number of SP FTEs 7.3 8.0 8.7 8.2
SP serviceability 80% 99% 99.5% 98.5%