The HECToR Service is now closed and has been superseded by ARCHER.

HECToR Monthly Report, April 2010

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 April 2010 to 08:00 1 May 2010
Number of hours: 720

1: Availability

Scheduled down time: 15 hours 00 minutes.

Incidents

The following incidents were recorded:

Severity  Number
1         10
2         4
3         14
4         0

Of the four severity levels, level 1 corresponds to a contractual failure.

All 14 severity level 3 incidents were attributed to single node failure events.

Details of severity level 1 incidents

ID Date Description Length Attribution
Incident-3306 07/04/2010 SAFE and Website Unavailable 00:43 Cray
Incident-3316 08/04/2010 Maintenance overrun 04:07 Cray
Incident-3311 09/04/2010 Link Inactive 01:58 Cray
Incident-3331 10/04/2010 Cabinet PDU failure 19:35 Cray
Incident-3346 14/04/2010 Maintenance session overrun 00:53 Cray
Incident-3356 15/04/2010 Voltage Fault on Node 02:30 Cray
Incident-3361 17/04/2010 Cabinet Sensor Failure 16:32 Cray
Incident-3371 20/04/2010 Emergency maintenance - System restart 02:39 Cray
Incident-3416 26/04/2010 Emergency maintenance - Revert esfs software change 01:30 Cray
Incident-3421 26/04/2010 Emergency maint - Job Slowdown Analysis and Testing 84:32 Cray
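
As a cross-check, the Length column above can be summed. The following is a minimal Python sketch, assuming each Length value is in hours:minutes; the helper name to_minutes is illustrative and not part of the report.

```python
# Lengths of the ten severity level 1 incidents, copied from the table above.
# Assumption: each value is formatted as hours:minutes.
lengths = [
    "00:43", "04:07", "01:58", "19:35", "00:53",
    "02:30", "16:32", "02:39", "01:30", "84:32",
]


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


total_minutes = sum(to_minutes(length) for length in lengths)
total_hhmm = f"{total_minutes // 60}:{total_minutes % 60:02d}"
print(total_hhmm)  # 134:59
```

The total agrees with the 134:59:00 of unscheduled down time (UDT) attributed to Cray under MTBF and Serviceability.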

MTBF and Serviceability

Attribution  Failures  MTBF  UDT        Serviceability
Cray         10        73    134:59:00  80.9%
Site         0         0     0:00:00    100%
External     0         0     0:00:00    100%
Other        0         0     0:00:00    100%
Overall      10        73    134:59:00  80.9%
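  • Note 1: Serviceability % = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is the wall-clock time in the period, SDT is the scheduled down time and UDT is the unscheduled down time.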
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
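
The two notes can be checked against this month's figures. The sketch below is illustrative only: it assumes WCT is the 720 hours covered by the report, SDT is the 15 hours of scheduled down time, and UDT is the 134:59:00 shown for Cray. The function names are hypothetical.

```python
# Figures from this report, held in minutes to avoid rounding issues.
# Assumption: WCT = wall-clock time, SDT = scheduled down time,
# UDT = unscheduled down time.
WCT_MIN = 720 * 60        # "Number of hours: 720"
SDT_MIN = 15 * 60         # "Scheduled down time: 15 hours 00 minutes"
UDT_MIN = 134 * 60 + 59   # UDT of 134:59:00
FAILURES = 10             # failures attributed to Cray


def serviceability(wct: float, sdt: float, udt: float) -> float:
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), as a percentage."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures: int, hours_per_month: float = 732) -> float:
    """Note 2: 732 divided by the number of failures, in hours."""
    return hours_per_month / failures


service_pct = serviceability(WCT_MIN, SDT_MIN, UDT_MIN)
mtbf_hours = mtbf(FAILURES)
print(round(service_pct, 1))  # 80.9
print(mtbf_hours)             # 73.2
```

Both results are consistent with the table: 80.9% serviceability and an MTBF of 73 hours once rounded to a whole number.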

Details of single node failures

Error Type Number
RX message header CRC Error 5
RX message CRC Error 2
RX packet sequence number error 5
Software/application related error 1
Heartbeat error 1

2: Courses

This information is supplied by NAG Ltd.
Title of Course Dates Available Places Ordinary Attendees Paying Attendees CSE Staff Total Attending
Parallel Programming with MPI, University of Nottingham 12 - 14 April 2010 30 19 0 0 19
OpenMP, University of Nottingham 15 - 16 April 2010 30 11 0 0 11
Best Practice in HPC Software Development, Imperial College, London 12 - 16 April 2010 30 9 0 0 9
Advanced Computational Methods, University of Southampton Every Thursday in April (1,8,15,22,29) 20 9 MSc students (i.e. those being assessed) + 5 other attendees 0 0 14 (The numbers vary a bit from week to week)

3: Quality Tokens

Date Tokens Awarded Comment Consortium
08-Apr-2010 09:01:28 • • • • • Hector just crashed. Please see note (*) n02
23-Apr-2010 14:42:00 • • • No user comment n02
26-Apr-2010 22:15:41 • • • Using the users to beta test a new form of the file system is poor, and a waste of resource n02
Note (*) : the user token was unrelated to a service failure. Planned maintenance had just started.

4: Hours Worked

Group  Days worked  FTEs
USL    79.4         4.47
OSG    73.3         4.12

5: Performance Metrics

Technology Provision

Description TSL FSL Value
Technology reliability 85% 98.5% 80.9%
Technology throughput 7000 hours 8367 hours 6984.2 hours
Capability job completion rate 70% 90% 97.4%
Technology MTBF 100 hours 126.4 hours 73 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures
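
The throughput note can be worked through with this month's down time. This is a minimal sketch, assuming UDT is 134:59:00 and SDT is 15 hours as reported under Availability; the function and constant names are hypothetical.

```python
# Inputs from this report, held in minutes so the arithmetic stays exact.
HOURS_PER_MONTH = 732     # annual average number of hours in a month
SDT_MIN = 15 * 60         # scheduled down time: 15 hours 00 minutes
UDT_MIN = 134 * 60 + 59   # unscheduled down time: 134:59:00
TSL_HOURS = 7000          # value in the TSL column for technology throughput


def technology_throughput_hours(udt_min: int, sdt_min: int) -> float:
    """12*(732-UDT-SDT), with down time given in minutes and the result in hours."""
    return 12 * (HOURS_PER_MONTH * 60 - udt_min - sdt_min) / 60


throughput = technology_throughput_hours(UDT_MIN, SDT_MIN)
print(round(throughput, 1))     # 6984.2
print(throughput >= TSL_HOURS)  # False
```

The result reproduces the reported value of 6984.2 hours, which is below the 7000 hours in the TSL column.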

Service Provision

Description TSL FSL USL Value
Percentage of non-in-depth queries resolved within one day 85% 97% 99% 98.7%
Number of SP FTEs 7.3 8.0 8.7 8.6
SP Serviceability 80% 99% 99.5% 100%