The HECToR Service is now closed and has been superseded by ARCHER.

HECToR Monthly Report, January 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 January 2009 to 08:00 1 February 2009
Number of hours: 744

1: Availability

Scheduled down time: 10 hours.

Incidents

The following incidents were recorded:

Severity  Number
1         5
2         3
3         23
4         1

Of the four severity levels, level 1 corresponds to a contractual failure.

Of the 23 severity level 3 incidents, 19 were attributed to single node failures.

Details of severity level 1 incidents

ID            Date        Description                                       Length (hh:mm)  Attribution
Incident-711  12/01/2009  Link Inactive c17-1c1s0s0l4 c15-1c1s0s0l1         04:56           Cray
Incident-716  13/01/2009  Voltage fault on c7-1c0s0n1 causing link failure  01:20           Cray
Incident-751  16/01/2009  WWW/Helpdesk Unavailable                          04:16           Site
Incident-801  22/01/2009  SDB node hung requiring system reload             04:08           Cray
Incident-831  29/01/2009  Link Inactive c12-1c2s0s1l5 c12-1c2s0s0l0         01:22           Cray
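
The Length column can be totalled per attribution to give the down time attributed to each party. The following is a minimal Python sketch of that arithmetic, with the rows transcribed from the table above; the function and variable names are illustrative and are not part of any HECToR tooling.

```python
from collections import defaultdict

# (incident ID, length as hh:mm, attribution), transcribed from the table above
INCIDENTS = [
    ("Incident-711", "04:56", "Cray"),
    ("Incident-716", "01:20", "Cray"),
    ("Incident-751", "04:16", "Site"),
    ("Incident-801", "04:08", "Cray"),
    ("Incident-831", "01:22", "Cray"),
]


def to_minutes(hhmm):
    """Convert an 'hh:mm' string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def format_hhmm(total_minutes):
    """Format a number of minutes as 'hh:mm'."""
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


def downtime_by_attribution(incidents):
    """Sum incident lengths (in minutes) for each attribution."""
    totals = defaultdict(int)
    for _incident_id, length, attribution in incidents:
        totals[attribution] += to_minutes(length)
    return dict(totals)


totals = downtime_by_attribution(INCIDENTS)
for attribution, minutes in totals.items():
    print(attribution, format_hhmm(minutes))           # Cray 11:46, Site 04:16
print("Overall", format_hhmm(sum(totals.values())))    # Overall 16:02
```

These totals agree with the UDT figures given in the MTBF and Serviceability table.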

MTBF and Serviceability

Attribution  Failures  MTBF (hours)  UDT (hh:mm:ss)  Serviceability
Cray         4         183           11:46:00        98.4%
Site         1         732           04:16:00        99.4%
External     0         -             00:00:00        100%
Other        0         -             00:00:00        100%
Overall      5         146           16:02:00        97.8%
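  • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is the wall-clock time in the period, SDT the scheduled down time and UDT the unscheduled down time.
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/number of failures, where 732 is the annual average number of hours in a month.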
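
As a check on the two notes, the figures in the table can be reproduced with a few lines of Python. This is a sketch only: it assumes that WCT is the 744 hours covered by the report and that SDT is the 10 hours of scheduled down time stated under Availability; the function names are illustrative.

```python
WCT_HOURS = 744.0        # assumed: the "Number of hours" stated in the report
SDT_HOURS = 10.0         # assumed: the scheduled down time stated in the report
MTBF_BASE_HOURS = 732.0  # constant used in Note 2


def serviceability(udt_hours, wct=WCT_HOURS, sdt=SDT_HOURS):
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), as a percentage."""
    return 100.0 * (wct - sdt - udt_hours) / (wct - sdt)


def mtbf(failures, base=MTBF_BASE_HOURS):
    """Note 2: 732/number of failures; undefined (None) when there are none."""
    if failures == 0:
        return None
    return base / failures


cray_udt = 11 + 46 / 60     # 11:46:00 expressed in hours
site_udt = 4 + 16 / 60      # 04:16:00
overall_udt = 16 + 2 / 60   # 16:02:00

print(round(serviceability(cray_udt), 1))     # 98.4
print(round(serviceability(site_udt), 1))     # 99.4
print(round(serviceability(overall_udt), 1))  # 97.8
print(mtbf(4), mtbf(1), round(mtbf(5)))       # 183.0 732.0 146
```

Under these assumptions the computed values match every row of the table, which suggests the reading of WCT and SDT is the intended one.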

Details of XT single node failures

Error Type                              Number
RX message header CRC Error             8
MCA bk0 error (Internal Opteron)        5
UME bk0/bk4 (DIMMs)                     3
s/w problem "general protection fault"  2 (1 single node, 1 double node)

The software problems were related to a Casino job; this fault had not been seen before.

2: Courses

This information is supplied by NAG Ltd.

Title of Course                                             Dates               Available places  Ordinary attendees  Paying attendees  CSE Staff  Total attending
Parallel Programming with MPI, Imperial College London      5 - 7 January 2009  16                13                  0                 0          13
OpenMP and Mixed-Mode Programming, Imperial College London  8 - 9 January 2009  16                12                  0                 0          12

3: Quality tokens

None set this month.

4: Hours worked

Group  Days worked  FTEs
USL    71.1         4.0
OSG    74.0         4.2

5: Performance metrics

Technology Provision

Description                     TSL         FSL          Value
Technology reliability          85%         98.5%        97.8%
Technology throughput           7000 hours  8367 hours   8472 hours
Capability job completion rate  70%         90%          N/A (*)
Technology MTBF                 100 hours   126.4 hours  146 hours

(*) No Capability jobs (as per the contract definition) ran in January.
Two jobs ran using more than 2265 cores.

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures.
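
The throughput figure can be reproduced from the note with the down times reported under Availability. The Python sketch below assumes UDT is the overall 16:02:00 and SDT is the 10 hours of scheduled down time; the factor 12 is taken from the note as given, and the names used are illustrative.

```python
AVG_MONTH_HOURS = 732.0  # annual average number of hours in a month (from the note)
FACTOR = 12              # multiplier from the note, used as given


def technology_throughput(udt_hours, sdt_hours):
    """Technology throughput in hours: 12*(732-UDT-SDT)."""
    return FACTOR * (AVG_MONTH_HOURS - udt_hours - sdt_hours)


overall_udt = 16 + 2 / 60  # 16:02:00 expressed in hours
scheduled = 10.0           # scheduled down time in hours

throughput = technology_throughput(overall_udt, scheduled)
print(round(throughput))            # 8472
print(round(throughput) >= 8367)    # True: at or above the FSL value
```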

Service Provision

Description                                                 TSL   FSL   USL    Value
Percentage of non-in-depth queries resolved within one day  85%   97%   99%    100%
Number of SP FTEs                                           7.3   8.0   8.7    8.2
SP serviceability                                           80%   99%   99.5%  100%