HECToR Monthly Report, January 2009
Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.
Dates covered: 08:00 1 January 2009 to 08:00 1 February 2009
Number of hours: 744
Scheduled down time: 10 hours.
The following incidents were recorded:
Of the four severity levels, level 1 corresponds to a contractual failure.
Out of the 23 SEV-3 Incidents, 19 were attributed to single node failures.
Details of severity level 1 incidents
|Incident-711||12/01/2009||Link Inactive c17-1c1s0s0l4 c15-1c1s0s0l1||04:56||Cray|
|Incident-716||13/01/2009||Voltage fault on c7-1c0s0n1 causing link failure||01:20||Cray|
|Incident-801||22/01/2009||SDB node hung requiring system reload||04:08||Cray|
|Incident-831||29/01/2009||Link Inactive c12-1c2s0s1l5 c12-1c2s0s0l0||01:22||Cray|
MTBF and Serviceability
- Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT)
- Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
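The two definitions in the notes above can be sketched in a few lines of Python. This is an illustration only, not the tooling used to produce the report. It assumes that WCT, SDT and UDT stand for wall-clock time, scheduled down time and unscheduled down time, all in hours; the function names and the UDT and failure-count values are hypothetical, while the 744 hours and 10 hours of scheduled down time are taken from this report.

```python
def serviceability_percent(wct, sdt, udt):
    """Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), all inputs in hours."""
    scheduled_hours = wct - sdt
    if scheduled_hours <= 0:
        raise ValueError("WCT must exceed SDT")
    return 100.0 * (scheduled_hours - udt) / scheduled_hours


def mtbf_hours(failures, hours_per_month=732.0):
    """MTBF = 732 / number of failures, as defined in Note 2."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when there are no failures")
    return hours_per_month / failures


if __name__ == "__main__":
    wct = 744.0  # hours in the reporting period (from this report)
    sdt = 10.0   # scheduled down time (from this report)
    udt = 12.0   # hypothetical unscheduled down time, for illustration only
    print(f"Serviceability: {serviceability_percent(wct, sdt, udt):.2f}%")
    print(f"MTBF for a hypothetical 4 failures: {mtbf_hours(4):.1f} hours")
```

Note that the MTBF definition uses the fixed 732-hour monthly average rather than the 744 hours actually covered, so the two measures are based on different period lengths.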
Details of XT single node failures
|RX message header CRC Error||8|
|MCA bk0 error (Internal Opteron)||5|
|UME bk0/bk4 (Dimms)||3|
|s/w problem "general protection fault"||2 (1 single node, 1 double node)|
The software problems were related to a Casino job. This had not been seen before.
2: Courses
This information is supplied by NAG Ltd.
|Title of Course||Dates||Available places||Ordinary attendees||Paying attendees||CSE Staff||Total attending|
|Parallel Programming with MPI, Imperial College London||5 - 7 January 2009||16||13||0||0||13|
|OpenMP and Mixed-Mode Programming, Imperial College London||8 - 9 January 2009||16||12||0||0||12|
3: Quality tokens
None set this month.
4: Hours worked
5: Performance metrics
|Technology throughput||7000 hours||8367 hours||8472 hours|
|Capability job completion rate||70%||90%||N/A (*)|
|Technology MTBF||100 hours||126.4 hours||146 hours|
(*) No Capability jobs (as per the contract definition) ran in January.
There were 2 jobs which ran using more than 2265 cores.
Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.
Note: MTBF is calculated as 732/number of failures
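As a rough illustration of the throughput note above, the formula can be written as a small Python function. This is a sketch under stated assumptions, not the calculation actually used for the metrics table: UDT and SDT are taken to be unscheduled and scheduled down time in hours, the function name is hypothetical, and the UDT value in the example is invented for illustration.

```python
def technology_throughput(udt, sdt, hours_per_month=732.0, factor=12.0):
    """Technology throughput = 12*(732-UDT-SDT), with UDT and SDT in hours."""
    available_hours = hours_per_month - udt - sdt
    if available_hours < 0:
        raise ValueError("Down time exceeds the hours in the month")
    return factor * available_hours


if __name__ == "__main__":
    sdt = 10.0  # scheduled down time (from this report)
    udt = 12.0  # hypothetical unscheduled down time, for illustration only
    print(f"Technology throughput: {technology_throughput(udt, sdt):.0f} hours")
```

With no down time at all, the formula gives an upper bound of 12*732 = 8784 hours.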
|Percentage of non-in-depth queries resolved within one day|
|Number of SP FTEs||7.3||8.0||8.7||8.2|