HECToR Monthly Report, March 2010
Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.
Dates covered: 08:00 1 March 2010 to 08:00 1 April 2010
Number of hours: 744
1: Availability
Scheduled down time: 11 hours 02 minutes.
Incidents
The following incidents were recorded:
Severity | Number |
1 | 7 |
2 | 0 |
3 | 29 |
4 | 0 |
Of the four severity levels, level 1 corresponds to a contractual failure.
All 29 SEV-3 incidents were attributed to single node failure events.
Details of severity level 1 incidents
ID | Date | Description | Length | Attribution |
Incident-3077 | 01/03/2010 | Voltage fault on module | 01:19 | Cray |
Incident-3107 | 05/03/2010 | High Speed Network Failure. Faulty Mezzanine replaced | 01:43 | Cray |
Incident-3102 | 07/03/2010 | Loss of chilled-water due to valve failure | 20:38 | Site |
Incident-3192 | 17/03/2010 | Link Error caused High Speed Network failure | 01:09 | Cray |
Incident-3257 | 30/03/2010 | Cabinet Emergency Power Off. Blower Fault | 08:44 | Cray |
Incident-3262 | 30/03/2010 | MDS node failure | 02:35 | Cray |
Incident-3267 | 30/03/2010 | Fault on 11kV distribution network | 14:18 | External |
MTBF and Serviceability
Attribution | Failures | MTBF | UDT | Serviceability |
Cray | 5 | 146 | 15:30:00 | 97.9% |
Site | 1 | 732 | 20:38:00 | 97.2% |
External | 1 | 732 | 14:18:00 | 98.0% |
Other | 0 | ∞ | 00:00:00 | 100% |
Overall | 7 | 105 | 50:26:00 | 93.1% |
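- Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is the wall-clock time in the period (744 hours), SDT is the scheduled down time and UDT is the unscheduled down time.
- Note 2: MTBF (Mean Time Between Failures), in hours, is defined as 732/Number of failures.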
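The figures in the table can be re-derived from the severity level 1 incident list using the two notes above. The following is a minimal Python sketch, not the tooling used to produce this report: the names `INCIDENTS`, `to_minutes`, `summarise` and `SUMMARY` are illustrative, and it assumes WCT is the 744 hours stated at the top of the report and SDT is the 11 hours 02 minutes of scheduled down time.

```python
from collections import defaultdict

WCT_HOURS = 744.0          # number of hours in the reporting period
SDT_HOURS = 11 + 2 / 60    # scheduled down time: 11 hours 02 minutes
MTBF_BASE_HOURS = 732.0    # numerator used in Note 2

# (attribution, outage length as "HH:MM") for each severity level 1 incident
INCIDENTS = [
    ("Cray", "01:19"),
    ("Cray", "01:43"),
    ("Site", "20:38"),
    ("Cray", "01:09"),
    ("Cray", "08:44"),
    ("Cray", "02:35"),
    ("External", "14:18"),
]


def to_minutes(hhmm):
    """Convert an "HH:MM" string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def summarise(incidents):
    """Return failures, MTBF, UDT and serviceability per attribution and overall."""
    udt_minutes = defaultdict(int)
    failures = defaultdict(int)
    for attribution, length in incidents:
        # Every incident counts towards its own attribution and the overall row.
        for key in (attribution, "Overall"):
            udt_minutes[key] += to_minutes(length)
            failures[key] += 1

    available = WCT_HOURS - SDT_HOURS  # WCT - SDT from Note 1
    return {
        key: {
            "failures": failures[key],
            "mtbf": round(MTBF_BASE_HOURS / failures[key]),
            "udt_minutes": udt_minutes[key],
            "serviceability": round(
                100 * (available - udt_minutes[key] / 60) / available, 1
            ),
        }
        for key in failures
    }


SUMMARY = summarise(INCIDENTS)
for name, row in SUMMARY.items():
    print(name, row)
```

Attributions with no incidents (the "Other" row) do not appear in the output; by the same formulas they would have an unbounded MTBF and 100% serviceability.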
Details of single node failures
Error Type | Number |
RX message header CRC Error | 11 |
RX message CRC Error | 2 |
RX packet sequence number error | 1 |
MCA bk0 error (internal Opteron cache error) | 1 |
Software/application related error (*) | 9 |
Admin down/out of config | 3 |
Heartbeat error | 2 |
(*) Note: Two of the node failures were attributed to a Portals bug, now fixed by the CLE 2.2 upgrade.
2: Courses
This information is supplied by NAG Ltd.
Title of Course | Dates | Available Places | Ordinary Attendees | Paying Attendees | CSE Staff | Total Attending |
Fortran95, Oxford | 24 - 26 March 2010 | 12 | 10 | 0 | 0 | 10 |
Advanced Computational Methods, University of Southampton | Every Thursday in March (4,11,18,25) | 20 | 14 (of which 9 MSc students) | 0 | 0 | 14 (numbers vary slightly from week to week) |
3: Quality Tokens
Date | Tokens Awarded | Comment | Consortium |
31-Mar-2010 10:30:52 | • • • • • | Hector down again (three sets of penalty tokens in total by the same user on 30-31 March 2010) | n02 |
23-Mar-2010 11:23:52 | * * * * | Amazing resource for research | e136 |
04-Mar-2010 09:23:11 | • • • | Repeated unplanned system downtime, scheduling problems and maintenance overruns | n02 |
02-Mar-2010 18:20:22 | * * * | No miracles have been performed, but responsiveness is good and the support is both friendly and helpful. | e24 |
4: Hours Worked
Group | Days worked | FTEs |
USL | 83.9 | 4.73 |
OSG | 82.2 | 4.63 |
5: Performance Metrics
Technology Provision
Description | TSL | FSL | Value |
Technology reliability | 85% | 98.5% | 97.9% |
Technology throughput | 7000 hours | 8367 hours | 8046 hours |
Capability job completion rate | 70% | 90% | 97.7% |
Technology MTBF | 100 hours | 126.4 hours | 146 hours |
Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month
Note: MTBF is calculated as 732/number of failures
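As a worked check of the two notes above, the following Python sketch evaluates both formulas. The function names are illustrative only. It assumes that UDT in the throughput formula is the overall unscheduled down time for the month (50:26) and that SDT is the 11 hours 02 minutes of scheduled down time; that choice is an inference, made because it reproduces the reported value of 8046 hours.

```python
HOURS_PER_MONTH = 732.0  # annual average number of hours in a month, per the note


def hours(hh, mm=0):
    """Convert hours and minutes to decimal hours."""
    return hh + mm / 60


def technology_throughput(udt_hours, sdt_hours):
    """Annualised throughput in hours: 12*(732-UDT-SDT)."""
    return 12 * (HOURS_PER_MONTH - udt_hours - sdt_hours)


def mtbf(failures):
    """MTBF in hours: 732/number of failures, unbounded when there are none."""
    return float("inf") if failures == 0 else HOURS_PER_MONTH / failures


# Assumed inputs: overall UDT of 50:26 and SDT of 11:02 for March 2010.
THROUGHPUT = technology_throughput(udt_hours=hours(50, 26), sdt_hours=hours(11, 2))
CRAY_MTBF = mtbf(5)  # five failures attributed to Cray

print(round(THROUGHPUT))  # 8046
print(round(CRAY_MTBF))   # 146
```

With these inputs the throughput falls between the TSL of 7000 hours and the FSL of 8367 hours, while the MTBF of 146 hours exceeds the FSL of 126.4 hours.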
Service Provision
Description | TSL | FSL | USL | Value |
Percentage of non-in-depth queries resolved within one day | 85% | 97% | 99% | 98.9% |
Number of SP FTEs | 7.3 | 8.0 | 8.7 | 9.4 |
SP Serviceability | 80% | 99% | 99.5% | 97.2% |