HECToR Monthly Report, July 2008
Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.
Dates covered: 08:00 1 July 2008 to 08:00 1 August 2008
Number of hours: 720
Scheduled down time: 11 hours 37 minutes.
The following incidents were recorded:
Of the four severity levels, level 1 corresponds to a contractual failure.
Details of severity level 1 incidents
|Incident-291||01/07/2008||Lustre failure after loss of OSS node||05:15||Cray|
|Incident-295||03/07/2008||Lustre failure after SCSI errors||03:50||Cray|
|Incident-297||06/07/2008||HSN collapse after compute node failure||03:18||Cray|
|Incident-311||13/07/2008||Machine failure after RX packet error||01:55||Cray|
|Incident-320||21/07/2008||Fatal link error||02:19||Cray|
|Incident-324||24/07/2008||OST node failed after SCSI error||06:44||Cray|
|Incident-336||31/07/2008||OST 16 failed leading to utter collapse.||04:28||Cray|
MTBF and Serviceability
- Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is wall clock time, SDT is scheduled down time and UDT is unscheduled down time.
- Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
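The two notes above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the method used to produce the report. It assumes that the fourth column of the severity level 1 table is the downtime of each incident (hh:mm), that UDT is the sum of those downtimes, and that WCT is the 720 hours stated at the top of the report. The function names are hypothetical.

```python
def to_hours(hhmm: str) -> float:
    """Convert an 'hh:mm' string to a number of hours."""
    hours, minutes = hhmm.split(":")
    return int(hours) + int(minutes) / 60


def serviceability(wct: float, sdt: float, udt: float) -> float:
    """Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT)."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures: int, hours_per_month: float = 732) -> float:
    """Note 2: MTBF = 732 / number of failures."""
    return hours_per_month / failures


# Assumption: these are the per-incident downtimes from the
# severity level 1 table (Incident-291 to Incident-336).
incident_downtimes = ["05:15", "03:50", "03:18", "01:55",
                      "02:19", "06:44", "04:28"]

wct = 720                  # "Number of hours" as stated in the report
sdt = to_hours("11:37")    # scheduled down time
udt = sum(to_hours(d) for d in incident_downtimes)  # assumed UDT

print(f"UDT: {udt:.2f} hours")
print(f"Serviceability: {serviceability(wct, sdt, udt):.2f}%")
print(f"MTBF: {mtbf(len(incident_downtimes)):.1f} hours")
```

With seven severity level 1 incidents, 732/7 is about 104.6 hours, which is consistent with the 105 hours shown for Technology MTBF in the performance metrics table.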
This information is supplied by NAG Ltd
|Dates||Title of Course||Available places||Total attending||HECToR Users||HECToR Staff|
|14 July 2008||Introduction to HECToR||20||3||2||0|
|18 July 2008||Introduction to HECToR||12||3||2||0|
|21-23 July 2008||Tools and Techniques for Optimising Parallel Codes||12||1||0||0|
A course on Testing and Benchmarking was scheduled for 18 July, but there was no take-up.
3: Quality tokens
None set this month
4: Hours worked
5: Performance metrics
|Technology throughput||7000 hours||8367 hours||8311 hours|
|Capability job completion rate||70%||90%||91%|
|Technology MTBF||100 hours||126.4 hours||105 hours|
Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month (8784 hours in 2008, divided by 12).
Note: MTBF is calculated as 732/number of failures
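As a worked example of the throughput note, the sketch below applies the formula to this month's figures. It assumes that UDT is 27 hours 49 minutes, the sum of the durations listed in the severity level 1 incident table, and that SDT is the 11 hours 37 minutes of scheduled down time given at the top of the report. The report does not explain the factor of 12, so it is carried through as a plain constant. The function name is hypothetical.

```python
HOURS_PER_MONTH = 732  # annual average number of hours in a month


def technology_throughput(udt_hours: float, sdt_hours: float,
                          factor: int = 12) -> float:
    """Technology throughput = 12*(732-UDT-SDT), as given in the note."""
    return factor * (HOURS_PER_MONTH - udt_hours - sdt_hours)


# Assumption: UDT is the total of the severity level 1 incident
# durations (27 h 49 min); SDT is the scheduled down time (11 h 37 min).
udt = 27 + 49 / 60
sdt = 11 + 37 / 60

throughput = technology_throughput(udt, sdt)
print(f"Technology throughput: {throughput:.0f} hours")
```

Under these assumptions the result rounds to 8311 hours, the same figure that appears in the last column of the Technology throughput row.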
|Percentage of non-in-depth queries resolved within one day|
|Number of SP FTEs||7.3||8.0||8.7||8.0|