HECToR Monthly Report, April 2008

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 April 2008 to 08:00 1 May 2008
Number of hours: 720

1: Availability

Scheduled down time: 8 hours 51 minutes

Incidents

The following incidents were recorded:

Severity  Number
1         6
2         2
3         29
4         1

Of the four severity levels, level 1 corresponds to a contractual failure.

Details of severity level 1 incidents

ID Date Description Length (hh:mm) Attribution
Incident-180 04/04/2008 Service Node hector03 c0-0c0s2n0 failed 05:58 Cray
Incident-183 07/04/2008 OST7 failure causing Lustre collapse 07:58 Cray
Incident-190 10/04/2008 Service failure due to "portals" problem 11:24 Cray
Incident-191 12/04/2008 IO module c2-1c0s6 failure disrupts HSN 01:46 Cray
Incident-207 26/04/2008 Main PDU failure in cab c0-4 10:59 Cray
Incident-211 28/04/2008 Service close after "lustre" errors 01:16 Cray

MTBF and Serviceability

Attribution  Failures  MTBF  UDT       Serviceability
Cray         6         122   39:21:00  94.6%
Site         0         ~     00:00:00  100%
External     0         ~     00:00:00  100%
Other        0         ~     00:00:00  100%
Overall      6         122   39:21:00  94.6%
  • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is wall clock time, SDT is scheduled down time and UDT is unscheduled down time.
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
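
As a rough check, the two notes above can be worked through with this month's figures. The sketch below is illustrative only: the function names are hypothetical, and taking WCT as 732 hours (the monthly average from Note 2) rather than the 720 hours covered by the report is an assumption, chosen because it reproduces the reported 94.6%.

```python
from datetime import timedelta

# Annual average number of hours in a month, as used in Note 2.
HOURS_PER_MONTH = 732


def parse_hhmm(text):
    """Convert an 'hh:mm' incident length into a timedelta."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def to_hours(delta):
    """Express a timedelta as a floating-point number of hours."""
    return delta.total_seconds() / 3600


def serviceability(wct, sdt, udt):
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), all arguments in hours."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures):
    """Note 2: 732 / number of failures, in hours."""
    return HOURS_PER_MONTH / failures


# Lengths of the six severity level 1 incidents listed in this report.
incident_lengths = ["05:58", "07:58", "11:24", "01:46", "10:59", "01:16"]
udt = sum((parse_hhmm(t) for t in incident_lengths), timedelta())
sdt = timedelta(hours=8, minutes=51)  # scheduled down time from section 1

# Assumption: WCT = 732 hours; the report does not state which value was used.
result = serviceability(HOURS_PER_MONTH, to_hours(sdt), to_hours(udt))

print(udt)                          # total unscheduled down time
print(mtbf(len(incident_lengths)))  # 122.0
print(round(result, 1))             # 94.6
```

The incident lengths sum to 39 hours 21 minutes, which agrees with the UDT column. With WCT set to 720 hours instead, the same formula gives roughly 94.5%.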

2: Courses

This information is supplied by NAG Ltd.

3: Quality tokens

Apr 18, 2008 1:07:59 PM MR Laszlo Oroszlany x x x could not get pathscale working with help of manual/wiki
Apr 5, 2008 10:46:14 AM Dr George N Barakos * * * *  
Apr 1, 2008 12:57:50 PM MR Laszlo Oroszlany x  

4: Hours worked

Group  Days worked  FTEs
USL    83.4         4.4
OSG    64.5         3.6

5: Performance metrics

Technology Provision

Description TSL FSL Value
Technology reliability 85% 98.5% 94.6%
Technology throughput 7000 hours 8367 hours 8206 hours
Capability job completion rate 70% 90% 95%
Technology MTBF 100 hours 126.4 hours 122 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures
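
The two notes above can be sketched as a short calculation. This is a minimal illustration with hypothetical function names; the UDT (39:21) and SDT (8:51) values are taken from section 1 of this report and converted to hours.

```python
# Annual average number of hours in a month, as stated in the note.
HOURS_PER_MONTH = 732


def technology_throughput(udt_hours, sdt_hours):
    """Annualised throughput in hours: 12*(732-UDT-SDT)."""
    return 12 * (HOURS_PER_MONTH - udt_hours - sdt_hours)


def technology_mtbf(failures):
    """MTBF in hours: 732 / number of failures."""
    return HOURS_PER_MONTH / failures


udt_hours = 39 + 21 / 60  # unscheduled down time, 39:21
sdt_hours = 8 + 51 / 60   # scheduled down time, 8:51

throughput = technology_throughput(udt_hours, sdt_hours)

print(round(throughput))   # 8206
print(technology_mtbf(6))  # 122.0
```

Both results match the Value column of the Technology Provision table.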

Service Provision

Description TSL FSL USL Value
Percentage of non-in-depth queries resolved within one day 85% 97% 99% 100%
Number of SP FTEs 7.3 8.0 8.7 8.3
SP serviceability 80% 99% 99.5% 100%