The HECToR Service is now closed and has been superseded by ARCHER.

HECToR Monthly Report, December 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 December 2009 to 08:00 1 January 2010
Number of hours: 744

1: Availability

Scheduled down time: 11 hours 40 minutes.

Incidents

The following incidents were recorded:

Severity  Number
1         4
2         1
3         18
4         0

Of the four severity levels, level 1 corresponds to a contractual failure.

All 18 SEV-3 incidents were attributed to single node failures. The one SEV-2 incident was attributed to 350 nodes being taken down proactively because of an application-related error.

Details of severity level 1 incidents

ID             Date        Description                                                Length  Attribution
Incident-2622  01/12/2009  OST problem - system locked and died                       02:57   Cray
Incident-2692  15/12/2009  Voltage fault on module c10-2c0s5                          01:23   Cray
Incident-2697  16/12/2009  Maintenance overrun due to defective UPS (see note below)  44:39   Site
Incident-2742  29/12/2009  Power lost to 2 DDN racks                                  03:44   Cray
  • Note: This was an extended unplanned outage. Every effort was made to restore the service as soon as possible. Resources from both the service and third party contractors worked onsite outwith normal hours in an effort to resolve the issue.

MTBF and Serviceability

    Attribution  Failures  MTBF  UDT       Serviceability
    Cray         3         244   08:04:00  98.9%
    Site         1         732   44:39:00  93.9%
    External     0               00:00:00  100%
    Other        0               00:00:00  100%
    Overall      4         183   52:43:00  92.8%
    • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is wall-clock time, SDT is scheduled down time and UDT is unscheduled down time.
    • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
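
As a rough check, the two notes above can be reproduced from the severity level 1 incident lengths. The following Python sketch is illustrative only. It assumes that WCT, SDT and UDT stand for wall-clock time, scheduled down time and unscheduled down time, all in hours. The helper names (`hours`, `mtbf`, `serviceability`, `summarise`) are hypothetical and not part of any HECToR tooling, and returning `None` for the MTBF when there are no failures is also an assumption.

```python
from datetime import timedelta

AVG_HOURS_PER_MONTH = 732  # divisor used in Note 2


def hours(h, m=0):
    """Convert an hh:mm duration to fractional hours."""
    return timedelta(hours=h, minutes=m).total_seconds() / 3600


def serviceability(wct, sdt, udt):
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), with all arguments in hours."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures):
    """Note 2: 732 / number of failures (None when there are no failures; assumption)."""
    return AVG_HOURS_PER_MONTH / failures if failures else None


WCT = 744.0          # number of hours in the reporting period
SDT = hours(11, 40)  # scheduled down time

# (attribution, length) of each severity level 1 incident in the report
incidents = [
    ("Cray", hours(2, 57)),
    ("Cray", hours(1, 23)),
    ("Site", hours(44, 39)),
    ("Cray", hours(3, 44)),
]


def summarise(attribution=None):
    """Return (failures, MTBF, serviceability %) for one attribution, or overall if None."""
    lengths = [length for who, length in incidents if attribution in (None, who)]
    udt = sum(lengths)  # unscheduled down time in hours
    return len(lengths), mtbf(len(lengths)), round(serviceability(WCT, SDT, udt), 1)


for name in ("Cray", "Site", None):
    print(name or "Overall", summarise(name))
```

Under these assumptions the sketch gives an MTBF of 244, 732 and 183 hours and a serviceability of 98.9%, 93.9% and 92.8% for Cray, Site and Overall respectively, matching the table.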

    Details of single node failures

    Error Type                       Number
    UME bk4 (Dimms)                  1
    UME bk 0-4 (Internal Opteron)    1
    RX message header CRC Error      13
    RX packet sequence number error  3

    2: Courses

    This information is supplied by NAG Ltd

    There were no training courses in December.

    3: Quality Tokens

    Date Tokens Awarded Comment Consortium
    04-Dec-2009 09:35:12 Quite a lot of downtime and the 12 hr limit is starting to hinder things a little! e05
    08-Dec-2009 15:22:40 * * * * Positive tokens, no user comments x01, HPC-Europa
    17-Dec-2009 11:45:50 • • • • • The availability of the HECToR service has been very poor this year and now to have such a long downtime during the academic holidays is unacceptable. n02
    18-Dec-2009 09:23:03 • • • • Negative tokens, no user comments n02
    18-Dec-2009 12:10:23 • • • • • It goes without saying that the interruption to service in Dec. 2009 has been extremely disruptive. The lack of communication regarding progress in solving the problem is also disappointing. n02

    4: Hours Worked

    Group  Days worked  FTEs
    USL    69.0         3.9
    OSG    80.5         4.5

    5: Performance Metrics

    Technology Provision

    Description                     TSL         FSL          Value
    Technology reliability          85%         98.5%        98.9%
    Technology throughput           7000 hours  8367 hours   8012 hours
    Capability job completion rate  70%         90%          100%
    Technology MTBF                 100 hours   126.4 hours  244 hours

    Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

    Note: MTBF is calculated as 732/number of failures
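
The throughput note can be sketched in Python as follows. This is a minimal illustration, not the method used to produce the report: the function name `technology_throughput` is hypothetical, UDT and SDT are assumed to be in hours, and the report does not state which UDT figure (Cray-only or overall) feeds the formula, so it is left as a parameter.

```python
AVG_HOURS_PER_MONTH = 732  # annual average number of hours in a month, per the note


def technology_throughput(udt_hours, sdt_hours):
    """12*(732-UDT-SDT): hours available in the month, scaled to a year."""
    return 12 * (AVG_HOURS_PER_MONTH - udt_hours - sdt_hours)


# Figures taken from this report, converted from hh:mm to hours.
sdt = 11 + 40 / 60          # scheduled down time, 11:40
udt_overall = 52 + 43 / 60  # overall unscheduled down time, 52:43

print(round(technology_throughput(udt_overall, sdt), 1))
```

With the overall UDT of 52:43 and the scheduled down time of 11:40, the formula gives roughly 8011 hours, within an hour of the 8012 hours in the table.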

    Service Provision

    Description                                                 TSL  FSL  USL    Value
    Percentage of non-in-depth queries resolved within one day  85%  97%  99%    100%
    Number of SP FTEs                                           7.3  8.0  8.7    8.4
    SP Serviceability                                           80%  99%  99.5%  93.9%