HECToR Monthly Report, April 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 April 2009 to 08:00 1 May 2009
Number of hours: 720

1: Availability

Scheduled down time: 14.5 hours.

Incidents

The following incidents were recorded:

Severity  Number
1         3
2         1
3         27
4         0

Of the four severity levels, level 1 corresponds to a contractual failure.

All 27 SEV-3 incidents were attributed to single node failures.

Details of severity level 1 incidents

ID             Date        Description                                Length  Attribution
Incident-1236  03/04/2009  Link Inactive c18-0c1s1s3l1 c16-0c1s1s3l4  02:20   Cray
Incident-1276  07/04/2009  Overload of Lustre                         02:07   Cray
Incident-1361  22/04/2009  Blower failure                             01:52   Cray

MTBF and Serviceability

Attribution  Failures  MTBF  UDT       Serviceability
Cray         3         244   06:19:00  99.1%
Site         0         -     00:00:00  100%
External     0         -     00:00:00  100%
Other        0         -     00:00:00  100%
Overall      3         244   06:19:00  99.1%
  • Note 1: Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), where WCT is wall clock time, SDT is scheduled down time and UDT is unscheduled down time.
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
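
As a worked check of the two notes above, the following Python sketch recomputes the Cray figures from numbers quoted in this report. It assumes that WCT is the 720 hours of the reporting period and that UDT is the sum of the three severity level 1 incident lengths (which does add up to the 06:19:00 shown in the table); the function and variable names are illustrative only.

```python
from datetime import timedelta

# Figures quoted in this report (April 2009).
WCT_HOURS = 720.0        # assumed: wall clock hours in the reporting period
SDT_HOURS = 14.5         # scheduled down time
HOURS_PER_MONTH = 732.0  # constant used in the report's MTBF definition

# Lengths of the three severity level 1 incidents, as (hours, minutes).
SEV1_LENGTHS = [(2, 20), (2, 7), (1, 52)]


def unscheduled_down_time(lengths):
    """Sum incident lengths into one timedelta (assumed to be the UDT)."""
    return sum((timedelta(hours=h, minutes=m) for h, m in lengths), timedelta())


def serviceability(wct, sdt, udt):
    """Serviceability% = 100*(WCT-SDT-UDT)/(WCT-SDT), all values in hours."""
    return 100.0 * (wct - sdt - udt) / (wct - sdt)


def mtbf(failures):
    """MTBF = 732 / number of failures; None when there were no failures."""
    return HOURS_PER_MONTH / failures if failures else None


udt = unscheduled_down_time(SEV1_LENGTHS)
udt_hours = udt.total_seconds() / 3600.0

print(udt)                                                        # 6:19:00
print(round(serviceability(WCT_HOURS, SDT_HOURS, udt_hours), 1))  # 99.1
print(mtbf(len(SEV1_LENGTHS)))                                    # 244.0
```

Under these assumptions the results agree with the Cray row of the table: a UDT of 06:19:00, a serviceability of 99.1% and an MTBF of 244 hours.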

Details of XT single node failures

Error Type                                              Number
RX message header CRC Error                             11
MCA bk0/bk2 error (Internal Opteron)                    5
UME bk0/bk4 (Dimms)                                     3
Coldstart error (node failing to restart after reboot)  2
Heartbeat error                                         1
Kernel panic "tx DMA vector invalid"                    1
Admin down                                              4

Note: During "coldstart" and "admin down" errors, no user jobs were affected.

2: Courses

This information is supplied by NAG Ltd.

Title of Course                                                    Dates               Available places  Ordinary attendees  Paying attendees  CSE Staff  Total attending
Programming the X2 Vector System, NAG Oxford                       2 - 3 April 2009    12                5                   0                 2          7
Introduction to High Performance Computing, University of Warwick  20 - 24 April 2009  12                8                   2                 0          10

3: Quality tokens

21-Apr-2009 16:07:48 * * * * * positive feedback from e42
07-Apr-2009 11:40:08 •  •  •  •  Our research group is beginning to suspect your machine status notification system is failing. Hector has been down several times this month without apparent action being taken until we have sent an email to helpdesk, often while web says Hector 'open'.
  • Note: Automated monitoring and alerting ensures that staff are engaged as soon as a system fault occurs. There may be instances when the web status still shows Open whilst initial investigations are underway. The status is automatically updated as and when specific services on HECToR are halted or restarted. Time is also allowed for the support staff to attempt to resolve any problems before mailing all users.
  • Additional details on the status of the X2 will be added to the website, as this is not currently reported.

4: Hours worked

Group  Days worked  FTEs
USL    91.2         4.7
OSG    73.0         4.1

5: Performance metrics

Technology Provision

Description                     TSL         FSL          Value
Technology reliability          85%         98.5%        99.1%
Technology throughput           7000 hours  8367 hours   8535 hours
Capability job completion rate  70%         90%          100%
Technology MTBF                 100 hours   126.4 hours  244 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures.
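
The throughput formula in the first note can be evaluated directly. The Python sketch below is illustrative only: it assumes the UDT of 06:19:00 from the serviceability table and the 14.5 hours of scheduled down time reported under Availability, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 732.0  # annual average number of hours in a month


def technology_throughput(udt_hours, sdt_hours):
    """Annualised throughput in hours: 12 * (732 - UDT - SDT)."""
    return 12.0 * (HOURS_PER_MONTH - udt_hours - sdt_hours)


udt_hours = 6 + 19 / 60  # assumed: UDT of 06:19:00 from the serviceability table
sdt_hours = 14.5         # scheduled down time for the month

value = technology_throughput(udt_hours, sdt_hours)
print(f"{value:.1f} hours")  # 8534.2 hours
```

With these inputs the formula gives about 8534 hours, within an hour of the 8535 hours listed in the table. The report does not state how its inputs were rounded, so the small difference is left unexplained here.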

Service Provision

Description                                                 TSL  FSL  USL    Value
Percentage of non-in-depth queries resolved within one day  85%  97%  99%    98.8%
Number of SP FTEs                                           7.3  8.0  8.7    8.8
SP serviceability                                           80%  99%  99.5%  100%