The HECToR Service is now closed and has been superseded by ARCHER.

HECToR Monthly Report, November 2008

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 November 2008 to 08:00 1 December 2008
Number of hours: 720

1: Availability

Scheduled down time: 11 hours 41 minutes.

Incidents

The following incidents were recorded:

Severity Number
1 3
2 0
3 22
4 0

Of the four severity levels, level 1 corresponds to a contractual failure.

All of the above SEV-3 Incidents were attributed to single node failures.

Details of severity level 1 incidents

ID Date Description Length Attribution
Incident-464 12/11/2008 System failure after lustre errors 02:33 Cray
Incident-477 25/11/2008 Link inactive problem 01:46 Cray
Incident-480 27/11/2008 Voltage fault on mezz card 01:25 Cray

MTBF and Serviceability

Attribution Failures MTBF UDT Serviceability
Cray 3 244 05:44:00 99.2%
Site 0 - 00:00:00 100%
External 0 - 00:00:00 100%
Other 0 - 00:00:00 100%
Overall 3 244 05:44:00 99.2%
  • Note 1: Serviceability % = 100*(WCT-SDT-UDT)/(WCT-SDT)
  • Note 2: MTBF (Mean Time Between Failures) is defined as 732/Number of failures.
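
As a rough cross-check of the two notes above, the following Python sketch recomputes the Cray serviceability and MTBF from figures quoted in this report. It assumes that WCT is the 720 hours of the reporting period, that SDT is the scheduled down time of 11 hours 41 minutes, and that UDT is the sum of the three severity level 1 incident lengths. The function and variable names are illustrative only and are not part of any HECToR tooling.

```python
from datetime import timedelta

# Figures quoted in this report (assumed mapping to WCT, SDT and UDT).
wct = timedelta(hours=720)              # hours in the reporting period
sdt = timedelta(hours=11, minutes=41)   # scheduled down time

# Lengths of the three severity level 1 incidents (hh:mm).
incident_lengths = [
    timedelta(hours=2, minutes=33),     # Incident-464
    timedelta(hours=1, minutes=46),     # Incident-477
    timedelta(hours=1, minutes=25),     # Incident-480
]
udt = sum(incident_lengths, timedelta())  # unscheduled down time


def serviceability(wct, sdt, udt):
    """Note 1: 100*(WCT-SDT-UDT)/(WCT-SDT), returned as a percentage."""
    return 100 * (wct - sdt - udt) / (wct - sdt)


def mtbf_hours(failures, hours_per_month=732):
    """Note 2: 732 divided by the number of failures."""
    return hours_per_month / failures


print(udt)                                        # total unscheduled down time
print(round(serviceability(wct, sdt, udt), 1))    # percentage, one decimal place
print(mtbf_hours(3))                              # hours
```

Under these assumptions the sketch gives the UDT, serviceability and MTBF values shown in the Cray row of the table. MTBF is left undefined for attributions with zero failures, since the formula would divide by zero.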

Details of XT single node failures

Error Type Number
RX message header CRC Error 11
MCA bk0 error (Internal Opteron) 2
UME bk0/bk4 (Dimms) 6
Seastar trap 1 error (PPC data cache parity error) 1
Node did not come back after reboot 1

Details of X2 single node failures

Error Type Number
Software bug 1 (triple nodes)

2: Courses

This information is supplied by NAG Ltd.

Title of Course Dates Available places Ordinary attendees Paying attendees CSE Staff Total attending
Introduction to MPI, NAG Manchester 11-12 November 2008 16 11 3 1 15
Advanced MPI, NAG Manchester 13 November 2008 16 7 2 1 10
MPI One-Sided Communication and MPI-IO, NAG Manchester (1) 14 November 2008 16 0 0 0 0
OpenMP and Mixed-Mode Programming, NAG Manchester 25-26 November 2008 16 6 1 1 8

(1) Cancelled. Two people registered but later withdrew.

3: Quality tokens

16 November 2008 21:39:59 Mr Andrea Spitaleri **** I am very happy so far with the hector service.
26 November 2008 14:05:31 Mr Hristo Iliev **** How about some 24hrs parallel queues?

4: Hours worked

Group Days worked FTEs
USL 65.1 3.7
OSG 76.3 4.3

5: Performance metrics

Technology Provision

Description TSL FSL Value
Technology reliability 85% 98.5% 99.2%
Technology throughput 7000 hours 8367 hours 8575 hours
Capability job completion rate 70% 90% 92%
Technology MTBF 100 hours 126.4 hours 244 hours

Note: Technology throughput is calculated as 12*(732-UDT-SDT), where 732 is the annual average number of hours in a month.

Note: MTBF is calculated as 732/number of failures
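
The two notes above can be checked against this month's figures with a short Python sketch. It assumes UDT is the 5 hours 44 minutes of unscheduled down time listed in the MTBF and Serviceability table and SDT is the 11 hours 41 minutes of scheduled down time from section 1. The names used are illustrative only.

```python
AVG_HOURS_PER_MONTH = 732  # annual average number of hours in a month


def technology_throughput(udt_hours, sdt_hours):
    """Throughput in hours: 12*(732-UDT-SDT)."""
    return 12 * (AVG_HOURS_PER_MONTH - udt_hours - sdt_hours)


def technology_mtbf(failures):
    """MTBF in hours: 732/number of failures."""
    return AVG_HOURS_PER_MONTH / failures


# Down times for November 2008, converted from hh:mm to decimal hours.
udt_hours = 5 + 44 / 60    # unscheduled down time
sdt_hours = 11 + 41 / 60   # scheduled down time

throughput = technology_throughput(udt_hours, sdt_hours)
mtbf = technology_mtbf(3)  # three severity level 1 incidents

print(round(throughput))   # hours
print(mtbf)                # hours
```

With these inputs the sketch gives the throughput and MTBF values reported in the Technology Provision table.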

Service Provision

Description TSL FSL USL Value
Percentage of non-in-depth queries resolved within one day 85% 97% 99% 100%
Number of SP FTEs 7.3 8.0 8.7 8.0
SP serviceability 80% 99% 99.5% 100%