HECToR

HECToR Monthly Report, February 2009

Information on the utilisation, disk allocations, slowdowns and helpdesk statistics can be found in the associated SAFE monthly report.

Dates covered: 08:00 1 February 2009 to 08:00 1 March 2009
Number of hours: 672

Scheduled down time: 15.8 hours.

Incidents

The following incidents were recorded:

Of the four severity levels, level 1 corresponds to a contractual failure.

Out of the 25 SEV-3 Incidents, 24 were attributed to single node failures.

Details of severity level 1 incidents

ID	Date	Description	Length	Attribution
Incident-856	03/02/2009	Suspected blower failure in cab 0-0	01:32	Cray
Incident-891	09/02/2009	System failure after SMW reboot	02:13	Cray
Incident-916	10/02/2009	Lustre problems cause service collapse	03:50	Cray
Incident-946	16/02/2009	Voltage fault on module takes out HSN	03:12	Cray
Incident-1001	25/02/2009	Link inactive error - HSN collapses	03:22	Cray
Incident-1011	26/02/2009	Plant problems	09:41	Site
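
As a cross-check on the table above, the incident lengths can be totalled per attribution. The following is only a sketch: the function names (`to_minutes`, `downtime_by_attribution`, `format_hhmm`) are illustrative and not part of any HECToR or SAFE tooling, and the Length column is assumed to be in HH:MM.

```python
from collections import defaultdict

# (incident ID, length as HH:MM, attribution), copied from the table above.
INCIDENTS = [
    ("Incident-856", "01:32", "Cray"),
    ("Incident-891", "02:13", "Cray"),
    ("Incident-916", "03:50", "Cray"),
    ("Incident-946", "03:12", "Cray"),
    ("Incident-1001", "03:22", "Cray"),
    ("Incident-1011", "09:41", "Site"),
]


def to_minutes(length: str) -> int:
    """Convert an HH:MM string (assumed format) to whole minutes."""
    hours, minutes = length.split(":")
    return int(hours) * 60 + int(minutes)


def downtime_by_attribution(incidents):
    """Sum incident lengths in minutes, grouped by attribution."""
    totals = defaultdict(int)
    for _incident_id, length, attribution in incidents:
        totals[attribution] += to_minutes(length)
    return dict(totals)


def format_hhmm(minutes: int) -> str:
    """Render a minute count back as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


totals = downtime_by_attribution(INCIDENTS)
for attribution, minutes in sorted(totals.items()):
    print(attribution, format_hhmm(minutes))
print("Total", format_hhmm(sum(totals.values())))
```

On the six incidents listed, this gives 14:09 attributed to Cray and 09:41 attributed to Site, 23:50 in total.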

MTBF and Serviceability

Error Type	Number
RX message header CRC Error	7
MCA bk0/bk2 error (Internal Opteron)	5
UME bk0/bk4 (Dimms)	7
Heartbeat error	2
Coldstart error (node failing to start after reboot)	3

This information is supplied by NAG Ltd

Title of Course	Dates	Available places	Ordinary attendees	Paying attendees	CSE Staff	Total attending
Introduction to HECToR, NAG Oxford	5 February 2009	12	1	0	1	2
Tools and Techniques for Optimising Parallel Codes, NAG Oxford	9 - 11 February 2009	12	3	0	3	6
Fortran 95, NAG Manchester	24 - 26 February 2009	30	15	0	0	15

Date	Token	Comment
02-Feb-2009 10:52:56	*****	Excellent dedicated team. Very professional and expert service.
09-Feb-2009 10:01:48	*****	I have noticed that Hector has been down every 4-7 days since Christmas. This poor service is exacerbated by fortnightly maintenance sessions, planned to take place during peak user hours - who thought of that! The service has been detrimental to my work.
14-Feb-2009 00:13:38	*****
17-Feb-2009 22:36:10	*****	The support team of Hector is really good.
26-Feb-2009 10:12:12	*****	Hector seems to be even less reliable than before! I noted Hector was down at 6pm yesterday, 8pm yesterday, still down at 8am this morning and now we are told it may be another 6 hours before service is restored. Could engineers be called in earlier?

Technology Provision

Description	TSL	FSL	Value
Technology reliability	85%	98.5%	97.8%
Technology throughput	7000 hours	8367 hours	8309 hours
Capability job completion rate	70%	90%	100%
Technology MTBF	100 hours	126.4 hours	146 hours

Note: Technology throughput is calculated as 12 * (732 - UDT - SDT), where UDT and SDT are the unscheduled and scheduled down time in hours and 732 is the annual average number of hours in a month.
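
A minimal sketch of the throughput formula in the note above. It assumes that UDT is the sum of the six severity level 1 incident lengths listed earlier (23 hours 50 minutes) and that SDT is the 15.8 hours of scheduled down time; the report does not state which figures were actually used, and the function name is illustrative.

```python
HOURS_PER_MONTH = 732  # annual average number of hours in a month, per the note


def technology_throughput(udt_hours: float, sdt_hours: float,
                          hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Annualised throughput in hours: 12 * (hours_per_month - UDT - SDT)."""
    return 12 * (hours_per_month - udt_hours - sdt_hours)


# Assumption: UDT is the total length of all six severity level 1 incidents.
udt = 23 + 50 / 60   # 23 h 50 min
sdt = 15.8           # scheduled down time reported for the month

print(round(technology_throughput(udt, sdt), 1))
```

Under these assumptions the result is roughly 8308 hours, within about an hour of the 8309 hours reported in the table; the small difference may come from rounding in the report.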

Note: MTBF is calculated as 732 / (number of failures).
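
The MTBF note can be worked through in the same way. This sketch assumes that only the five Cray-attributed severity level 1 incidents count as technology failures, which is an inference from the reported value rather than something the report states; the function name is illustrative.

```python
HOURS_PER_MONTH = 732  # annual average number of hours in a month, per the note


def mtbf(failures: int, hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Mean time between failures in hours: hours_per_month / failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when there are no failures")
    return hours_per_month / failures


# Assumption: the site-attributed incident is excluded, leaving five failures.
print(mtbf(5))
```

With five failures this gives 146.4 hours, consistent with the 146 hours shown in the Technology Provision table.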

Service Provision

Description	TSL	FSL	USL	Value
Percentage of non-in-depth queries resolved within one day	85%	97%	99%	100%
Number of SP FTEs	7.3	8.0	8.7	8.2
SP serviceability	80%	99%	99.5%	98.5%