HECToR XT6 Upgrade Roadmap
The HECToR service will be upgraded this spring to include a 20-cabinet Cray XT6.
The system will comprise 44,544 cores delivering an estimated peak performance of 338 TFlops.
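As a rough sanity check on these figures, the estimated peak can be divided by the core count to give the per-core peak it implies. The sketch below uses only the two numbers quoted above; the function and constant names are illustrative, and the result is an implied average, not a published specification.

```python
# Figures quoted in this roadmap for the XT6 system.
CORES = 44_544
PEAK_TFLOPS = 338.0


def per_core_gflops(peak_tflops, cores):
    """Return the per-core peak in GFlops implied by a system-wide peak in TFlops."""
    return peak_tflops * 1000.0 / cores


if __name__ == "__main__":
    # Prints the implied per-core peak to two decimal places.
    print(f"{per_core_gflops(PEAK_TFLOPS, CORES):.2f} GFlops per core")
```

With the figures above this works out at roughly 7.59 GFlops per core.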
The Cray XT4 will remain the main service at this stage. Later in 2010 we will upgrade to the Gemini network interconnect, at which stage the combination of XT6 and Gemini will form the main service.
This initial upgrade to include the XT6 will be broken down into four main stages of work as outlined below.
Stage 1 - 7th April: Operating System Upgrade
The HECToR operating system will be upgraded to CLE2.2. This is currently planned to take place on Weds 7th April.
We will be contacting key users to start testing on our test server later this week. Unlike the previous operating system upgrade (which had an associated major MPT change), this upgrade does not require a recompile. However, recompiling may help to avoid performance issues.
Stage 2 - 21st April: /work Filesystem Upgrade
In readiness for the Phase 2B upgrade, the /work filesystem must be upgraded so that both the current XT4 system and the new XT6 can access it.
As such, all user data will be copied from the existing /work filesystem to a new external /work filesystem.
To avoid extensive downtime for all users, this work will be phased project by project, depending on the volume of data each project holds on /work. We estimate that 94% of projects will be covered in a single maintenance slot on Weds 21st April. The remaining 6% will be handled on a case-by-case basis because of the large volumes of data they hold. We will be contacting all PIs to advise when their projects will be affected and how this will impact their users. The aim is to have completed this task for all projects by 5th May.
Note - There will be a period during this work when quota changes via SAFE are unavailable. Updated disk usage reports will also be unavailable while data is being copied.
Temporary work space will also be available on the new filesystem for projects to use while their original data is being copied, and we will issue full instructions on how to use it.
In advance of this, if you are currently keeping data on /work which you do not require, please delete it. If you have data which can be archived, please do so. If your project does not currently have an archive quota and you have data to archive, please contact the helpdesk and we will set this up for you. By phasing the work project by project we are doing our best to limit the impact of this change, but the less data there is, the quicker the process will be.
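To help find candidates for deletion or archiving, the following is a minimal sketch of a script that lists the immediate contents of a directory, largest first. It assumes a recent Python 3 is available; the function names are illustrative and the directory is whatever path you pass on the command line. It sums file sizes as reported by the filesystem and does not follow symbolic links, so the totals are a guide only and may differ from the usage figures reported by the service.

```python
import os
import sys


def tree_size(path):
    """Sum the sizes in bytes of all regular files beneath path, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; leave it out of the total.
                pass
    return total


def usage_by_subdir(root):
    """Map each immediate child of root to the total bytes stored under it."""
    totals = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                totals[entry.name] = tree_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                totals[entry.name] = entry.stat(follow_symlinks=False).st_size
    return totals


def largest_first(totals):
    """Return (name, bytes) pairs sorted by size, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # The directory to scan is taken from the command line; defaults to ".".
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in largest_first(usage_by_subdir(root)):
        print(f"{size / 1024**3:10.2f} GiB  {name}")
```

Run it with your own directory on /work as the argument. Scanning a very large directory tree can take some time.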
Stage 3 - 19th May: Phase 2A Capacity Reduction
As part of the Phase 2B upgrade, the Phase 2A quad-core system will be reduced in size from 60 cabinets to 33 cabinets.
The maximum job size supported at this time will be reduced to 8192 cores.
This is currently planned to take place on Weds 19th May.
Stage 4 - 17th June: Phase 2B User Access
The Phase 2B 24-core system is planned to be online for initial configuration and testing on Weds 2nd June.
User access will follow approximately two weeks after this, with availability trials expected to start on 17th June.
All current users will automatically be granted access to the Phase 2B system. We will advise on the exact machine configuration and will provide usage instructions on the website nearer the time.
As far as possible, the work above will be carried out during standard maintenance slots. All dates are subject to change, and we will confirm them nearer the time.
This is a significant program of work and we will do our best to ensure that there is minimal interruption to service.