The HECToR Service is now closed and has been superseded by ARCHER.

Welcome to HECToR News 4, May 2009

Introduction

This is the fourth newsletter for HECToR users from the Computational Science and Engineering (CSE) support team of NAG Ltd. The HECToR newsletter aims to keep users up to date with useful information on the national supercomputing service. For earlier editions, please read the previous issues.

In this issue we have information on HECToR-related training courses, an upgrade to the hardware, general points regarding the HECToR programming environment, and the distributed support service.

Training

The High Performance Computing (HPC) and HECToR training courses run by NAG Ltd. are provided free of charge to HECToR users and UK academics whose work is covered by the remit of one of the participating research councils (EPSRC, NERC and BBSRC). Over the coming months the following courses are scheduled:

  • May 18-19, 2009 University of Cambridge - Introduction to Fortran 95
  • June 1-2, 2009 University of Cambridge - Introduction to MPI
  • June 16-18, 2009 Queen's University Belfast - Parallel Programming with MPI
  • July 1-2, 2009 Queen's University Belfast - OpenMP and Mixed-mode Programming
  • September 7-9, 2009 University of Exeter - Fortran 95
  • September 14-16, 2009 University of Exeter - Parallel Programming with MPI
  • September 21, 2009 University of Exeter - Introduction to HECToR
  • September 22-23, 2009 University of Exeter - OpenMP and Mixed-mode Programming

The majority of the courses are aimed at HPC and HECToR, but there is also training in scientific programming with the non-HPC Fortran 95 course. This is suitable for anyone who would like to learn the language from scratch, or to update or build on their current knowledge. For more information on this new course, along with details of the other courses, please see the course list.

For more information on HECToR training, please see the training page.

Or contact [Email address deleted].

Phase 2a Hardware Upgrade to Quad Core

Upgrade Summary

The main part of HECToR, namely the CRAY XT4, is shortly due to enter Phase 2 of its operational life. Details of the Phase 2 roadmap as presented at the recent town hall meeting can be found here.

The current 5664 dual core AMD Opteron processors will all be replaced with quad core AMD Opteron processors. The upgrade will increase HECToR's theoretical peak performance from around 60TF to over 200TF and also increase the total memory from 35TB to 45TB. Please note that the number of nodes will remain the same at 5664, but the number of cores will double from 11328 to 22656. The amount of memory available per node will grow from 6GB to 8GB. However, a user's code will have less memory available per core.
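
The arithmetic behind these figures can be checked with a short shell sketch. The node count and per-node memory values are those quoted above; the variable names are illustrative only and are not part of any HECToR tool.

```shell
#!/bin/bash
# Figures quoted in the newsletter; variable names are illustrative.
nodes=5664
cores_per_node_before=2    # dual core (Phase 1)
cores_per_node_after=4     # quad core (Phase 2a)
mem_per_node_before_gb=6
mem_per_node_after_gb=8

# Total core counts before and after the upgrade.
total_cores_before=$((nodes * cores_per_node_before))
total_cores_after=$((nodes * cores_per_node_after))

# Memory per core when every core on a node is in use.
mem_per_core_before_gb=$((mem_per_node_before_gb / cores_per_node_before))
mem_per_core_after_gb=$((mem_per_node_after_gb / cores_per_node_after))

echo "Total cores: ${total_cores_before} -> ${total_cores_after}"
echo "Memory per core: ${mem_per_core_before_gb}GB -> ${mem_per_core_after_gb}GB"
```

With all cores in use, memory per core falls from 3GB to 2GB even though memory per node rises.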

More information can be found in the following guide prepared by the CSE team: Preparing for HECToR Phase II (Quad Core).

Upgrade Timeline

The upgrade is planned to take place in two stages during June and July. For the first stage of the XT4 upgrade, half of the machine will be unavailable while it is upgraded with the new quad core AMD Opteron processors. The remaining half of the machine will function as normal with the existing 2832 dual core nodes. The second stage will commence when the newly installed quad core processors become operational. At this point, the remaining dual core half of the machine will be upgraded, and during this period there should be 11328 cores available on the XT4. There will be instances of complete downtime at the beginning of and during these stages in order to facilitate the hardware switch. This plan ensures that users will never have access to a mix of dual and quad core nodes. Please also note that the SeaStar2 interconnect, Lustre filesystem and X2 Vector machine remain unaltered by this upgrade. Details of revised XT accounting during the upgrade period will be published nearer the time.

Detailed planning of the quad core upgrade continues, and CRAY are reasonably confident that they will be in a position to start this process on 8th June. The achievement of this date is contingent on many variables beyond the reasonable control of CRAY (such as the availability of a huge number of discrete parts from the Far East and the usual logistical issues when dealing with airlines and import/export authorities in three continents).

Code Scaling

Codes which currently scale well on the dual core system, and which can distribute the required memory amongst the allocated nodes, should also perform well on the new quad core architecture. However, users may find that some codes only scale up to the corresponding dual core node count. In such cases, where there is no additional benefit from the increased number of cores per node, the codes will require more compute time and AUs. This is because the AU charging mechanism will still be based upon a per node allocation, as it is on the current Phase 1 dual core architecture. Users who find that their code is at a disadvantage due to the quad core upgrade are encouraged to seek help.
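
As a rough illustration of why per node charging matters, the sketch below counts the nodes a job occupies when the nodes are fully or only partly packed. The job size of 1024 tasks and the function name nodes_needed are hypothetical, and no AU rate is assumed.

```shell
#!/bin/bash
# nodes_needed TASKS TASKS_PER_NODE
# Prints the number of whole nodes a job occupies (rounded up).
nodes_needed() {
    local tasks=$1 per_node=$2
    echo $(( (tasks + per_node - 1) / per_node ))
}

tasks=1024   # hypothetical job size
echo "Fully packed (4 tasks per node): $(nodes_needed "$tasks" 4) nodes"
echo "Half packed  (2 tasks per node): $(nodes_needed "$tasks" 2) nodes"
```

A job that can only make use of two cores on each quad core node occupies twice as many nodes as a fully packed job of the same size, and the charge follows the node count.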

The first port of call is to submit a request via SAFE describing the case. The NAG core CSE team will give advice on how the code might be able to get full benefit from the hardware upgrade. It may be possible that after minor improvements to the code, or by using different compilation flags, it will scale better and/or vectorise for faster performance. However, in cases where this is not possible the user will be advised on the further help available.

For further information on how to prepare your codes for the upgrade please see the Phase 2 documentation.

Phase 2a Training and Science Support

To accompany the Phase 2a hardware upgrade there will be extra training and support provided by the NAG CSE team and Cray Centre of Excellence for HECToR.

Please see the Training section above for further details.  Please also note that the CSE team can test your code for quad core performance right now, before the main upgrade takes place. If you need assistance with this, please contact the helpdesk.

CLE 2.1 OS Upgrade

On May 6 2009 the Cray Linux Environment (CLE) on HECToR was upgraded from version 2.0 to version 2.1.

Key changes as a result of this upgrade are detailed below. Details are also available on the user wiki.

Any users requiring assistance should contact the HECToR Helpdesk.

MPT 2.0 Executables No Longer Supported

Binaries compiled under MPT v2.0 will not run and will be blocked automatically.

If a job is aborted for this reason, you will be presented with the following message:

aprun: MPT 2.0 applications no longer supported
aprun: Exiting due to errors. Application aborted

Any such binaries will have to be recompiled against MPT v3.x.

You can identify MPT v2.0 binaries with the command:

find . -type f -a -perm -100 -a -exec grep 'rs64\.REL_2[a-zA-Z0-9/_.]*xt_allreduce.c' {} \; -print

Recompile CLE 2.0 Binaries

We would strongly recommend that all CLE 2.0 compiled executables are recompiled. Failure to recompile may result in application failure, hangs, or other undefined behaviour.

Hybrid OpenMP/MPI with Pathscale

Any users who are running hybrid OpenMP/MPI codes compiled with Pathscale must set PSC_OMP_AFFINITY=FALSE in their job scripts.

Failure to do so will result in performance problems.
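
A minimal sketch of the relevant part of a job script is shown below. The executable name my_hybrid_app, the task count and the thread count are hypothetical, and the aprun call is guarded so that the fragment is harmless on a machine without aprun.

```shell
#!/bin/bash
# Job script fragment (sketch). Executable name and counts are hypothetical.

# Required for hybrid OpenMP/MPI codes compiled with Pathscale under CLE 2.1.
export PSC_OMP_AFFINITY=FALSE

# Assumed layout: 1 MPI task per node with 4 OpenMP threads each.
# Adjust the thread count to the number of cores per node.
export OMP_NUM_THREADS=4

# Only launch where aprun exists.
if command -v aprun >/dev/null 2>&1; then
    aprun -n 8 -N 1 -d "$OMP_NUM_THREADS" ./my_hybrid_app
else
    echo "aprun not found; PSC_OMP_AFFINITY=$PSC_OMP_AFFINITY"
fi
```

The variable is exported before the aprun line so that it is set in the environment from which the application is launched.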

Stack Overflow Errors and ulimit

Some users have reported stack overflow errors since the upgrade to CLE 2.1.

SUSE Linux Enterprise Server 9 set up the user environment with an unlimited stack size resource limit to work around restrictions in stack handling of multithreaded applications. With SUSE Linux Enterprise Server 10, this is no longer necessary and has been removed. The login environment now defaults to the kernel default stack size limit of 8MB. To restore the old behaviour, add

ulimit -s unlimited
to your job submission script.
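
The sketch below reports the stack limit before and after the change, which may help to confirm that the setting has taken effect. It assumes a bash job script; the messages are illustrative.

```shell
#!/bin/bash
# Report the current soft stack limit (bash prints it in kilobytes).
echo "Stack limit before: $(ulimit -s)"

# Raising the limit fails if the hard limit is lower, so handle that case.
if ulimit -s unlimited 2>/dev/null; then
    echo "Stack limit after: $(ulimit -s)"
else
    echo "Could not raise the stack limit; hard limit is $(ulimit -H -s)" >&2
fi
```

The ulimit line must come before the command that launches the application, since the limit applies only to processes started after it is set.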

X2 Binaries Relinking

Applications that run on the Cray X2 compute nodes need to be relinked. There is no need for all source or libraries to be recompiled. Relinking your binary is sufficient. There is a risk that if you do not relink then your application will produce incorrect results.

New Functionality - Huge Pages

Huge pages (2MB) are now supported. Previous CNL versions only supported 4KB pages. 4KB pages remain the default; however, users now have the option to use 2MB pages.

To use huge pages:

  • Link against the huge page library
    cc -c my_app.c
    cc -o my_app my_app.o -lhugetlbfs
  • In your run script set the environment variable "HUGETLB_MORECORE=yes"
  • You must use the -m option to aprun with the appropriate huge page suffix
  • -m <size>h requests ‘size’ huge pages, for example -m 700h would request 700 MB of huge pages per PE.  If the request cannot be satisfied, you will get as many huge pages as possible, and after that you will get 4KB pages.
  • -m <size>hs REQUIRES ‘size’ huge pages, for example -m 700hs would require 700 MB of huge pages per PE.  If the request cannot be satisfied, your application will fail.
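
The run-time part of these steps might be sketched as follows. The executable name my_app, the PE count of 64 and the helper function hugepage_arg are hypothetical; only HUGETLB_MORECORE and the h and hs suffixes to -m come from the steps above.

```shell
#!/bin/bash
# Sketch. Executable name, PE count and size are hypothetical.
export HUGETLB_MORECORE=yes

# hugepage_arg SIZE_MB [strict]
# Builds the value for aprun -m: "<size>h" (best effort) or "<size>hs" (required).
hugepage_arg() {
    local size_mb=$1 mode=${2:-}
    if [ "$mode" = "strict" ]; then
        echo "${size_mb}hs"
    else
        echo "${size_mb}h"
    fi
}

# Only launch where aprun exists; otherwise show the command that would run.
if command -v aprun >/dev/null 2>&1; then
    aprun -n 64 -m "$(hugepage_arg 700)" ./my_app
else
    echo "Would run: aprun -n 64 -m $(hugepage_arg 700) ./my_app"
fi
```

Passing strict as the second argument gives the hs form, which makes the application fail rather than fall back to 4KB pages.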

Details on using huge pages are available in the CRAY XT Programming Guide.

Whether a code benefits from huge pages is very application and data dependent. Possible candidates are codes which exhibit random memory access patterns, for example unstructured mesh applications (unstructured mesh CFD or solid mechanics).

Programming Environment Issues

Accounting on the X2 has been suspended until August 31st 2009 and there has been a significant increase in the usage of the X2. Any new or current HECToR PIs wishing to use the X2 should submit a Class 2 technical assessment form to the Helpdesk.

This form should clearly state the project details. All new X2 HECToR users will be required to attend a specific X2 training course.

For more information on the training course in using the X2 please see the X2 course description.

Mixed mode OpenMP/MPI programming on the X2 is restricted to use on a single node.

This is due to the X2 hardware: OpenMP within MPI is not supported for multi-node applications. However, codes written in Co-array Fortran or Unified Parallel C are able to address shared memory across nodes. Please note that this restriction does not apply to the XT4.

New Applications

  • The popular WYSIWYG 2D plotting tool Grace has now been compiled and installed on HECToR. Here are the steps to use it:

    log in to HECToR using ssh -Y to set up the DISPLAY correctly
    type 'module load grace' to set up the environment
    type 'xmgrace' to start the application.

    Because the application has a graphical user interface and can carry out CPU-intensive calculations, you are strongly advised to use the serial queue to run it. To use the serial queue, you need a job script like the following (remember to replace 'budget' with your actual project code).

    #!/bin/bash
    #
    #PBS -q serial
    #PBS -l cput=00:20:00
    #PBS -A budget
    
    cd $PBS_O_WORKDIR
    xmgrace
    
    
  • NAMD 2.7b1 is now available on HECToR. To use this, you have to load the module namd/2.7b1.  As usual, the name of the executable is namd2.  A sample pbs script for namd jobs is available:

      /usr/local/packages/namd/2.7b1/run/run_namd.pbs

    We also provide an executable with reduced memory consumption.  Its name is namd2-memopt.  Please refer to:

      /usr/local/packages/namd/2.7b1/run/run_namd.pbs

  • CASTEP 4.4 is now available for the X2 Vector Nodes. To use this version of CASTEP 4.4 you must add the 'castep/4.4/x2' module to your environment, i.e. use:
    module add castep/4.4/x2
    

    The CASTEP program will then be available via the command 'castep'.

    You can find more information about writing job submission scripts for the X2 nodes in the HECToR User Guide.

  • VASP 4.6 Gamma is now available on HECToR. This works at the gamma point only, and executes 30%-50% faster than the default, full k-point version.

    The package is available with

    module load vasp/4.6_gamma
    

NAMD, CASTEP and VASP are only available to users who hold current licences. To gain access on HECToR, please email details of your licence arrangements to helpdesk@hector.ac.uk.

Distributed Support

Distributed support is also referred to as dCSE support. Funding is available to provide extended help with improving the performance of existing HECToR codes and developing high-performance algorithmic improvements for them. Support is also available to port new codes from other systems to HECToR. Proposals are assessed by an independent review panel.

For more information, please see here.

There is a list of the current projects that are underway.

The next application deadline is 15th June 2009. This coincides with the Phase 2 upgrade to quad core, so priority will be given to projects that propose specialist support to address any computational effects of this transition. Applicants will be informed of the outcome of their proposals in late July. NAG staff are available to visit institutions to talk about this service. If you are interested in a visit please contact us at [Email address deleted].
