HECToR

Welcome to HECToRNews 2, December 2008

Featuring:

Introduction
Training
Serial Queues and Output from PBS
Code Issues
Application forms for time on HECToR
Distributed Support

Introduction

This is the second newsletter for HECToR users from the Computational Science and Engineering (CSE) support team of NAG Ltd. The HECToR newsletter aims to keep users updated with useful information on the National Supercomputing Service. You can also read the first issue.

In this issue we have information on HECToR-related training courses, general updates regarding the HECToR environment, support issues and information on the distributed support service.

Training

The High Performance Computing (HPC) and HECToR training courses run by NAG Ltd. are provided free of charge to HECToR users and UK academics whose work is funded by one of the participating research councils (EPSRC, NERC and BBSRC). Courses on MPI, OpenMP and Mixed MPI/OpenMP programming techniques are being held in January at Imperial College, London. The Introduction to HECToR and Tools and Techniques for Optimising Parallel Codes courses will be held in early February at the Oxford Office of NAG Ltd.

A non-HPC Fortran 95 course is now being offered to give training in scientific programming. This is suitable for anyone who would like to learn the language from scratch or who would like to update and build on their current knowledge. The course will be run at the Manchester Office of NAG Ltd. on 24-26 February and then later in March at the University of Southampton. Further times and venues will be publicised as they are decided. For more information on the course times and locations please see the training schedule or contact [Email address deleted]

Specific training courses in the major application codes for Computational Chemistry and Engineering are also available. Please see the application course schedule for further details.

Serial Queues and Output from PBS

Serial Queues

When you log in to HECToR you now see the message SERIAL BATCH JOBS NOW SUPPORTED. This feature has recently been added for jobs such as long compilations, large external file transfers or post-processing of data. If you know that your serial job is quite intensive, i.e. it uses more than 10 minutes of CPU time in a 30 minute period, then you should use the serial batch queues. More information and a sample serial queue batch script (bash) can be found in the User Guide at: Batch Processing - Serial Queues. The etiquette for using HECToR has also been updated and is formalised in The HECToR Code of Conduct.

Output from PBS

It is useful to note that the output from PBS does not give a summary of the resources allocated on the compute nodes. It does tell you the resources requested, e.g.:

Resources requested: 
mpparch=XT,mppnppn=2,mppwidth=6,ncpus=1,place=pack,walltime=00:02:00

but the allocated resources statements do not relate to resources on the compute nodes but to those on the login node from which the job was submitted, e.g.:

Resources allocated: 
cpupercent=0,cput=00:00:00,mem=2080kb,ncpus=1,vmem=22244kb,walltime=00:00:02

For details, please see the section Batch Processing - Output from PBS Pro jobs in the HECToR User Guide.
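The two resource lines shown above are comma-separated key=value lists, so they are easy to inspect in a script. The following is a minimal Python sketch, not part of PBS itself; the function name parse_resources is illustrative, and the node count assumes that mppwidth is the total number of processes and mppnppn the number of processes per node.

```python
import math


def parse_resources(line):
    """Split a comma-separated key=value resource string into a dict."""
    return dict(item.split("=", 1) for item in line.strip().split(","))


# The two example lines quoted in this newsletter.
requested = parse_resources(
    "mpparch=XT,mppnppn=2,mppwidth=6,ncpus=1,place=pack,walltime=00:02:00"
)
allocated = parse_resources(
    "cpupercent=0,cput=00:00:00,mem=2080kb,ncpus=1,vmem=22244kb,walltime=00:00:02"
)

# Assumption: mppwidth = total processes, mppnppn = processes per node.
nodes_requested = math.ceil(int(requested["mppwidth"]) / int(requested["mppnppn"]))

print("Compute nodes requested:", nodes_requested)
# The allocated values describe the login node, not the compute nodes.
print("Login-node memory reported:", allocated["mem"])
```

Only the requested line says anything about the compute nodes; the allocated line should not be used to estimate what the parallel job actually consumed.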

Code Issues

It has become evident from users' queries that there is a bug in DFT Gradients of NWChem version 5.0. Erroneous results in DFT gradients can occur when CD fitting is used. This bug has been fixed in NWChem version 5.1 which is now available on HECToR. This latest version of NWChem uses Cray portals and is more efficient and scalable. Version 5.0 had quite a few problems relating to memory requirements under certain configurations. It is hoped that the majority of these can now be solved by using version 5.1. Try it with module load nwchem/5.1.

We were recently porting a user's Fortran 90 CFD code to HECToR. Initially, the code compiled on HECToR with the PGI compiler but gave incorrect numerical results when it ran. To debug the code we tested it without optimisation and used several common checking flags, but this did not find the problem. The NAG Fortran compiler helped us to remove all non-standard features of the code, which were causing some of the problems, and finally TotalView helped us to track down the offending division by zero. The code now compiles and runs successfully under the PGI, PathScale and GNU compilers, with the PGI compiler giving an overall performance increase of 10% over previous results.

Application forms for time on HECToR

You can access the form to apply for access time to HECToR online. This must be completed in order to apply for a resource allocation of AUs (allocation units) for a project.

When you make your first application for HECToR resources, the AU calculations can be a little tricky. Within the form, a processor refers to a dual-core XT4 node (or a quad-core XT4 node for Phases 2 and 3) or an X2 quad-core vector node, whichever is applicable. Separate tables for XT4 and X2 job profiles should be given if applicable. As a first step, calculate your CPU time in processor hours: if you know the number of processors and the number of hours of wall clock time that your jobs will run for, simply multiply them. For example, suppose you have a job that will run with 512 MPI processes for 6 hours. In Phase 1, if the job can use each dual-core XT4 node (processor) entirely, it will run on 256 processors and use 256 x 6 = 1536 processor hours. To get the AUs required for Phase 1, multiply the processor hours by 10. In Phase 2 (and currently Phase 3) we assume that the job can use each quad-core XT4 node (processor) entirely, so it will need only 128 processors, giving 128 x 6 = 768 processor hours, which are multiplied by 40.

For example:
Phase 1 (Oct 2007 - Sep 2009): 1536 x 10 = 15,360 (AU).
Phase 2 (Oct 2009 - Sep 2011): 768 x 40 = 30,720 (AU).
Phase 3 (Oct 2011 - Sep 2013): 768 x 40 = 30,720 (AU).

For vector X2 jobs, you multiply the processor hours by 20 in all of Phases 1, 2 and 3. For 1536 processor hours: 1536 x 20 = 30,720 (AU).
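The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the worked example, not an official calculator: the function and table names are made up, and the multipliers and cores per node are the figures quoted in this newsletter, which may change once the Phase 2 and 3 hardware is finalised.

```python
import math

# Figures quoted in this newsletter (Phase 2 and 3 values are provisional).
XT4_CORES_PER_PROCESSOR = {1: 2, 2: 4, 3: 4}
XT4_AU_PER_PROCESSOR_HOUR = {1: 10, 2: 40, 3: 40}
X2_AU_PER_PROCESSOR_HOUR = 20


def xt4_aus(mpi_processes, hours, phase):
    """AUs for an XT4 job that fills every core of each node it uses."""
    processors = math.ceil(mpi_processes / XT4_CORES_PER_PROCESSOR[phase])
    processor_hours = processors * hours
    return processor_hours * XT4_AU_PER_PROCESSOR_HOUR[phase]


def x2_aus(processor_hours):
    """AUs for an X2 vector job, given its processor hours."""
    return processor_hours * X2_AU_PER_PROCESSOR_HOUR


# The worked example: 512 MPI processes for 6 hours.
for phase in (1, 2, 3):
    print("Phase", phase, "XT4:", xt4_aus(512, 6, phase), "AU")
print("X2:", x2_aus(1536), "AU")
```

For the 512-process, 6-hour job this reproduces 15,360 AU in Phase 1 and 30,720 AU in Phases 2 and 3, and 30,720 AU for 1536 X2 processor hours.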

The different multiplication factors for Phases 1, 2 and 3 occur because they have been calculated from the overall total peak performance (in Tflops) of HECToR. For Phases 2 and 3 this is not certain as the hardware has not been finalised, so we assume that the peak performance will at least double. This is why we multiply the CPU time by 40 rather than 20 in Phases 2 and 3. It is worth noting, however, that the notional cost will be less for each subsequent phase. Please also see Cost of Access to HECToR.

If you only want to use a single core per node, possibly due to memory requirements, then you will be charged for the entire node as nobody else can run their code on that particular node while it is allocated to your job.
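To see what whole-node charging means for an estimate, the following Python sketch counts the nodes (processors) a job occupies when it places fewer processes on each node than there are cores. The function names are hypothetical, and the factor of 10 AU per processor hour is the Phase 1 XT4 figure used elsewhere in this newsletter.

```python
import math


def processors_charged(mpi_processes, processes_per_node):
    """Number of nodes occupied; each one is charged in full."""
    return math.ceil(mpi_processes / processes_per_node)


def phase1_xt4_aus(mpi_processes, processes_per_node, hours):
    """Phase 1 XT4 estimate, assuming 10 AU per processor hour."""
    return processors_charged(mpi_processes, processes_per_node) * hours * 10


# 512 MPI processes for 6 hours on dual-core nodes.
packed = phase1_xt4_aus(512, 2, 6)      # both cores of every node used
one_per_node = phase1_xt4_aus(512, 1, 6)  # one core per node, e.g. for memory

print("Fully packed:", packed, "AU")
print("One process per node:", one_per_node, "AU")
```

Running one process per dual-core node doubles the number of nodes occupied, and so doubles the charge for the same number of MPI processes.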

Distributed Support

This is also referred to as dCSE support. dCSE funding is available to provide extended help with improving the performance of existing HECToR codes and developing high-performance algorithmic improvements. Support is also available to port new codes from other systems to HECToR. Awards for proposed dCSE projects are assessed via an independent review panel. For more information, please see the dCSE Section.

The next application deadline is 23 March 2009. Applicants from the December round will be informed of the outcome of their proposals at the end of January. NAG staff are available to visit institutions to talk about this service. If you are interested in a visit please contact us at [Email address deleted].
