Introduction

A large number of point-to-point MPI communications from many processes to, say, a master process during a checkpoint operation can lead to an MPI "out of unexpected buffer space" error. If the matching receive calls have not been pre-posted, the MPI implementation stores any "unexpected" messages in a dedicated buffer until the corresponding receive call is reached. One way to avoid this error is to increase the size of the unexpected-message buffer. This is undesirable, as that memory is reserved for the duration of the job, at the expense of the memory available for the calculation itself. In this section we discuss a solution to this problem using MPI collectives, in the context of the electronic structure code CASTEP [1].

For most uses of CASTEP, the largest single piece of data is the wave function, essentially represented as a four-dimensional, double-precision complex array. The four dimensions are over plane waves (G-vectors), bands, k-points and spin. In a parallel HPC environment it is currently possible for CASTEP to distribute the wave function over k-points, bands and G-vectors. Typical "large" CASTEP calculations have fewer than 10 k-points, often only one (though certain classes of phonon calculations require 1000s of k-points); of the order of 100-1000s of bands; G-vectors numbering 10^4-10^5; and spin is always 1 or 2 in the collinear case. The wave function is written to disk as part of the CASTEP checkpoint process, understandably labelled a `.check' file, which can reach many GB in size. During some CASTEP calculations, auxiliary files containing wave functions are also used, in order to make jobs restartable from a partially completed point or for further analysis.

The routines in CASTEP that perform the wave function I/O are appropriately named wave_read and wave_write. In the current 5.5 release of CASTEP, the disk I/O of the write operation is done by what we will refer to as the root node. (Internally, CASTEP refers to processes as running on separate nodes; this does not necessarily mean they are running on physically separate compute nodes.) Each node sends its wave function data to the master of its band group, which in turn sends the collected data to the master of its G-vector group, which then passes it on to the root node for disk output. All of this is done using MPI point-to-point communications.

As an aside, we would like to discuss why MPI-IO is not used. The CASTEP .check file is an unformatted Fortran file that uses big-endian byte order; this provides portability across any environment in which CASTEP is used. The CASTEP developer group has specified that backwards compatibility with the existing format of the .check file must be maintained. Using MPI-IO (or NetCDF or HDF5) would allow each MPI process to access the .check file in parallel, and NetCDF and HDF5 also provide architecture-neutral files. (MPI-IO does allow for machine-independent files, but vendors rarely implement this.) However, none of these methods is compatible with unformatted Fortran records, and they therefore break compatibility with existing CASTEP .check files.