A large number of MPI point-to-point communications from many
processes to a single process (say, a master process during a checkpoint operation) can lead to
an MPI "out of unexpected buffer space" error. If the matching receive calls (e.g. MPI_Recv)
have not been pre-posted, the MPI implementation stores these "unexpected" messages
in memory until the corresponding receive call is reached. One way to avoid this error is
to increase the size of the unexpected buffer. This is undesirable because the memory
is reserved for the duration of the job, reducing the memory available for the
calculation itself. In this section we discuss a solution
to this problem using MPI collectives, in the context of the electronic structure
code CASTEP.
For most uses of CASTEP, the largest single piece of data is
the wave function, which is essentially a four-dimensional,
double-precision complex array. The four dimensions are plane waves
(G-vectors), bands, k-points and spin. In a parallel HPC environment,
CASTEP can currently distribute the wave function over k-points,
bands and G-vectors. Typical "large" CASTEP calculations have fewer than 10
k-points, often only one (although certain classes of phonon calculations require thousands of
k-points); of the order of hundreds to thousands of bands; of the order of 10^4-10^5 G-vectors; and
a spin dimension of 1 or 2 in the collinear case. The wave function is written to
disk as part of the CASTEP checkpoint process, in what is understandably labelled a .check
file, which can reach many GB in size. Some CASTEP calculations also write auxiliary
files containing wave functions, either so that a job can be restarted
from a partially completed state or for further analysis.
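
To give a feel for the sizes involved, the following minimal sketch estimates the memory footprint of a wave function from its four dimensions. It assumes a dense array in which every k-point has the same number of G-vectors and every element is a 16-byte double-precision complex number. The function name and the example dimensions are illustrative only and are not taken from CASTEP.

```python
def wavefunction_bytes(n_gvectors, n_bands, n_kpoints, n_spins,
                       bytes_per_element=16):
    """Size in bytes of a dense 4D double-precision complex array.

    16 bytes per element = two 8-byte doubles (real and imaginary parts).
    """
    return n_gvectors * n_bands * n_kpoints * n_spins * bytes_per_element


# Illustrative "large" calculation (assumed values, chosen from the
# ranges quoted above): 10^5 G-vectors, 1000 bands, 4 k-points, 1 spin.
size = wavefunction_bytes(10**5, 1000, 4, 1)
print(f"{size / 1e9:.1f} GB")  # prints "6.4 GB"
```

With these assumed dimensions the wave function alone occupies several GB, which is consistent with the checkpoint file sizes quoted above.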
The routines in CASTEP that perform the wave function I/O are
appropriately named wave_read and wave_write. In the current 5.5 release of CASTEP,
the disk I/O of the write operation is performed by what we will refer to as the root node.
(Internally, CASTEP refers to MPI processes as nodes; this does not
necessarily mean that they are running on physically separate compute nodes.) Each
node sends its wave function data to the master in its band group, which in turn
sends the data to its G-vector master, which then passes it on to the root node for
disk output. All of this is done using MPI point-to-point communications.
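
The staged relay described above can be illustrated with a small, MPI-free simulation. This is only a sketch: the labelling of processes as (g, b, p) triples and the way the groups are nested are assumptions made for illustration, not CASTEP's actual process layout, and each list append stands in for one point-to-point message.

```python
from itertools import product


def staged_gather(n_gvec_groups, n_band_groups, procs_per_band_group):
    """Simulate a three-stage relay of per-process chunks to one root.

    Processes are labelled (g, b, p). This nesting is a hypothetical
    layout, not CASTEP's. Returns the chunks held by the root and the
    number of point-to-point messages sent in each stage.
    """
    # Every process starts out holding only its own chunk.
    held = {
        proc: [proc]
        for proc in product(range(n_gvec_groups),
                            range(n_band_groups),
                            range(procs_per_band_group))
    }
    messages = [0, 0, 0]

    def send(src, dst, stage):
        # One point-to-point message: dst receives everything src holds.
        held[dst].extend(held[src])
        held[src] = []
        messages[stage] += 1

    # Stage 1: each process sends to the master (p == 0) of its band group.
    for g, b, p in list(held):
        if p != 0:
            send((g, b, p), (g, b, 0), 0)
    # Stage 2: each band-group master sends to its G-vector master (b == 0).
    for g in range(n_gvec_groups):
        for b in range(1, n_band_groups):
            send((g, b, 0), (g, 0, 0), 1)
    # Stage 3: each G-vector master sends to the root (g == 0).
    for g in range(1, n_gvec_groups):
        send((g, 0, 0), (0, 0, 0), 2)

    return held[(0, 0, 0)], messages


root_chunks, messages = staged_gather(2, 3, 4)
print(len(root_chunks), messages)  # prints "24 [18, 4, 1]"
```

In this toy layout every chunk ends up on the root, and the total number of messages is one fewer than the number of processes. If the receiving side is slow to post its receives, each of those messages is a candidate for the unexpected-message buffer.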
As an aside, we briefly discuss why MPI-IO is not used. The CASTEP .check file is an unformatted Fortran file that uses big-endian byte order, which makes it portable across the environments in which CASTEP is used. The CASTEP developer group specified that backwards compatibility with the existing .check format must be maintained. Using MPI-IO (or NetCDF or HDF5) would allow each MPI process to access the .check file in parallel, and NetCDF and HDF5 also provide architecture-neutral files. (MPI-IO does allow for machine-independent files, but vendors rarely implement this.) However, none of these methods is compatible with unformatted Fortran records, so adopting any of them would break compatibility with existing CASTEP .check files.
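
To make the record structure concrete, the sketch below frames data in the way many Fortran compilers lay out unformatted sequential files: each record's payload is preceded and followed by a marker holding its length in bytes. The Fortran standard does not mandate this layout; a 4-byte marker is a common compiler default and is assumed here. The two records written are purely illustrative and do not reproduce the actual contents of a .check file.

```python
import struct

# Big-endian signed integer formats for the two common marker widths.
_MARKER_FMT = {4: ">i", 8: ">q"}


def pack_record(payload: bytes, marker_bytes: int = 4) -> bytes:
    """Frame payload as one record: length marker, payload, length marker."""
    marker = struct.pack(_MARKER_FMT[marker_bytes], len(payload))
    return marker + payload + marker


def unpack_records(blob: bytes, marker_bytes: int = 4) -> list:
    """Split a byte string into record payloads, checking both markers."""
    fmt = _MARKER_FMT[marker_bytes]
    records, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(fmt, blob, pos)
        start = pos + marker_bytes
        payload = blob[start:start + length]
        (trailer,) = struct.unpack_from(fmt, blob, start + length)
        if trailer != length:
            raise ValueError("leading and trailing record markers disagree")
        records.append(payload)
        pos = start + length + marker_bytes
    return records


# Illustrative data: a count record followed by a record of complex
# coefficients, each stored as two big-endian doubles (real, imaginary).
coeffs = [complex(1.0, -2.0), complex(0.5, 0.25)]
payload = b"".join(struct.pack(">2d", z.real, z.imag) for z in coeffs)
blob = pack_record(struct.pack(">i", len(coeffs))) + pack_record(payload)

print(len(blob))  # prints "52": (4 + 4 + 4) + (4 + 32 + 4) bytes
```

Because each record is located by walking the markers of the records before it, the format is inherently sequential. Any writer that is to remain compatible has to reproduce these markers exactly.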