We introduced a new user-definable parameter $N_{\mathrm{G}}$, denoting the number of processor groups, each responsible for a different set of $\mathbf{k}$-points. We want to allocate a set of $\mathbf{k}$-points to each processor group, and the processing nodes inside each group will then work on those $\mathbf{k}$-points. The cyclic allocation technique is the most efficient in terms of evenly distributing the work-load, so that the processor-group index $i_{\mathrm{G}}$ relates to the processing-node index $i_{\mathrm{proc}}$ and the $\mathbf{k}$-point index $i_{\mathbf{k}}$ as

$i_{\mathrm{G}} = \mathrm{mod}(i_{\mathrm{proc}} - 1, N_{\mathrm{G}}) + 1, \qquad i_{\mathrm{G}} = \mathrm{mod}(i_{\mathbf{k}} - 1, N_{\mathrm{G}}) + 1$   (14)
We introduced arrays to keep track of which processors and $\mathbf{k}$-points are allocated to which processor group. These arrays can be worked out following the steps given in algorithm 2. Note, however, that the algorithm is of order $N_{\mathrm{proc}}$ (where $N_{\mathrm{proc}}$ is the total number of processors), because the same algorithm is repeated on every processor. The calculations are simple, however, and only have to be done once at the beginning of the calculation; furthermore, no communication is needed between the processors.
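The cyclic allocation of equation (14) can be sketched as follows. This is an illustrative Python sketch, not the CONQUEST (Fortran 90) source; the names `cyclic_allocation`, `proc_to_group` and `kpt_to_group` are hypothetical, chosen only to mirror the bookkeeping arrays described above.

```python
# Sketch of the cyclic allocation in equation (14): every processor runs
# the same O(N_proc) loop, so no inter-processor communication is needed.

def cyclic_allocation(n_procs, n_kpts, n_groups):
    """Assign 1-indexed processors and k-points to groups cyclically."""
    # Group of processor p: mod(p - 1, N_G) + 1
    proc_to_group = [(p - 1) % n_groups + 1 for p in range(1, n_procs + 1)]
    # Group of k-point j: mod(j - 1, N_G) + 1
    kpt_to_group = [(j - 1) % n_groups + 1 for j in range(1, n_kpts + 1)]
    return proc_to_group, kpt_to_group

# With 8 processors and 2 groups (the Figure 11 setup), group 1 holds
# processors 1, 3, 5, 7 and group 2 holds processors 2, 4, 6, 8.
procs, kpts = cyclic_allocation(n_procs=8, n_kpts=5, n_groups=2)
group1 = [p for p, g in enumerate(procs, start=1) if g == 1]
```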
The goal is to define a mapping between the ScaLAPACK block-cyclic format, distributed over a processor grid consisting of only a subset of processors for a given $\mathbf{k}$-point, and the native CONQUEST format, which distributes the rows of the matrices into primary-set atoms and partitions across the entire set of processors. In the original CONQUEST implementation, a reference matrix format was introduced to aid the mapping. This is the format the matrices would have if one wrote them down on paper for a calculation; reference-format matrices are never physically stored. Also, in the original implementation, noting that the order of atoms in a calculation can be arbitrary, the atomic ordering of the rows of the ScaLAPACK format was chosen to be the same as that of the native CONQUEST format. In other words, the atomic (row) ordering of the ScaLAPACK matrices $H(\mathbf{k})$ and $S(\mathbf{k})$ is chosen to be the same as the ordering of the rows of the matrices $H$ and $S$ stored in CONQUEST format. We have kept this requirement for the implementation of $\mathbf{k}$-point parallelisation.
Figure 11 shows the possible mapping of matrices if there are 8 processors in total and the user has chosen 2 processor groups. The ScaLAPACK blocks on the assumed BLACS process grid are shown as purple boxes. Due to cyclic allocation, the first group has processors 1, 3, 5, 7 (note that we index the processors from 1) and the second group has processors 2, 4, 6, 8. The ScaLAPACK matrix for each group has an identical format; the matrix elements differ only in which $\mathbf{k}$-point is used to generate the $H(\mathbf{k})$ or $S(\mathbf{k})$ matrices from the CONQUEST matrices $H$ or $S$. The coloured squares in the CONQUEST matrix show the periodic images of atoms that all belong to the neighbourhood of a given primary-set atom; their contributions are summed according to equation (18) and stored in the corresponding coloured location in the ScaLAPACK matrix. The numbers inside the small boxes identify the processor responsible for each data block.
Since the atomic ordering of the rows and columns of the ScaLAPACK matrices is unchanged from the original implementation, the subroutines find_SC_row_atoms, find_ref_row_atoms and find_SC_col_atoms of the ScalapackFormat module do not need to be modified. However, the rest of the module, as well as the data-transfer routines in the DiagModule module, do need modification, to take into account that the mapping to the BLACS processor grid now also depends on the processor group we are in, and that we have to send one copy of the matrices to each of the processor groups.
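The block-cyclic ownership rule that this mapping has to respect can be sketched in a few lines. This is a generic illustration of the standard ScaLAPACK two-dimensional block-cyclic distribution, not the ScalapackFormat module itself; the function name and the assumption that the distribution starts on process (0, 0) are hypothetical.

```python
# Sketch: which process coordinate in a (nprow x npcol) BLACS grid owns
# global matrix element (i, j) under the block-cyclic distribution with
# block sizes mb x nb. In the k-point parallel scheme each processor
# group has its own such grid, so this map is applied per group.

def owner_coords(i, j, mb, nb, nprow, npcol):
    """Return (prow, pcol) owning global element (i, j), all 0-indexed,
    assuming the distribution starts on process (0, 0)."""
    prow = (i // mb) % nprow   # block row index, wrapped around the grid
    pcol = (j // nb) % npcol   # block column index, likewise
    return prow, pcol
```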
To obtain the eigenvectors, the ScaLAPACK subroutine pzhegvx is called on every processor, with each processor group using a different set of $H(\mathbf{k})$ and $S(\mathbf{k})$ (corresponding to a different $\mathbf{k}$). Upon return from the ScaLAPACK call, each processor group will have a different set of eigenvectors. To build the density matrix we first map the ScaLAPACK-format eigenvectors into the native CONQUEST format, then accumulate the products of the eigenvectors one $\mathbf{k}$-point at a time (see equation (19)). This is preferred because we can then utilise the pre-existing code for generating the density matrix, and only one storage array for eigenvectors in CONQUEST format is needed on each processor, as the contributions from the eigenvectors are accumulated into the density matrix. Also, performance-wise, the main benefit of $\mathbf{k}$-point parallelisation comes from the freedom it offers for optimising the ScaLAPACK routine; building the density matrix is simply a scalar multiplication-and-accumulation process, so it is perfectly efficient to build the density matrix using the parallel processing format provided by the native CONQUEST approach (please see [5] for details on how matrix operations are handled in CONQUEST).
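The accumulation step can be sketched as below. This is an illustrative serial sketch of the multiply-and-accumulate structure (one $\mathbf{k}$-point's contribution, in the spirit of equation (19)); the function name and the argument names (`occ` for band occupancies, `weight` for the $\mathbf{k}$-point weight) are hypothetical, and the real code operates on the distributed CONQUEST matrix format rather than dense NumPy arrays.

```python
# Sketch of accumulating one k-point's contribution to the density
# matrix: K_ij += w_k * sum_n f_nk * c_n(i) * conj(c_n(j)),
# with eigenvector n stored as row n of `coeffs` (CONQUEST convention).
import numpy as np

def accumulate_density(K, coeffs, occ, weight):
    """Add one k-point's contribution to density matrix K in place."""
    for n in range(coeffs.shape[0]):
        c = coeffs[n]                                  # eigenvector n (a row)
        K += weight * occ[n] * np.outer(c, np.conj(c))  # rank-1 update
    return K
```

Because each update is a real-weighted rank-1 outer product $c\,c^{\dagger}$, the accumulated density matrix stays Hermitian.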
Some attention needs to be paid to the mapping from the output eigenvectors, distributed following the ScaLAPACK block-cyclic format and the BLACS processor grid, to the distribution format for matrices used by CONQUEST. The first important fact to note is that the ScaLAPACK subroutine pzhegvx returns the eigenvectors in columns, in the same order as the corresponding eigenvalues (see Figure 12). On the other hand, in the CONQUEST format each processor is required to be in charge of a set of atoms in the FSC, in other words, a set of row indices of the eigenvectors. Unlike conventional matrices in CONQUEST, which are stored in row-major format (i.e. not strictly a matrix in the FORTRAN 90 sense), the eigenvectors are stored as a two-dimensional array. We require the data associated with each atomic row index to stay in one contiguous piece of memory; this means a column-major storage for the eigenvectors. In other words, in the CONQUEST format the eigenvectors are stored transposed with respect to the ScaLAPACK output; that is, the eigenvectors are stored in rows. The order of the eigenvalues also changes from the ScaLAPACK format, because the CONQUEST-format ordering should be the same as that of the reference format.
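The layout change can be sketched as a transpose. This is an illustrative NumPy sketch only (NumPy is row-major, whereas the contiguity argument above is stated for Fortran's column-major arrays); the function name is hypothetical, and the real code also permutes the distributed data between processors, which is not shown.

```python
# Sketch: pzhegvx returns eigenvectors as columns; the CONQUEST layout
# stores eigenvector n as row n, so that the data belonging to one
# atomic row index ends up contiguous in the chosen storage order.
import numpy as np

def to_conquest_layout(evec_cols):
    """evec_cols[:, n] is eigenvector n (ScaLAPACK convention).
    Return an array whose row n is eigenvector n."""
    return np.ascontiguousarray(evec_cols.T)
```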
The steps taken to generate the new density are illustrated in pseudocode in algorithm 3. One has to be careful not to step over array limits, since it is highly likely that the number of $\mathbf{k}$-points in each processor group differs; hence an if statement is required to guard the loop.
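The need for the guard can be sketched as follows. This is an illustrative sketch, not algorithm 3 itself; the function names are hypothetical, and the cyclic assignment matches equation (14).

```python
# Sketch: with cyclic allocation the groups may hold different numbers
# of k-points (e.g. 5 k-points over 2 groups gives 3 and 2), so a loop
# shared by all groups runs to the largest count and each group must
# test whether it still has a k-point to process in a given round.

def kpts_of_group(n_kpts, n_groups, group):
    """1-indexed k-points assigned cyclically to `group` (1-indexed)."""
    return [j for j in range(1, n_kpts + 1)
            if (j - 1) % n_groups + 1 == group]

def process_all(n_kpts, n_groups, group, handle):
    mine = kpts_of_group(n_kpts, n_groups, group)
    rounds = -(-n_kpts // n_groups)   # ceil: most k-points in any group
    for r in range(rounds):
        if r < len(mine):             # the guard: this group may have fewer
            handle(mine[r])
```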
Lianheng Tong 2011-03-02