Code Status Before Start of Project

The UKRMol-in suite had been run on a Linux cluster at the OU, on the Dell Legion cluster at UCL, on an SGI machine and on Linux workstations (the most powerful having 64 GB of memory and 8 processors). Most ``production'' runs were serial and could take anywhere from a few hours to a couple of months. More than 99% of that time is taken by the Hamiltonian diagonalization step, and a checkpoint scheme is in place so that this step can be restarted if necessary. Memory and disk requirements vary, but for large runs each can reach tens of GB.

For example, a medium-sized calculation for thymine (C5H6N2O2) using a cc-pVDZ basis set and a (14,10) active space generates a Hamiltonian of size around 185000 $\times$ 185000. CONGEN takes around 76 s of CPU time to generate the configurations, and SCATCI requires 422 minutes of CPU time for the Hamiltonian construction alone. Using the serial ARPACK diagonalizer, $\sim$ 33 GB of memory are used when 5000 eigenvectors are requested and the job takes 2606 minutes. All times are for an Intel Xeon E5540 CPU @ 2.53 GHz, with the executable compiled with Intel v11.1.

An OpenMP version of the program, which uses the ARPACK diagonalization routine, has been produced by Dr Pavlos Galiatsatos. On a 64 GB Linux workstation (Xeon processors), for a matrix of dimension 524,000 with 3000 roots requested, this implementation achieves a speed-up of 2.5 on four cores and 4.0 on eight cores, that is, parallel efficiencies of roughly 63% and 50%. This OpenMP version has also been run on an SGI machine.

The current serial version of SCATCI interfaces with the well-known ARPACK library, which is based on an algorithmic variant of the Arnoldi/Lanczos process called the Implicitly Restarted Arnoldi/Lanczos Method (IRAM). ARPACK adopts a reverse communication interface, which in practice means that the calling program must supply the sparse matrix-vector (spMV) multiplication itself.
In the current serial version of SCATCI, the spMV multiplication is implemented within the subroutine MKARP. Within MKARP, the lower triangular part of the symmetric $H^{N+1}$ matrix is read into RAM from a sequential access file, where it is stored in an unordered, symmetric coordinate (COO) format. The eigenvalues and eigenvectors are then obtained by calling ARPACK's DSAUPD subroutine repeatedly until convergence: each time DSAUPD returns through the reverse communication interface with a request for a matrix-vector product, the in-house spMV routine supplies it.
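
The interplay between a reverse communication solver and a caller-supplied symmetric COO product can be illustrated with a small, self-contained Python sketch. It is not the MKARP code and does not call ARPACK: `ToyReverseCommSolver` is a hypothetical stand-in that uses plain power iteration instead of IRAM, and all function and variable names are assumptions made for illustration. Only the flag values are chosen to resemble ARPACK's `ido` convention.

```python
import math


def sym_coo_matvec(n, rows, cols, vals, x):
    """Return y = H x when only the lower triangle of symmetric H is stored
    as unordered COO triples (rows[k], cols[k], vals[k])."""
    y = [0.0] * n
    for i, j, h in zip(rows, cols, vals):
        y[i] += h * x[j]
        if i != j:
            # Mirror the stored entry: H[j][i] equals H[i][j] by symmetry.
            y[j] += h * x[i]
    return y


class ToyReverseCommSolver:
    """Hypothetical reverse-communication eigensolver (power iteration).

    The solver never sees the matrix. It hands back a vector x and a flag,
    and the caller is expected to return y = H x on the next call.
    """

    NEED_MATVEC = 1   # resembles ido = 1 in ARPACK (compute y = OP * x)
    DONE = 99         # resembles ido = 99 in ARPACK (iteration finished)

    def __init__(self, n, tol=1e-12, max_iter=1000):
        self.x = [1.0 / math.sqrt(n)] * n   # unit-norm starting vector
        self.tol = tol
        self.max_iter = max_iter
        self.iterations = 0
        self.value = math.inf               # current eigenvalue estimate

    def step(self, y=None):
        """Advance one step. Pass y = H x for the x returned previously."""
        if y is not None:
            self.iterations += 1
            # x has unit norm, so x.y is the Rayleigh quotient.
            new_value = sum(xi * yi for xi, yi in zip(self.x, y))
            norm = math.sqrt(sum(yi * yi for yi in y))
            converged = abs(new_value - self.value) < self.tol
            self.value = new_value
            self.x = [yi / norm for yi in y]
            if converged or self.iterations >= self.max_iter:
                return self.DONE, self.x
        return self.NEED_MATVEC, self.x


def largest_eigenpair(n, rows, cols, vals):
    """Caller-side loop: keep supplying matrix-vector products until done."""
    solver = ToyReverseCommSolver(n)
    ido, x = solver.step()
    while ido == ToyReverseCommSolver.NEED_MATVEC:
        y = sym_coo_matvec(n, rows, cols, vals, x)
        ido, x = solver.step(y)
    return solver.value, x
```

The design point this is meant to show is that the solver holds no reference to the matrix or its storage format, so the product routine can be replaced (for instance by a parallel one) without touching the solver.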

Most of the compute time within IRAM is spent in the spMV multiplication stage, so initial efforts have focused on parallelizing this stage of the algorithm. To this end, simple modifications to the existing serial spMV multiplication have been made using OpenMP DO directives. Since many of the subroutines within ARPACK call LAPACK and BLAS subroutines, multi-threaded parallelism was also introduced into these stages of the IRAM algorithm by linking against optimized threaded LAPACK and BLAS libraries, such as Intel's Math Kernel Library (MKL), with the relevant flags included at compile time.
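
One point worth keeping in mind when a loop over symmetric COO entries is shared among threads is that two different entries may update the same element of the result vector. The text above does not say how the OpenMP version deals with this, so the following Python sketch shows just one common strategy, stated here as an assumption: each worker accumulates into a private copy of the result, and the copies are summed afterwards, much like a reduction over an array. The function names are hypothetical, and because of CPython's global interpreter lock the sketch illustrates the decomposition only and is not expected to run faster than the serial loop.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sym_coo_matvec(n, entries, x):
    """Contribution to y = H x from one chunk of lower-triangle COO entries.

    Each call writes only to its own private vector, so concurrent calls
    cannot interfere with one another.
    """
    y = [0.0] * n
    for i, j, h in entries:
        y[i] += h * x[j]
        if i != j:
            y[j] += h * x[i]   # symmetric counterpart of the stored entry
    return y


def threaded_sym_coo_matvec(n, entries, x, num_workers=4):
    """Split the entries into contiguous chunks, process the chunks in
    worker threads, then sum the private partial results."""
    chunk = max(1, -(-len(entries) // num_workers))   # ceiling division
    chunks = [entries[k:k + chunk] for k in range(0, len(entries), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(
            pool.map(lambda c: partial_sym_coo_matvec(n, c, x), chunks)
        )
    # Reduction step: add the private vectors element by element.
    y = [0.0] * n
    for p in partials:
        for k in range(n):
            y[k] += p[k]
    return y
```

The private-copy approach costs one extra vector of length n per worker. Alternatives such as atomic updates avoid that memory overhead at the price of synchronization on every write.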

Paul Roberts 2012-06-01