A survey of CASINO and of the dCSE problems

CASINO is a QMC software package developed and maintained over the last 10 years at the Cavendish Laboratory, University of Cambridge, UK [2,3].

At its core, CASINO's QMC algorithm uses configurations of $N_e$ three-dimensional random walkers (RW), which evolve according to the probability distribution associated with the trial wavefunction in a VMC calculation, Eq. ([*]), or according to the Schrödinger equation in a DMC calculation, Eq. ([*]).
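As a rough illustration of the VMC case, the following Python sketch moves the $N_e$ electrons of a single configuration with a Metropolis acceptance test, so that the configuration samples $|\Psi_T|^2$. The trial wavefunction (a product of $e^{-\alpha r_i}$ factors), the step size and all function names are assumptions made for the example; this is not CASINO's implementation.

```python
import math
import random


def log_psi_squared(config, alpha):
    """Return log|psi_T(R)|^2 for the toy trial wavefunction prod_i exp(-alpha*r_i)."""
    return -2.0 * alpha * sum(math.sqrt(x * x + y * y + z * z) for x, y, z in config)


def metropolis_sweep(config, alpha, step, rng):
    """Propose one uniform random move per electron; accept with min(1, ratio)."""
    accepted = 0
    for i in range(len(config)):
        old_position = config[i]
        log_old = log_psi_squared(config, alpha)
        # Symmetric proposal: shift each coordinate by a uniform random amount.
        config[i] = tuple(c + rng.uniform(-step, step) for c in old_position)
        log_ratio = log_psi_squared(config, alpha) - log_old
        if rng.random() < math.exp(min(0.0, log_ratio)):
            accepted += 1
        else:
            config[i] = old_position  # rejected: restore the old position
    return accepted


def sample_mean_radius(n_electrons, alpha, n_sweeps, burn_in, step, seed):
    """Average electron-origin distance over the sampled configurations."""
    rng = random.Random(seed)
    config = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
              for _ in range(n_electrons)]
    total, count = 0.0, 0
    for sweep in range(burn_in + n_sweeps):
        metropolis_sweep(config, alpha, step, rng)
        if sweep >= burn_in:  # discard the equilibration phase
            for x, y, z in config:
                total += math.sqrt(x * x + y * y + z * z)
                count += 1
    return total / count
```

For this toy wavefunction the exact mean distance is $3/(2\alpha)$, so a call such as `sample_mean_radius(2, 1.0, 20000, 1000, 1.0, 1234)` should give a value close to 1.5, up to statistical error.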

A typical calculation involves the following main steps:

Two main factors determine CASINO's performance as a function of the number of electrons in a given model: i) the memory needed to store the orbital values used by the Monte Carlo algorithm and ii) the scaling of the computation time with the number of electrons [2].

Besides these problems, it was found in practice that for parallel computations of models containing more than 1000 electrons, and for computations on more than 5000 tasks, reading the orbital data or the previously stored configurations takes an extremely long time (30 to 60 minutes).

In the following sections of this dCSE report we present in detail the proposed solutions to the above problems and discuss the performance gains they bring to the code. Section [*] describes the algorithms used for shared OPO data and the results of the performance measurements for each algorithm. Sections [*] and [*] present the second-level parallelism algorithms and analyse their performance. Section [*] presents the improvements to the reading of the initial data.

The benchmark tests were run on the dual-core AMD Opteron processors used in phase I of HECToR and on the quad-core AMD Opteron processors currently in use in phase II of HECToR. The dual-core processor has the following technical specifications: 2.8 GHz clock rate, 6 GB of RAM, 64 KB L1 cache, 1 MB L2 cache, peak performance 11.2 Gflops in double precision. The quad-core processor has the following technical specifications: 2.3 GHz clock rate, 8 GB of RAM, 64 KB L1 cache, 512 KB L2 cache, 2 MB L3 cache (shared), peak performance close to 40 Gflops in double precision. The code was compiled with PGI v8.0.{2,6}.
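The quoted peak rates are consistent with the usual estimate: number of cores times clock rate times double-precision floating-point operations per cycle per core. The per-cycle figures used below (2 for the dual-core processor, 4 for the quad-core processor) are assumptions not stated in this report; they are the values commonly quoted for these processor generations.

$$ P_{\mathrm{dual}} = 2 \times 2.8\ \mathrm{GHz} \times 2 = 11.2\ \mathrm{Gflops}, \qquad P_{\mathrm{quad}} = 4 \times 2.3\ \mathrm{GHz} \times 4 = 36.8\ \mathrm{Gflops} $$

Under these assumptions the first value reproduces the 11.2 Gflops given for the dual-core processor, and the second corresponds to the "close to 40 Gflops" given for the quad-core processor.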

Figure: Illustration of the BC data access patterns for two tasks in the various algorithms. Each task computes, in serial mode, a list of configurations $C_{1}\dots C_{N_t}$ using the stored BC data: a) standard version: each task has its own BC data set; b) the BC data are located in a segment of shared memory; c) MPI two-sided version: the BC data are transferred using MPI two-sided communication; d) one-sided data transfers, which can be implemented with MPI or SHMEM.
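Pattern b) can be sketched in a few lines. The Python example below uses the standard `multiprocessing.shared_memory` module and, to stay self-contained, attaches the second handle from the same process instead of from a second task. The helper names and the sample coefficients are hypothetical, and the sketch says nothing about how CASINO itself implements the shared segment.

```python
import struct
from multiprocessing import shared_memory


def publish_coefficients(values):
    """Copy a list of floats into a new shared block; return the owning handle."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def read_coefficient(block_name, index):
    """Attach to an existing block by name, as a second task on the node would."""
    view = shared_memory.SharedMemory(name=block_name)
    try:
        # Read one double directly from the shared buffer (no private copy
        # of the whole data set is made).
        (value,) = struct.unpack_from("d", view.buf, 8 * index)
        return value
    finally:
        view.close()  # detach; the block itself stays alive


def demo():
    """Store one copy of some stand-in coefficients and read three of them."""
    coeffs = [0.5 * k for k in range(16)]  # hypothetical stand-in for BC data
    owner = publish_coefficients(coeffs)
    try:
        return [read_coefficient(owner.name, k) for k in (0, 3, 15)]
    finally:
        owner.close()
        owner.unlink()  # release the shared block
```

The point of the pattern is that the data set exists once per node and every task reads it in place, whereas in pattern a) each task would hold its own full copy.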