Abstract
The intrinsic timescale of correlated electron motion in matter is the attosecond timescale (1 attosecond = 10^{18}s) and just as femtosecond (1 femtosecond = 10^{15}s) laser technology offers us the means of accessing realtime nuclear dynamics, attosecond light pulse technology now promises us the revolutionary ability to measure and control the correlated motion of electrons in atomic systems on their own temporal scale. Recent groundbreaking ultrafast measurements demonstrate that a bottleneck in our understanding of correlated electron dynamics in atoms and molecules is the current state of theory. Correlated electron motion by its very definition is a quantum manybody process and so theoretical methods must go beyond the widely used singleparticle picture so far employed to good effect to describe ultrafast single electron processes. Currently, the most sophisticated method that goes beyond the single particle model is that underlying the HELIUM code. Inspired by the success of HELIUM at describing the correlated motion of electrons within twoelectron systems, the purpose of this dCSE project is to extend the capability of HELIUM to allow it to accurately describe the correlated electron dynamics occurring within complex atoms exposed to intense shortpulse light. This paper describes the underlying numerical methods and algorithms implemented to extend HELIUM into a powerful new Fortran 90/MPI code called RMT (RMatrix incorporating Time) which is unique in its capability to describe the manyelectron response of a general atom to intense shortpulse laser light. Illustrative results for single ionization of neon exposed to a combination of subfemtosecond XUV and 800 nm Ti:sapphire laser light fields, along with a discussion of the performance of the new RMT code on HECToR, are included.
Over the past 10 years there has developed a crucial need to solve directly and accurately the TimeDependent Schrödinger Equation (TDSE) describing the detailed response of manyelectron atoms and molecules to short, intense pulses of light. Such calculations are vital to get the best out of Attosecond Science [1, 2] (1 attosecond (attos) is 10^{18} seconds), the revolutionary goal of which is to track and control the correlated motion of electrons within matter on their own temporal scale. The advent of Attosecond Science is bringing about a revolution in both laser and electron collision physics. Experiments on the Attosecond TimeScale (ATS) are in progress in leading laboratories in the UK [3, 4]; in wider Europe [5, 6, 7]; in North America [8, 9, 10, 11]; in China [12] and in Japan [13, 14]. It is expected that nearfuture advances in Attosecond Light Pulse (ALP) technology [2] will have profound implications for fields as diverse as bioenergetics [15], molecular electronics [16] and even the treatment of cancer [17].
Many of the technological breakthroughs in Attosecond Science over the last decade have come about as a result of advances in theory [18] which now allow for a sound qualitative understanding of the Single Active Electron (SAE) response of an atom or molecule to intense femtosecond (1 femtosecond (fs) is 10^{15} seconds) laser pulses at short and long wavelengths. Indeed, in the long wavelength regime, advances in Ti:Sapphire laser technology [19] combined with an improved understanding of SAE phenomena such as High Harmonic Generation (HHG) [20] have resulted in the recent generation of extreme ultraviolet (XUV) light pulses that last a mere ~ 100 attoseconds in duration [21].
While SAEbased theories have been employed to good effect over the last decade, ALP technology is now demonstrating the revolutionary potential to induce and track ultrafast manyelectron dynamics within atoms and molecules [5, 7]. There is therefore huge importance in the development of tractable methods that are able to treat the correlated motion of electrons in complex atoms and molecules exposed to intense (ultra)short light pulses. Currently, the most sophisticated code for describing correlated electron dynamics within an atomic system is the HELIUM code [24, 25]. HELIUM is a pioneering code in the area of Attosecond Science. Not only has it made possible important scientific discovery (afterwards borne out by laboratory experiment) [25] but it has, in doing so, explored the feasibility of propagating accurate solutions of a multidimensional TDSE. The code is specifically constructed to describe both single and doubleelectron ionization of a twoelectron atom or ion exposed to intense laser light. The code’s continuing success undoubtedly lies in the fact that, from the start, it was designed around MPP architectures [24] (recognized as being absolutely essential to progress in this area of research). The method kernel to the code is a highorder Arnoldi timepropagator [22, 23] combined with Finite Difference (FD) grid methods. By choosing an appropriate distribution over processors HELIUM involves only nearest neighbour communications and this has enabled it to scale highly effectively to more than 16,000 cores on HECToR.
It is this established efficiency, along with the experimental interest in strongly timedependent atomic and molecular systems, that has inspired the designers of HELIUM to exploit its underlying numerical methods to treat complex manyelectron atoms exposed to intense (ultra)short light pulses. In particular, the underlying FD technology of the HELIUM code appears to be well suited for the accurate and efficient description of one or two electrons ejected from a manyelectron atom or molecule during the single or double ionization process. With this capability in mind, the designers of the HELIUM code have recently demonstrated the potential to carry over the FD technology of HELIUM into an accurate description of a general manyelectron atom exposed to intense light pulses. This carry over has been shown to be possible by invoking the powerful divisionofspace concept [26], central to the highly successful Rmatrix theory of atomic and molecular processes [27]. The divisionofspace concept, whereby the position space occupied by the atomic electrons is divided into two separate regions, not only makes possible the integration of FD technology into the description of the ultrafast ionization of manyelectron atoms, but also allows an efficient way of limiting the manyelectron representation of an atom to the only spatial region where it is absolutely necessary, i.e., a small region close to the nucleus. In this finite region it has been wellestablished that the Rmatrix basis set functions offer an accurate and tractable means of representing the manyelectron wave function [27]. Indeed, the combination of the Rmatrix basis set with FD grid techniques brought together with the powerful Arnoldi propagator of the HELIUM code provides the potential to efficiently and accurately describe the full response of a manyelectron atomic system to intense (ultra)short pulse light which has so far proven beyond the capability of all existing theoretical methods.
With this goal in mind, in early 2009 a fiveyear EPSRC Software Development Grant (the UKRAMP Project) was awarded to four collaborating institutions: Queen’s University Belfast (QUB), University College London (UCL), the Open University and STFC CSE, to build upon expertise in all the various areas of ab initio theoretical electron atom and molecule scattering and both timeindependent and timedependent laseratom (molecule) interactions. A major strand of the UKRAMP Project is the carryover of HELIUM code technology to enable the accurate description of the full response of atoms and molecules to intense (ultra)short laser pulses. The first major and vital algorithmic advance required in the pursuit of this goal is the construction of an ab initio code that combines HELIUM code and Rmatrix technology to describe the singleelectron ionization response of a general manyelectron atom exposed to intense (ultra)short light pulses.
The purpose of this dCSE project is to construct a unique application code that will allow for the accurate and efficient description of the single ionization response of a general manyelectron atom exposed to an intense (ultra)short laser light pulse. This project therefore forms the first vital advance in the pursuit of codes that are capable of describing the multiple ionization of atoms as well as the single ionization of molecules exposed to intense light pulses as set out in the UKRAMP Project. As a means of making this advance, the project significantly extends on the recent work carried out on interfacing FD methods with basis set methods for the description of the oneelectron hydrogen atom [26]. In extending the method into one that can accurately and efficiently describe a general manyelectron atom exposed to strong laser fields, three main tasks needed to be accomplished (over a 24 month period) which all required, as a starting point, significant serial code construction:
All of the tasks were completed within the specified timeframes.
This dCSE project has resulted in the construction of a unique application code, now known as the RMT code. The RMT code is a generalpurpose ab initio computer code for accurately describing the manyelectron response leading to single ionization of manyelectron atoms and atomic ions exposed to intense laser light pulses. The ab initio nature of the code allows for not only the description and understanding of future experimental measurements, but will also enable the prediction of physical phenomena occurring in intense laseratom interactions and consequently allow for scientific discovery that will point the way for future experimental investigations at laser labs and large scale facilities over the next decade, at least.
One of the unique features of the RMT code is that it has been designed specifically to be applicable to atoms exposed to wavelengths ranging right across the light spectrum, from the extreme ultraviolet (XUV) photon frequencies being produced at 4^{th} Generation light sources such as FLASH, the future European XFEL facility and the LCLS in the USA, to the infrared (IR) frequencies being produced at Ti:sapphirebased laser labs throughout the UK (e.g., Artemis at the STFC’s Central Laser Facility) and wider Europe (e.g.,‘LaserlabEurope’). Not only does the RMT code have flexibility over wavelength, but because of its genuinely timedependent nature it also has the flexibility to describe the response of atomic systems to both short and longduration light pulses. Moreover, because of the explicit representation of the light field, the code will also be able to support the widerange of experiments investigating the effects of temporal pulseshaping on the response of atoms to laser light, an important step towards future control mechanisms in fields such as photochemistry and molecular electronics.
The ability of the RMT code to describe the dynamics of general manyelectron atoms on the attosecond time scale will also be of immense benefit to all UK research groups developing and applying Attosecond Technology (STFC’s CLF, Imperial College London). The nearfuture promise of highintensity attosecond light pulses holds out the prospect for the realtime observation and control of charge transfer in molecular systems which is expected to have a strong impact on the future of molecular electronics [15] and possibly cancer research [17]. While the RMT code described here will be capable of providing deep insight into the dynamics of electrons in manyelectron atoms exposed to ultrashort pulses, the exciting and genuine potential to adapt the method to describe manyelectron dynamics within molecular systems will rely heavily on the advances that were made during this project.
As a necessity, the starting point for describing the ultrafast dynamics of electrons in an atom exposed to an intense (ultra)short light pulse is the Timedependent Schrödinger Equation (TDSE).
Neglecting relativistic effects, the behaviour of the N+1 electron atomic system in the presence of the laser field is governed by the TDSE
where the Hamiltonian is We have taken the origin of the coordinates to be in the nucleus, which we assume has infinite mass. We have r_{ij} = r_{i}  r_{j} where r_{i} and r_{j} are the vector coordinates of the i^{th} and j^{th} electrons and we have written X_{N+1} ≡ x_{1},x_{2},...,x_{N+1} where x_{i} ≡ r_{i}σ_{i}. Z indicates the nuclear charge and E(t) is the timedependent light field.
In the Rmatrix method [27] the position space occupied by the electrons is divided into two regions: An inner region (region I) surrounding the nucleus where a manyelectron wave function is constructed and manyelectron atomlaser Hamiltonian matrix elements are calculated explicitly, and an outer region (region II), chosen such that only one electron (or at most two) is present and the electron there, besides experiencing the laser field directly, is aware of the remainder of the atomic system only via longrange multipole interactions. Traditionally, Rmatrix theory is a theory where time is not explicitly involved in the study of the collision or photoionisation processes.
Figure 1 displays the division of manyelectron position space which underlies Rmatrix theory. In region I the timedependent wavefunction is expanded over the Rmatrix eigenstates of the fieldfree Hamiltonian. Region I is defined by all N + 1 electrons of the system having a radial coordinate r_{q} ≤ b, q = 1,...,N + 1. Region II is defined by N electrons having a radial coordinate r_{q} ≤ b, q = 1,...,N and one electron having its radial coordinate r_{N+1} > b. In region II the timedependent wavefunction is represented over a finitedifference grid by its values at equidistant grid points r_{N+1}(i) = ih,i = i_{b},i_{b} + 1,...,i_{R}.
Much of the power of the divisionofspace concept is derived from limiting the manyelectron representation of the atom to only the spatial region where it is absolutely necessary, i.e., region I, close to the nucleus of the atom. This means that the number of electrons in region II is limited to just one and hence the TDSE in region II is, for each state of the residualion, reduced in dimensionality to at most three, thus simplifying the computational problem considerably.
In region II, the timedependent N+1 electron wave function can therefore be expanded as Ψ(X_{N+1},t) = ∑ _{p}^{np}Φ_{p}F_{p}(r,t), where the F_{p} functions are singleelectron functions describing the radial motion of the ejected electron (in the p^{th}channel), and where the radial variable of the (N + 1)th electron is denoted as r_{N+1} = r. The Φ_{p} are channel functions formed by coupling the target states of the residual ionic system Φ_{p}(X_{N }) with angular and spin parts of the ejected electron wave function. Note that the time dependence of the wavefunction is contained within the radial functions F_{p}(r,t).
By projecting the known target functions Φ_{p} onto the TDSE and integrating over all spatial variables except the radial coordinate of the ejected electron, the following set of coupled homogeneous partial differential equations (PDEs) for the radial channel functions F_{p}(r,t) is obtained:
Eq.(3) is the evolution equation for the wave function in region II. In eq.(3), W_{E} is referred to as the long range potential in the Rmatrix literature and arises from the electronelectron and electronnuclear potential terms in the Hamiltonian. W_{D} arises from the interaction of the light field with the residual Nelectron ion. The W_{P } potential arises from the interaction of the light field with the ejected electron. H_{ti} is the timeindependent part of the Hamiltonian in region II.
This section describes a standalone code, TDOUTER, that was constructed to solve eq.(3).
Software requirements:
Design strategy:
The TDOUTER code was constructed on top of a welltested oneelectron version of the HELIUM code known as HYDRO. The HYDRO program uses the highly accurate and efficient Arnoldi propagator and FD techniques of HELIUM to propagate a single electron wave function forward in time. HYDRO has been written in Fortran 90 and has been parallelized using a similar strategy to that implemented in HELIUM using the Message Passing Interface (MPI). Since TDOUTER describes the ejected electron in the presence of a manyelectron atom it requires manyelectron atomic structure data to construct the W_{E},W_{D} and W_{P } potentials appearing eq. (3). This data is calculated using the RMATRX2 suite of programs [28] which writes the atomic data to two files, H and d. All modifications to the HYDRO program at this stage of implementation were made using the single core version of the HYDRO program.
Definition of tasks:
In order to construct the TDOUTER program two modifications to HYDRO were required:
HYDRO was designed to solve the singleelectron TDSE using pure FD techniques and therefore it operates over the spatial domain 0 ≤ r ≤ R. The TDOUTER program is required to solve an effective oneelectron TDSE and is also to be interfaced with the basis set techniques (used over the domain 0 ≤ r ≤ b) which underly the TDRA program. Therefore, a new FD grid needed to be laid so that TDOUTER operates over the domain, b ≤ r ≤ R, i.e., from the Rmatrix boundary outwards.
HYDRO was designed to solve the TDSE for the hydrogen atom and is therefore limited to describing a single state of a residual ion (which is the case for hydrogen where only a bare structureless proton is left behind by the ionizing electron). TDOUTER is designed to solve an effective oneelectron TDSE, where due to the manyelectron nature of region I, the ejected electron is coupled to multiple residual ion states via longrange multipole potentials (W_{E}(r) in eq.(3)). The ejected electron can couple to the residual ion states in multiple ways depending on the angular momentum of the ejected electron and the symmetry of the residual ion state, giving rise to the multiple channel functions F_{p}(r,t) of eq.(3). Moreover, these channels are coupled to each other via the timedependent W_{D}(t) and W_{P }(r,t) potentials. All three of the potential terms in the eq.(3) needed to be coded into TDOUTER.

Software implementation phase:
The FD grid was defined as a 1dimensional (1D) grid representing the radial variable r of the ejected electron. The grid was laid out, as shown in figure 2, so that grid points lie at the two boundaries, r = b and r = R. Since TDOUTER is required to compute the radial wave function flux pertaining to multiple states of the residual ion, the F vector as defined in HYDRO had its length increased to account for the extra channels formed by the multiple residual ion states.
The H_{ti} operator in eq.(3) contains the second derivate operator, which is recast in a 5point FD representation. However, the 5point FD rule when applied to the F_{p} functions at the first grid point, i=i_{b}, requires information on the value of the F_{p} functions at points i=i_{b}1 and i=i_{b}2 which lie inside region I, i.e., outside the domain, b ≤ r ≤ R. This can be appreciated if one considers the 5point rule for calculating F_{p} at the left boundary r = b:
where a_{i} are the usual FD parameters for the second derivative. Therefore, to enable the FD Hamiltonian to be applied to the F_{p} functions in region II, the value of the F_{p} functions at grid points i=i_{b}1 and i=i_{b}2 must be first calculated. This requirement poses no problem, so long as the region I wave function has a single electron character at r = b  h and r = b  2h. In practice this means using a region I radius, b, that is slightly larger than the radius one would typically choose in a timeindependent Rmatrix electron scattering calculation.Test phase:
To test TDOUTER, region I was made negligible in size by setting b = h. The points i=i_{b}1 and i=i_{b}2 then correspond to radial distances r = 0 and r = h respectively. This FD grid is the same as that used in HYDRO and therefore similar boundary conditions as are used in HYDRO could be used in TDOUTER to determine values of F_{p} at i=i_{b}1 and i=i_{b}2. The output from TDOUTER was compared with that from HYDRO and was shown to match exactly, thus verifying the accuracy of the TDOUTER code at this stage of construction.
Each residual ion state gives rise to a collection of channel functions, F_{p}, in region II which are coupled via the W_{E}, W_{D} and W_{P } potentials.
W_{E}(r): The spacedependent W_{E}(r) potential is constructed using the longrange multipole coefficients calculated using the RMATRX2 suite of programs [28]. These coefficients are read into TDOUTER from the H file written by RMATRX2.
W_{D}(t): The timedependent W_{D}(t) potential is constructed using the dipole transition elements between the eigenstates of the residual ion which are calculated using the RMATRX2 programs. These matrix elements are read into TDOUTER from the d file written by RMATRX2.
W_{P}(r,t): The time and spacedependent W_{P }(r,t) potential is coded using angular algebra. It takes into account the symmetries of the residual ion states and of the ejected electron, read from the H file written by RMATRX2.
Test phase:
Testing at this stage was not feasible as it relies on the interfacing of TDOUTER with TDRA.
In region I, the timedependent N+1 electron wave function Ψ(X_{N+1},t) is represented over an Rmatrix eigenbasis ψ_{k}(X_{N+1}) as Ψ(X_{N+1},t) = ∑ _{k}C_{k}(t)ψ_{k}(X_{N+1}) where r_{N+1} ≤ b and where C_{k}(t) are timedependent coefficients. The TDSE in the inner region is given by eq.(1). However, in region I, the Hamiltonian H_{N+1} is not Hermitian owing to the presence of the kinetic energy ∇_{i}^{2} term in eq.(2) and the finite value of the wave function on the innerregion boundary. Consequently a Bloch operator _{N+1} = ∑ _{i=1}^{N+1}δ(r_{i} b) is introduced which is such that H_{N+1} + _{N+1} is Hermitian in region I. The TDSE in region I may then be written as
where Ψ_{I}(X_{N+1},t) is the wave function defined over region I in figure 1. By projecting eq. (5) onto the eigenstates ψ_{k}(X_{N+1}) we obtain the evolution equations for the timedependent coefficients C_{k}(t):
wherewhere
At this point it should be emphasized that eq.(6) is fundamental to the RMT method in two ways. Firstly, the inhomogeneous S_{k} term on the righthandside compensates for the Bloch term introduced to make H_{I} Hermitian. Note that it makes a contribution only at r = b and brings into play there Ψ(X_{N+1},t), a wave function form which has been defined throughout both regions. This term is central to any time propagation scheme in region I because it connects the wave function form Ψ_{I}(X_{N+1},t), which is manyelectron in nature, with a wave function form that, at r = b, represents a single electron and which, numerically, is obtained from region II. Secondly, eq.(6) allows for two different numerical methods to be brought together, one (basis set) most appropriate to the manyelectron finite region I and the other (finite difference), most appropriate to the oneelectron region II. In this way eq.(6) is fundamental to future plans to adapt the method to manyelectron molecules.
Software requirements:
Design strategy:
Eq.(6) takes the form of an inhomogeneous TDSE which arises in many areas of quantum dynamics. However, it is only recently that attempts have been made at using highorder explicit methods to solve equations of the type given by eq.(6). One such highorder propagator can be devised by approximating the formal solution explicitly by a Taylor expansion [32]. This is the method used for the singleelectron implementation of the current method [26]. The approach taken in TDRA is to compute the values of the C_{k}(t + δt) coefficients in eq.(8) using an adapted “inhomogeneous” Arnoldi method that in many ways resembles the standard Arnoldi method used for solving the homogeneous TDSE. The full implementation of the adapted Arnoldi algorithm will be described later in section 2.7. In this section a method is first implemented which focuses on propagating the C_{k} coefficients without the driving inhomogeneous terms that will be calculated in region II. In this way, the level of accuracy of the standard homogeneous Arnoldi method combined with the Rmatrix basis set technology is first established.
Definition of tasks:
The construction of the TDRA program requires the implementation of a single task:
Unlike the TDOUTER code, the TDRA program needed to be written from scratch. The first stage in developing TDRA was to construct the manyelectron timedependent Hamiltonian whose matrix elements are calculated by the RMATRX2 codes. The second stage was to code the standard Arnoldi method to timepropagate the manyelectron wave function. Once constructed, the timedependent Hamiltonian is used to timepropagate the manyelectron wave function using the standard Arnoldi method.
Implementation phase:
The ‘starter’ data written to file by RMATRX2 codes is needed to construct the Hamiltonian, H_{I} . The matrix, H_{I}, has a blocktridiagonal structure, with the diagonal blocks containing the fieldfree Hamiltonian for each symmetry of the manyelectron atom (where a symmetry is defined in terms of the total angular momentum, L, the total spin, S, and the total parity, π, of the N+1 electron states). The eigenenergies, E_{kk′}, of the atom making up these diagonal blocks are read from the H file written by RMATRX2. The offdiagonal blocks contain the dipole couplings between the different angular momenta. The dipole matrix elements, D_{kk′}, making up these offdiagonal blocks are read from the d file, also written by the RMATRX2 programs.
As a starting point to solving eq.(8) via an Arnoldibased method, the second term on the righthandside of eq. (8) was set to zero to neglect the inhomogeneous terms and to effectively constrain the laseratom problem purely to region I. This poses no problem so long as region I is set large enough to fully contain the wave function throughout the propagation. This allowed the coding of the Arnoldi algorithm propagating the C_{k} from t to t+δt to be extensively tested in the absence of region II.
An Arnoldi/Lanczos algorithm was coded to propagate the C_{k} coefficients forward in time using the Hamiltonian matrix, H_{I}. Because the dimensions of the tridiagonal blocks making up H_{I} are well defined by the output of the RMATRX2 programs, the kernel matrixvector multiply routine of the Arnoldi algorithm operates only on the nonzero elements of H_{I}. All of the resulting CPUintensive dense matrix algebra of this ArnMatrixVectorMultiply routine was cast into tuned BLAS library routines. The Arnoldi algorithm transforms H_{I} into the Krylov subspace Hamiltonian, h, which is then exponentiated. Two optional routines for exponentiation were coded, one making use of LAPACK subroutines for diagonalization (and QRfactorization) and the other making use of a freely available Padé subroutine [33].
Test phase:
The algorithm was tested rigorously by calculating the twophoton ionization of Ne. The decay in the population of the ground state of Ne was calculated during its interaction with a 6cycle laser pulse with frequencies in the XUV range. Excellent agreement was found when results were compared to those calculated using a separate norder Taylor Propagator.
The construction of the RMT program requires the implementation of three tasks:
Timepropagation of the manyelectron wave function in region I can only begin once higherorder timederivatives of the wave function in region II (at r = b) are evaluated.
Timepropagation of the effective singleelectron wave function in region II can only begin once the wave function on FD points within region I are evaluated.
The Arnoldi algorithm in region I needs to be significantly extended in order to allow for propagation of the inhomogeneous surface terms in eq.(6).
Implementation phase:
In order to construct the U_{j} vectors in eq.(8) the radial derivatives of the F_{p} functions at r = b need to be evaluated. Only a small amount of extra coding in TDOUTER was needed to compute these derivatives using a 5point FD rule.
As already pointed out in section 2.4, the 5point FD rule when applied to the F_{p} functions at the first grid point, i = i_{b}, requires information on the value of the F_{p} functions at points i = i_{b}  1 and i = i_{b}  2 which lie inside the region I boundary.
In order to construct the wave function in region I at these two FD points, the eigenvectors associated with each Rmatrix eigenstate are required. This data is usually not written to file by the RMATRX2 suite of programs, so slight modifications were required to write all eigenvectors to a file named splinewaves. This file is then read by the TDRA program and used in the construction of the FD wave function within region I. The only other data needed to construct the F_{p} inside the boundary, b, are the timedependent coefficients C_{k} and the Bspline basis functions. The same subroutines employed to generate the Bspline functions in the RMATRX2 codes were used in TDRA. At each timestep, the evaluation of the F_{p} functions at grid points within region I amounts to a single call to a matrixvector multiply routine, WVMatrixVectorMultiply.
Test phase:
The TDRA code at this stage was again tested in the absence of region II. The wave function within region I was evaluated on equidistant grid points that spanned the whole of region I, so that the wave function could be analyzed and compared to results produced by an independent basisset approach.
For the calculation of the second term on the righthandside of eq.(8) an Arnoldibased method was implemented to calculate the action of the ϕ_{j}(iδtH_{I}) functions on the U_{j}(t) vectors passed from TDOUTER. The method is similar to that used to calculate the action of e^{iδtHI} on C(t):
The advantage of this formulation is that since the matrix h_{j} has size order n (where n is typically 1214) the evaluation of ϕ_{j}(iδth_{j}) is much cheaper than that of ϕ_{j}(iδtH_{I}).
In recent years, there has been considerable effort made at developing efficient and accurate methods for computing the ϕ_{j}(z) functions appearing in eqs.(8) and (2.5) [31]. In the RMT program, the calculation of the ϕ_{j}(z) functions is tackled by following an idea set out in [34] and generalised in [33] whereby the reduced matrix h_{j} is augmented to form a larger matrix given by
The procedure outlined above is repeated for each of the ϕ_{j}(iδtH_{I})U_{j} terms in the summation on the righthandside of eq.(8). Summing these n_{j} terms and adding to the first term on the righthandside of eq.(8) provides a means of propagating the wave function in region I forward one step in time.
Test phase for tasks C and D:
The accuracy of the adapted Arnoldi method for calculating the action of the ϕ_{j}(iδtH_{I}) functions on the U_{j}(t) vectors was verified by mocking up U_{j}(t) vectors and comparing results to those calculated using direct methods which include explicit inversion of H_{I}. Excellent agreement was found between the two independent methods.
By this stage the single core version of the RMT program was fully constructed. The manyelectron wave function is known at time t + δt throughout regions I and II and further propagation in time can progress by repeating, for successive time steps δt, the procedures described in the previous sections.

Once the single core version of the RMT program was fully constructed, its accuracy was verified by using it to investigate electron wavepackets ejected from Ne (which has 10 electrons) irradiated by an XUV laser pulse. All calculations at this stage were performed on local machines at QUB. Results were compared to those calculated using the TDRA program in which there is no division of position space occurring. The parameters used for these first RMT test calculations can be found in the ‘Ne set A’ list in the appendix of the report. The TDRA calculations used the same parameters except that, because there is no region II included, the boundary of region I was increased to 100 a.u. and the number of Bsplines per angular momentum, l, increased to 100. The partial continuum wave functions of Ne calculated after the end of the laser pulse using the RMT method were compared to those calculated using the TDRA method. The same test calculations were carried out using ’Ne set B’ as input so that the continuum wave functions coupled to an excited residual ion state could be compared. This comparison of timedependent wave functions obtained using the mixed basis/finitedifference RMT method with those obtained using a pure basis set method represents one of the most stringent tests of accuracy for the new RMT method. The excellent agreement between the results of the two independent methods helps verify the accuracy and reliability of the RMT program.

As a second means of demonstrating the accuracy of the RMT method, singleelectron ionization rates were calculated for He and Ne irradiated by a laser pulse with a central wavelength of 248 nm (corresponding to the fundamental wavelength of the KrF laser). These rates were compared with rates obtained using the timeindependent Rmatrix Floquet (RMF) method and, in the case of He, also with those obtained using the HELIUM code. For the He calculations the parameter set ’He’ was used and for the Ne calculations ’Ne set A’ was again used. These parameter sets can be found in the appendix. Table 1 shows a comparison of He singleelectron ionization rates for three intensities. Away from resonances, the ionization rates calculated by RMT agree well (to within 10%) with those calculated by HELIUM [35] and by the RMF method [36]. Figure 3 shows the comparison of Ne singleelectron ionization rates. Agreement between the two sets of results is very good, typically within 10% of each other away from resonance.
This section describes the implementation of a parallel RMT program employing MPI together with tuned BLAS and LAPACK library routines. In the parallel RMT program processes are assigned to two functional groups: 1) the TDRA group of processes calculates the wave function in region I at each timestep, and 2) the TDOUTER group of processes calculates the wave function in region II at each timestep. Twoway exchange of data between the two groups occurs via synchronous pointtopoint communication between the master cores in each of the groups. The approach scales well provided computational load is adequately balanced across the two groups. Results are stored to disk periodically for analysis and to provide a restart procedure.
The TDOUTER group is parallelized by assigning a number of grid points to each core in the group, so that the first core handles the first X_{Last} grid points, the second core the next X_{Last} grid points and so on. Each compute core in the TDOUTER group is thus responsible for a wavefunction array of dimension n_{p} × X_{Last} where n_{p} is the number of channels retained in the calculation.
Because the derivative operators in H_{ti} have been replaced with 5point FD operators, communication between cores in calculating H_{ti}F is limited to nearest neighbour communication only.
Parallelization within the TDRA group has focused on the two most CPUintensive routines of the serial RMT code. These are the ArnMatrixVectorMultiply subroutine of the inhomogeneous Arnoldi algorithm and the WVMatrixVectorMultiply subroutine called during the evaluation of the wave function on the FD points within region I.
ParallelArnMatrixVectorMultiply
Unlike the FD Hamiltonian in region II, the elements of the blocktridiagonal Hamiltonian matrix, H_{I}, in region I are not calculated on the fly which means that options for parallelizing the matrixvector kernel of the inhomogeneous Arnoldi algorithm are more restrictive. Instead all matrix elements are read from files written to disk by RMATRX2. At startup, the master core in the TDRA functional group is the only core that reads the files written by RMATRX2. Once all the H_{I} matrix elements have been read, the strategy used here is to map the block rows, along with the corresponding segment of the U_{j} vectors being multiplied into, onto a group of cores that are from here on referred to as Block Master Cores (BMCs). This already places constraints on the minimum number of cores needed to run the program as at least one core should be available for each LSπ symmetry block. Once distributed, two dense dipole matrix blocks of size mnp1_{symi} × mnp1_{symj} and a 1D array of eigenvalues of length mnp1_{symi} sit on each of the BMCs. At this coarse level of distribution, the ParallelArnParallelMatrixVector subroutine can already be implemented using BLAS library routines. Loadbalancing within the TDRA group is accomplished by ensuring that each block row is assigned the same number of cores and by forcing mnp1_{symi} = mnp1_{symj} by padding the symmetry blocks with zeros (for most input data sets mnp1_{symi} ≈ mnp1_{symj}). Increasing the number of cores within the TDRA group results in multiple cores being assigned to each block row of H_{I}. Due to the offdiagonal block structure of H_{I} twoway exchange of arrays of length mnp1_{symi} occurs between each BMC during each call to ParallelArnMatrixVectorMultiply.
ParallelWVMatrixVectorMultiply
At startup the eigenvectors, V _{k}, associated with each eigenstate, k, of H_{I} are distributed to the BMCs. As soon as the compute cores in the TDRA group evaluate their updated segment of C_{k}(t + δt), an MPI GATHERV is called so that the C_{k}(t + δt) are collected onto the BMCs. The BMCs subsequently call the ParallelWVMatrixVectorMultiply subroutine to calculate the F_{p} value required at each FD point within region I. The wave function values are then gathered onto the master core in the TDRA group to be sent to the master core in the TDOUTER group.
During propagation, communication between TDRA and TDOUTER occurs between the master core in TDRA (MCR1) and the master core in TDOUTER (MCR2). The communication is based on synchronous MPI SEND/RECV calls. The communication calls occur in the following sequence:
At the start of each timestep, MCR1 sends to MCR2 an array, F_{in}, containing the radial wave function at time t at FD points within the Rmatrix boundary. Upon MCR2 receiving this array, propagation over one timestep in TDOUTER can proceed. Therefore, within a single timestep one SEND/RECV call is made between MCR1 and MCR2 for communication from TDRA to TDOUTER.
Within a single timestep, MCR2 calculates a U_{j} vector for each order j, where 0 < j ≤ j_{max}. Once MCR2 has finished calculating a given U_{j} vector, it sends the vector to MCR1. Therefore, the number of SEND/RECV calls made between MCR2 and MCR1 within a single timestep for communication from TDOUTER to TDRA is j_{max} (typically j_{max} = 1214). Upon MCR1 receiving a U_{j} vector, it can proceed with calculating the corresponding ϕ_{j} vector. Once all of the ϕ_{j} vectors have been calculated, the C coefficients at t + δt can be calculated.
The above sequence of calls is repeated for each timestep of the propagation.
Loadbalance between TDRA and TDOUTER
For efficient scaling of the RMT code, computational load needs to be well balanced across the two functional groups TDRA and TDOUTER.
To minimize idle time in TDRA, MCR2 needs to send the U_{j} vectors to MCR1 in synchronization with the calculation of the ϕ_{j} vectors – As soon as ϕ_{j1} is calculated, MCR1 is waiting to receive U_{j} from MCR2 in order to proceed with the calculation of ϕ_{j}.
For a given problem size, there are only two adjustable parameters in the RMT code that can influence load balance between the TDRA and TDOUTER groups. The first is the number of cores assigned to the TDRA group, nc_{tdra}. The second is the number of FD points allocated to each core in TDOUTER, X_{Last}. If the values of nc_{tdra} and X_{Last} are chosen too large the amount of time spent idle on each core within the TDRA functional group will increase. If nc_{tdra} or X_{Last} are chosen too small the amount of idle time within the TDOUTER functional group will increase. The optimal values for nc_{tdra} and X_{Last} can be found by using timing routines in TDRA and TDOUTER along with using the CRAYPAT profiling tool.
The RMT code was ported to HECToR in January 2011. An RMATRX2 suite of codes was also ported to HECToR at the same time so that relevant ‘starter’ data could be generated. The results shown here were produced by compiling the RMT code with the PGI Fortan 90 compiler. The CRAYPAT profiler was used to analyze the initial performance of the RMT code on HECToR.
Comparison with single core version of RMT:
The first runs of the parallel RMT code used the ’Ne set A’ and ’Ne set B’ parameter lists as input. Results were shown to be in excellent agreement with the results produced employing the serial version of the RMT code, run both on the QUB machines and a single core of the XT6. The output of the parallel RMT code was found to agree with the output of the serial version of the code to the level of machine precision.
Scalability:
To test the scalability of the RMT code, several test runs were performed on various numbers of cores on the HECToR XT6. The maximum number of cores allocated to the functional group, TDOUTER, during this initial test phase was 384. The maximum number of cores allocated to the functional group, TDRA, was 48, giving 432 as the maximum total number of cores allocated to a run of the RMT code. All of the scaling tests were carried out using the ’Ne set C’ parameter list in the appendix which to date represents the largest problem size used as input to the RMT program. With this set of parameters, there are 17 residual ion states (resulting in 687 coupled channels) included in the calculation with a H_{I} dimension of ~35000 × 35000.
Table 2 shows how the RMT code scales with increasing number of cores allocated to the TDOUTER functional group. The value of X_{Last} is held fixed, so that an increase in the number of cores results in a linear increase in the volume of integration in region II. The time per iteration is shown for two values of X_{Last}, where increasing X_{Last} by a factor of ~ 4 from 150 to 600 results in a corresponding factor of 4 increase in the time taken per iteration. For both sets of calculations, 24 cores were allocated to the TDRA functional group. It can be clearly seen that weak scaling with up to 384 cores in the TDOUTER group is good on the XT6.

Figure 4 shows how the RMT code scales with increasing number of cores allocated to the TDOUTER functional group for a fixed total number of grid points. The calculations were carried out using ’Ne set C’ as an input parameter set, where R is fixed at 7200 a.u. – with a FD grid spacing of 0.2 a.u., this fixes the total number of grid points to 36000. Figure 4 demonstrates that as the number of cores allocated to the TDOUTER group is doubled, the number of grid points per core, (X_{Last}) is halved. The first scaling analysis was carried out with 24 cores assigned to the TDRA group. In this case, with 384 cores assigned to the TDOUTER group speedup is seen to diverge from ideal. This is because the time spent idle on the cores in the TDOUTER group has increased, whilst there is little idle time within the TDRA group. Once 48 cores are assigned to the TDRA group, the amount of idle time in the TDOUTER group decreases and speedup is seen to return to ideal.
Loadbalancing:
By carrying out a loadbalancing analysis using timing routines and CRAYPAT, an optimum value of X_{Last} can be found for a given number of cores allocated to the TDRA functional group. Using ’Ne set B’ input data, the optimal value of X_{Last} for 24 (48) cores allocated to TDRA was found to be 144 (89).

In order to demonstrate the capability of the RMT code, calculations were carried out to study the interaction of Ne with an XUV attosecond light pulse in the presence of a 800 nm Ti:Sapphire laser field. Such an interaction has recently been studied experimentally where complex manyelectron effects within Ne are suspected to be the reason for an unexpected time delay between the escape of a 2p electron and a 2s electron. Using “attosecond streaking” methods, the experiment has measured this time delay to be 21 attoseconds. So far, no theoretical method has been able to provide a value for the timedelay that even comes close to matching the experimental value. Initial calculations of the time delay demonstrate that such precise phasesensitive information can be calculated with the new RMT program.
In the RMT calculation the 800nm field is a 3 cycle pulse of peak intensity 10^{13} W/cm^{2}. The 91 eV XUV pulse is a 14 cycle pulse also of peak intensity 10^{13} W/cm^{2}. Figure 5 shows the 2D momentum distributions of ejected electron wavepackets from Ne for two different delays between the XUV and the IR fields. The XUV field is at its peak in figure 5(a) at 1.25 cycles into the IR field, and in figure5(b) at 1.75 cycles into the IR field. In (a) the IR field imparts, to the ionizing electron wavepacket, positive momentum in the zdirection, while in (b) it imparts a negative momentum in the zdirection. Consequently the k_{x}k_{z} momentum distribution is shifted upwards in (a) and downwards in (b) as compared to the distribution symmetric about the k_{z}=0 axis obtained when there is no IR field present.
Both (a) and (b) figures show contributions to the momentum distribution from electrons liberated from the 2p subshell (outer circle) and from the 2s subshell (inner circle). It is worth noting that the due to the RMT code’s ability to explicitly include multiple residual ion states in the calculation, it is one of the few methods worldwide that is capable of describing the momentum distributions of the ejected 2p and 2s electrons in a single calculation. By performing further calculations with various delays between the fields it is observed that the inner and outer circles oscillate between their positions in (a) and (b). Analysis of these oscillations reveals an ultrashort time delay in photoemission from the two valence states (2p and 2s). It is the intention of the authors to present full details of these exciting results at the XXVII International Conference on Photonic Electronic and Atomic Collisions (ICPEAC) 2011 conference and subsequently to submit results for publication in a peer reviewed journal.
This report has described the construction of a powerful new ab initio RMT code that has the unique capability to accurately and efficiently describe the single ionization response of a general manyelectron atom exposed to intense laser light pulses. The construction of the RMT code has been made possible by implementing the numerical methods and algorithms underlying two wellestablished ab initio codes, namely HELIUM and the RMATRX2 suite of codes. The RMT code has immense flexibility since it can not only describe the response of a general manyelectron atom to intense laser light fields with wavelengths ranging throughout the spectrum, from the extreme ultraviolet to the infrared, but also to both short and longduration pulses of such light.
The construction of the RMT code constitutes a major algorithmic advance that represents a vital first step in the development of a method that is capable of accurately describing the single ionization response of a manyelectron molecule to intense laser light and later, the doubleelectron ionization response of a manyelectron atom exposed to such light.
The accuracy of the new code has been verified by using it to calculate single ionization rates for He and Ne irradiated by a KrF laser pulse. In the limit of manycycle laser pulses, a regime in which Rmatrix Floquet (RMF) theory is valid, results from the RMT code and RMF code have shown excellent agreement. In the case of He, RMT code results were also shown to be in excellent agreement with those produced by the HELIUM code.
A description of the parallelization of the RMT code and its porting to the HECToR Cray XT6 has been provided. Running the parallel RMT code on the XT6 machine has already revealed the code’s potential to accurately describe the effects of highly complex correlated electron dynamics on the ultrafast photoemission of electrons from manyelectron atoms when exposed to a combination of XUV and 800 nm laser light.
The report also provides a performance and loadbalancing analysis of the new code on the XT6. The RMT code has been shown to scale well, provided computational load is balanced between the TDRA and TDOUTER functional groups working on region I and region II of the electronic position space.
However, modifications to the code need to be made in order to maximize performance on the current generation of highend computing machines such as the HECToR Cray XT6, typically containing many thousands of multicore processors. Improvements to parallel scaling performance, single node efficiency and memory usage will enable very large laseratom and laserion calculations to be addressed on these machines. Such largescale calculations are essential to complement the sophisticated laboratorybased experiments being carried out around the world in the field of Attosecond Science. In particular, effective load balancing between the TDRA and TDOUTER functional groups will be paramount to the future use of the RMT method in the investigations of the double ionization of atoms to intense laser fields. The optimization of the RMT code for use on HECToR will be the focus of an upcoming 6month dCSE project.
The authors would like to thank the following people for their contributions to this work:
Prof. K. T. Taylor, Queen’s University Belfast
Dr. J. S. Parker, Queen’s University Belfast
Dr. H. W. van der Hart, Queen’s University Belfast
Dr. L. A. A. Nikolopoulos, Dublin City University
Dr. M. Plummer, STFC Daresbury Laboratory
This project was funded for 24 months under the HECToR distributed CSE (dCSE) programme, which is provided through The Numerical Algorithms Group (NAG) Ltd. LRM was funded for the first 6 months of the project to work on the FDOUTER program. MAL was funded for the remaining 18 months of the project.

[1] P. Corkum and F. Krausz Nat. Phys. 3, 381 (2007)
[2] F. Krausz and M. Ivanov Rev. Mod. Phys. 81, 163 (2009)
[3] S. Baker et al Science 312, 424 (2006)
[4] L. E. Chipperfield et al Phys. Rev. Lett. 102, 063003 (2009)
[5] M. Uiberacker et al Nature 446, 627 (2007)
[6] E. Goulielmakis et al Nature 466, 739 (2010)
[7] M. Schultze et al Science 328, 1658 (2010)
[8] N. Dudovich et al Nature Physics 2, 781 (2006)
[9] H. J. Wörner et al Nature 446, 604 (2010)
[10] K. P. Singh et al Phys. Rev. Lett. 104, 023001 (2010)
[11] H. Wang et al Phys. Rev. Lett. 105, 143002 (2010)
[12] Y. Zeng et al Phys. Rev. Lett.103, 043904 (2010)
[13] Y. Nabekawa et al Phys. Rev. Lett.102, 213904 (2009)
[14] E. Takahashi et al Phys. Rev. Lett.104, 233901 (2010)
[15] J. VuraWeis et al Science 328, 1547 (2010)
[16] M. Stockman et al Nature Photonics 1, 539 (2007)
[17] H. B. Gray and J. Halpern Proc. Nat. Ac. Sci. 102, 3533 (2005)
[18] P. Corkum Phys. Rev. Lett. 71, 1994 (1993)
[19] T. Brabec and F. Krausz Rev. Mod. Phys 72, 545 (2000)
[20] M. Lewenstein et al Phys. Rev. A 49, 2117 (1994)
[21] G. Sansone et al Science 314, 443 (2006)
[22] W.E. Arnoldi Quart. Appl. Math. 9, 17 (1951)
[23] T.J. Park, J.C. Light J. Chem. Phys. 85, 5870 (1986)
[24] E.S. Smyth, J.S. Parker, K.T. Taylor Comput. Phys. Commun. 114, 1 (1998)
[25] J.S. Parker et al Phys. Rev. Lett. 96, 133001 (2006)
[26] L.A.A. Nikolopoulos, J.S. Parker, K.T. Taylor Phys. Rev. A 78, 063420 (2008)
[27] P.G. Burke, K.A. Berrington Atomic and Molecular Processes: An Rmatrix Approach, (IOP Publishing, Bristol) (1993)
[28] P.G. Burke, V.M. Burke, K.Dunseath J. Phys. B: At. Mol. Opt. Phys. 27, 5341 (1994)
[29] A. G. Sunderland et al Comp. Phys. Commun. 145, 311 (2002)
[30] A. G. Sunderland, C. J. Noble and M. Plummer dCSE Technical Report (2010)
[31] B. Skaflestad, W.M. Wright Appl. Numer. Math. 59, 783 (2009)
[32] M. Ndong, H. TalEzer, R. Kosloff. C.P. Koch J. Chem. Phys. 130, 124108 (2009)
[33] R. Sidje ACM Trans. on Math. Soft. 24, 130 (1998)
[34] Y. Saad SIAM J. Numer. Anal. 29, 209 (1992)
[35] J.S. Parker et al J. Phys. B: At. Mol. Opt. Phys. 33, L239 (2000)
[36] D.H. Glass, P.G. Burke J. Phys. B: At. Mol. Opt. Phys. 32, 407 (1999)
[37] P.G. Burke and K. T. Taylor J. Phys. B: At. Mol. Opt. Phys. 8, 2620 (1975)