Good Practice Guide

IO

Introduction

Performing IO properly becomes as important as having efficient numerical algorithms when running large-scale computations on supercomputers. A challenge facing programmers is to understand the capabilities of the system (including its hardware features) in order to apply suitable IO techniques in their applications. For example, most operating systems use buffers to temporarily store data to be read from or written to disk. This decouples the application from the IO device, allowing read/write operations to be optimised; as a consequence it is often more efficient to read or write large chunks of data. Another example is that most supercomputers (including HECToR) have parallel filesystems, on which a single file can be spread over multiple physical disks and accessed in parallel, a technique called striping. When data is striped it is typically preferable to work with a small number of large files rather than a large number of small files.
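
To illustrate the point about transfer sizes, the following is a minimal sketch in C (the file name, array size and data values are purely illustrative) contrasting element-by-element writes with a single large write of the same array.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* about 8 MB of doubles; size is arbitrary */

int main(void)
{
    double *data = malloc(N * sizeof(double));
    if (data == NULL) return 1;
    for (int i = 0; i < N; i++) data[i] = (double)i;

    FILE *fp = fopen("output.dat", "wb");   /* hypothetical file name */
    if (fp == NULL) { free(data); return 1; }

    /* Inefficient: one library call per element, and many small transfers
       once the stdio buffer fills:
           for (int i = 0; i < N; i++) fwrite(&data[i], sizeof(double), 1, fp);
    */

    /* Preferred: a single large request which the OS and filesystem can
       handle as one big, contiguous transfer. */
    fwrite(data, sizeof(double), N, fp);

    fclose(fp);
    free(data);
    return 0;
}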

This guide covers some good practices regarding input/output from/to disk. The aim is to improve awareness of basic serial and parallel IO performance issues on HECToR. In the following sections, a variety of practical tips are introduced to help you use HECToR's IO system effectively, so that you can manage large data sets properly.

Data Management

Sharing data with other HECToR users

User account sharing is not allowed on HECToR. One recommended way to share your files with your colleagues and collaborators is to use the /esfs2/transfer directory, which is globally readable by all users.

Transferring data to and from HECToR

Secure-shell based programs (such as 'scp') are often the preferred tools to transfer data to and from HECToR. If you need to move a large amount of data then bbFTP is often more efficient. Before moving large amounts of data (in particular ASCII data), it may be beneficial to archive and compress the data first using utilities such as 'tar', 'gzip' and 'bzip2'. Please remember to use the serial queue for time-consuming jobs to avoid overburdening the login nodes.

Data sharing with other computer systems

Data backup

Please note that data generated by your applications on the /work file system is not backed up. Although the likelihood of losing data is quite low, it is a good practice to regularly transfer important data to your local storage or use the Research Data Facility.

Serial IO

Note that some of the tips in this section can be applied to the majority of computer systems, from desktops to supercomputers.

Writing to Standard Output

On HECToR, standard output is by default written to a shared system location, so writing large amounts of data to it may affect other users. Consider using it for debugging only, and redirect standard output from your job script to a file, e.g. aprun [parameters] executable > screen_output

Opening and closing files

Opening and closing files involves a considerable system overhead, so avoid unnecessary open and close operations: keep a file open while it is being accessed frequently, and only open files when they are actually needed.

As a practical example, an application might write a small amount of information regularly to a log file as computation progresses and write a large amount of scientific data to a data file occasionally. Clearly the former file can remain open and the latter should be opened and closed as appropriate.
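
This pattern can be sketched as follows; the file names, the compute_step() routine and the snapshot interval are hypothetical. The log file is opened once and kept open for the whole run, while the large data file is opened and closed only when a snapshot is actually written.

#include <stdio.h>

/* Placeholder for the application's real computation (hypothetical). */
static void compute_step(int step, double *data, int n)
{
    for (int i = 0; i < n; i++) data[i] = step + i;
}

int main(void)
{
    enum { N = 1000, NSTEPS = 100, SNAPSHOT_EVERY = 25 };
    double data[N];

    FILE *logfile = fopen("run.log", "w");      /* opened once, kept open */
    if (logfile == NULL) return 1;

    for (int step = 0; step < NSTEPS; step++) {
        compute_step(step, data, N);
        fprintf(logfile, "step %d complete\n", step);   /* small, frequent writes */

        if (step % SNAPSHOT_EVERY == 0) {
            char fname[64];
            snprintf(fname, sizeof fname, "step_%04d.dat", step);
            FILE *out = fopen(fname, "wb");      /* opened only when needed */
            if (out == NULL) { fclose(logfile); return 1; }
            fwrite(data, sizeof(double), N, out);   /* large, occasional write */
            fclose(out);
        }
    }

    fclose(logfile);
    return 0;
}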

Sizes of read/writes

Random access files

In some cases it may be desirable to move through a direct-access file reading only a subset of the data, in particular when the data is arranged in fixed-length records. Records may be written to a direct-access file in any order, but careful consideration should be given to the ordering chosen in order to minimise file pointer movement.
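
A minimal sketch of this kind of access in C is given below, assuming a hypothetical file 'records.dat' made up of fixed-length records of REC_LEN doubles. Reading the required records in ascending order keeps the file pointer moving forwards only.

#include <stdio.h>

#define REC_LEN 128   /* doubles per record (assumed record length) */

/* Read record number 'rec' from a direct-access file of fixed-length records. */
static int read_record(FILE *fp, long rec, double *buf)
{
    if (fseek(fp, rec * (long)(REC_LEN * sizeof(double)), SEEK_SET) != 0)
        return -1;
    return (fread(buf, sizeof(double), REC_LEN, fp) == REC_LEN) ? 0 : -1;
}

int main(void)
{
    double buf[REC_LEN];
    FILE *fp = fopen("records.dat", "rb");   /* hypothetical file name */
    if (fp == NULL) return 1;

    /* Read only the records that are needed; visiting them in increasing
       order keeps the file pointer moving forwards only. */
    long wanted[] = {2, 7, 11};
    for (int i = 0; i < 3; i++)
        if (read_record(fp, wanted[i], buf) != 0) { fclose(fp); return 1; }

    fclose(fp);
    return 0;
}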

Parallel IO

Parallel IO patterns

Multiple writes to different files

This is perhaps one of the simplest ways to write in parallel: each process writes its own file, with distinct file names generated from a unique identifier such as the MPI rank.
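
A minimal sketch of this pattern using MPI and standard C IO is shown below; the file name prefix, the local array and its size are illustrative only.

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 100   /* size of each rank's local data (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double data[NLOCAL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) data[i] = rank + 0.01 * i;  /* local data */

    /* Generate a unique file name from the MPI rank. */
    char fname[64];
    snprintf(fname, sizeof fname, "out_%06d.dat", rank);

    /* Every process writes independently; no coordination is needed. */
    FILE *fp = fopen(fname, "wb");
    if (fp != NULL) {
        fwrite(data, sizeof(double), NLOCAL, fp);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}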

Advantages

Disadvantages

This approach is best used when only a small number of files are involved and the files will only be used by the parallel application on the same system (for example, restart files). It can be highly effective on systems with local disks attached to each node.

Multiple writes to the same file

In this approach, the user application controls where in the overall file structure each process writes its data. This can be achieved using direct-access files or through a higher-level library such as MPI-IO.
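
As an illustration, the following minimal sketch uses MPI-IO explicit offsets so that each process writes its block to a disjoint region of one shared file; the file name and the block size NLOCAL are illustrative only.

#include <mpi.h>

#define NLOCAL 100   /* doubles written by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double data[NLOCAL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) data[i] = rank;

    /* All processes open the same file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its block at an offset determined by its rank,
       so no two ranks touch the same region of the file. */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at(fh, offset, data, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}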

Advantages

Disadvantages

This approach works well with a small number of processes but may fail to scale to thousands.

Single writer

This is the model in which parallel programs define a single 'master process' to collect data from all other processes for IO. This can be implemented easily using MPI message passing.
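
A minimal sketch of this model is given below: rank 0 gathers every rank's local block with MPI_Gather and then writes a single file. The file name, block size and data values are illustrative only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 100   /* doubles held by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double local[NLOCAL];
    double *global = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NLOCAL; i++) local[i] = rank;

    /* Only the master process needs the full buffer. */
    if (rank == 0)
        global = malloc((size_t)size * NLOCAL * sizeof(double));

    /* Collect every rank's block on the master process. */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               global, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* The master alone performs the IO. */
    if (rank == 0) {
        FILE *fp = fopen("all_data.dat", "wb");
        if (fp != NULL) {
            fwrite(global, sizeof(double), (size_t)size * NLOCAL, fp);
            fclose(fp);
        }
        free(global);
    }

    MPI_Finalize();
    return 0;
}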

Advantages

Disadvantages

Multiple Writer

This is an extension of the single writer model in which several processes act as local masters, each collecting and writing the data sent by its own group of processes, either to separate files or to different parts of a single file.
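
A minimal sketch of this model is given below, using MPI_Comm_split to divide the processes into writer groups; the number of writers, the block size and the file names are illustrative only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NWRITERS 4    /* number of local masters (illustrative) */
#define NLOCAL   100  /* doubles held by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, grank, gsize;
    double local[NLOCAL];
    MPI_Comm group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) local[i] = rank;

    /* Split the processes into NWRITERS groups; rank 0 of each group
       acts as the local master and writes for that group. */
    int colour = rank % NWRITERS;
    MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &group);
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    double *buf = NULL;
    if (grank == 0)
        buf = malloc((size_t)gsize * NLOCAL * sizeof(double));

    /* Gather the group's data onto its local master. */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               buf, NLOCAL, MPI_DOUBLE, 0, group);

    /* Each local master writes one file for its group. */
    if (grank == 0) {
        char fname[64];
        snprintf(fname, sizeof fname, "part_%02d.dat", colour);
        FILE *fp = fopen(fname, "wb");
        if (fp != NULL) {
            fwrite(buf, sizeof(double), (size_t)gsize * NLOCAL, fp);
            fclose(fp);
        }
        free(buf);
    }

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}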

Advantages

Disadvantages

It may be possible to match the number of writers to the hardware configuration (the number and topology of the physical disks) to achieve the best possible performance. However, this is a very advanced technique that requires knowledge of the Lustre filesystem and its optimisations. For large-scale parallel applications, to further address the load-balancing issue, it may be possible to introduce dedicated IO processes which receive data from the computation processes and perform IO while the computations are in progress.

MPI-IO

MPI-IO is a programming interface, defined in the MPI-2 standard, for performing file read and write operations in parallel applications. This section very briefly goes through the major features of MPI-IO but is not intended as a tutorial. Here is a list of resources for further information:

Key MPI-IO features
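
One of the key features of MPI-IO is collective IO, where all processes take part in a read or write operation so that the library can merge many small requests into fewer, larger transfers. The following minimal sketch shows a collective write using a file view; the file name and block size NLOCAL are illustrative only.

#include <mpi.h>

#define NLOCAL 100   /* doubles written by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double data[NLOCAL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) data[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank sees the file starting at its own displacement. */
    MPI_Offset disp = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Collective call: every rank participates, allowing the library to
       apply optimisations such as collective buffering. */
    MPI_File_write_all(fh, data, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}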

Case study

Finally, a real-world case study, taken from a dCSE project, is given to demonstrate the level of improvement that can be achieved when proper IO techniques are employed. The following figure shows the performance of the molecular dynamics application DL_POLY_3. Version 3.09 of the code only scales up to about 128 cores because IO performance degrades at higher core counts. In version 3.10, while the computational algorithms of the code remain unchanged, a significant performance improvement has been obtained by introducing the 'multiple writer' model and rearranging data to allow larger IO transactions, making the code scale to thousands of cores. More information about this dCSE project can be found here: IO in DL_POLY_3.

References