Good Practice Guide

Performance Measurement: XE6


Introduction

The purpose of this guide is to suggest how to monitor the runtime behaviour of your application on the XE6 machine and hence to better target your performance tuning and optimisation. This process is called performance measurement, and performance measurement tools (also called profiling tools) are available on HECToR to help you identify the bottlenecks in your code. This guide covers what to measure and how to measure it. Note that it does not discuss what to do if your application is not performing well - see the Serial Code Optimisation Guide and the Parallel Optimisation Guide for tips on how to optimise your code.

What to measure

The bottleneck of an application can be computation, communication or I/O, hence this section concentrates on aspects of performance at three levels: CPU core, parallelism and I/O.

Performance of a single core

The most important metric at the level of a single core is the FLOPS rate (Floating Point Operations per Second). This measure of computational throughput gives a good first indication of performance. The FLOPS rate you can expect to achieve depends not only on the capabilities of HECToR but also on characteristics of your application: the algorithms you employ, how those algorithms have been programmed to use the hardware, which libraries you are using, and how you have compiled your code. For example, it is unrealistic to expect your code to match the performance of the LINPACK benchmark if you are using a naive algorithm, you have non-unit memory strides, or you have compiled without optimisation. If you have an idea of the FLOPS rate you should expect from HECToR but have observed a shortfall, the next step is to investigate what is causing the slowdown.

Since the Opteron is a cache-based architecture, another useful measurement is the number and rate of cache misses. A high rate of cache misses may be caused by a number of factors, such as non-unit striding through arrays or cache thrashing. CPU time spent waiting for memory transfers is time that could be spent performing computation if the memory layout in your program were re-arranged. Even more important is the number of Translation Lookaside Buffer (TLB) misses, which are more costly still than cache misses.
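As an illustration of unit versus non-unit striding (the program below is a sketch rather than a benchmark, and the array name and size are arbitrary), recall that Fortran stores arrays in column-major order, so looping over the first index innermost gives unit-stride, cache-friendly access:

      program stride_demo
         implicit none
         integer, parameter :: n = 2000
         double precision, allocatable :: a(:,:)
         integer :: i, j

         allocate(a(n,n))
         a = 0.0d0

         ! Unit stride: the first (leftmost) index varies fastest, matching
         ! Fortran's column-major storage, so successive accesses are adjacent
         ! in memory and each cache line is fully used.
         do j = 1, n
            do i = 1, n
               a(i,j) = a(i,j) + 1.0d0
            end do
         end do

         ! Non-unit stride: successive accesses are n elements apart, which
         ! typically increases both cache and TLB misses.
         do i = 1, n
            do j = 1, n
               a(i,j) = a(i,j) + 1.0d0
            end do
         end do
      end program stride_demo

Timing the two loop nests separately, or profiling them with the hardware counter groups described later in this guide, will typically show a markedly higher cache and TLB miss rate for the second version.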

Parallel performance

Suppose that most threads/processes of your parallel application have a low FLOPS rate per core, but appear to be making use of vectorisation and working from L1 cache. In this situation it is necessary to shift attention away from what is happening in a single core and investigate its interaction with other cores. Such a problem is likely to be caused by load imbalance: computation on the majority of cores is limited by a lack of work. A basic measurement of load balance involves timing the computation on each core and finding the maximum and average times; the difference max - avg gives the amount of time that could be saved by balancing the load evenly.
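A minimal sketch of this measurement is given below. It assumes an MPI program (the variable names are illustrative) and uses MPI_REDUCE to gather the maximum and average computation times on rank 0.

      program load_balance_sketch
         use mpi
         implicit none
         integer          :: ierr, rank, nprocs
         double precision :: t_local, t_max, t_sum, t_avg

         call MPI_INIT(ierr)
         call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
         call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

         ! Time this process's share of the computation.
         t_local = MPI_WTIME()
         ! ... computation for this process goes here ...
         t_local = MPI_WTIME() - t_local

         ! Gather the maximum and the sum (for the average) on rank 0.
         call MPI_REDUCE(t_local, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                         0, MPI_COMM_WORLD, ierr)
         call MPI_REDUCE(t_local, t_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         0, MPI_COMM_WORLD, ierr)

         if (rank == 0) then
            t_avg = t_sum / nprocs
            print *, 'max =', t_max, ' avg =', t_avg, &
                     ' potential saving (max - avg) =', t_max - t_avg
         end if

         call MPI_FINALIZE(ierr)
      end program load_balance_sketch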

In an MPI application some cores may be held up by time spent waiting for messages to arrive or waiting at barriers. It is always worth considering whether particular communications or barriers are really necessary. Good practice is to monitor who communicates with whom, the amount of data transferred and the time taken for messages to arrive once sent (message latency), and always to time barriers. Such information points to where attention should be focused when improving the load balance of your parallel code.

I/O performance

I/O is a difficult activity to measure accurately on HECToR. For example, when a write routine is called, data may be buffered on the compute nodes, the I/O nodes or in the Lustre filesystem before actually being written to disk, and the routine does not block until the data reaches disk. This means that the time taken by the write routine does not accurately reflect the time taken to write to disk. Similarly, multiple reads from disk may be cached, so measuring a read does not necessarily measure the time taken to read from disk. It is still useful to time read and write routines, however, if only to spot anomalies. It may also be useful to measure the amount of data being written or read - is it all necessary? If the suggestions in the Good Practice Guide for I/O are followed then I/O should not be a significant issue.
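As a minimal sketch (the unit number, buffer and element count are illustrative, and the unit is assumed to have already been opened as an unformatted file), a write can be wrapped in wall-clock timers and the number of bytes requested recorded, bearing in mind that the measured time may reflect buffering rather than time to disk:

        ! Illustrative fragment: time a single unformatted write and record the
        ! number of bytes requested.  Because of buffering, t_write may not
        ! reflect the time for the data to reach disk.
        double precision :: t_write
        integer(kind=8)  :: bytes_written

        t_write = MPI_WTIME()
        write(10) buffer(1:n)               ! unit 10 assumed open, form='unformatted'
        t_write = MPI_WTIME() - t_write

        bytes_written = int(n, kind=8) * 8  ! n double precision values, 8 bytes each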

How to Measure

Performance measurement for benchmarking may be as simple as calling a routine such as MPI_WTIME, which is sufficient for most timing measurements. The other option is to use profiling tools such as CrayPAT, Scalasca, TAU or HPCToolkit. All are available on HECToR and can provide valuable information about the performance of your application, reporting useful metrics related to the efficiency of your communication patterns, wait/synchronisation states and so on. CrayPAT, Scalasca and TAU take the approach of instrumenting executables; however, there is an overhead associated with using these tools, particularly when tracing experiments are performed. Even for HPCToolkit, which performs sampling experiments, the overhead can be up to 5%.

Performing your own timings

There are two types of timing measurement to consider: (i) CPU time, which is the amount of time the CPU spends working on your code; and (ii) wall-clock time, which is the time elapsed between invocation and termination of your code (i.e. time that could be measured by an ordinary clock on the wall). CPU time is less than or equal to wall-clock time, since wall-clock time also includes periods when your code is not executing on the CPU, for example time taken by system processes or spent waiting for I/O. The difference on HECToR should be minimal, however, because CLE (the operating system on the compute nodes) is designed to be as unobtrusive as possible.

CPU time may be measured with the Fortran intrinsic subroutine CPU_TIME, which returns the processor time in seconds via its real argument. Wall-clock time may be measured by calling the MPI function MPI_WTIME, which returns a double precision value in seconds (note that MPI_WTIME is not globally synchronised on HECToR).
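As a minimal sketch of the CPU_TIME pattern (the variable names are illustrative):

        real :: tc1, tc2

        call CPU_TIME(tc1)
        ! ... computation ...
        call CPU_TIME(tc2)
        print *, 'CPU time =', tc2 - tc1, 'seconds'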

These timing routines should be placed around the parts of your program for which you want the kinds of timings discussed in the What to measure section above. For example, it is good practice to measure the time taken by solvers, matrix assembly, I/O and MPI routines (sends/receives/barriers/collectives). In each case this may mean timing library routines. It may not be necessary or practical to measure every pass through a section of code (in particular, it is not a good idea to put timing routines inside tight loops), so try to target specific calls. An example of using the MPI_WTIME function is given below.

        double precision :: dt1, dt2, dt   ! MPI_WTIME returns double precision

        dt1 = MPI_WTIME()
        do i = 1, n
           ! perform computation...
        end do
        dt2 = MPI_WTIME()
        dt = dt2 - dt1                     ! dt = time spent in the loop, in seconds
      
It is always important to take timings more than once before drawing conclusions about your code, because a single timing result may be anomalous. It is also important to be consistent about which value you use (e.g. mean, median, maximum or minimum) when making comparisons.

When many different sections of code are being timed, it is good practice to manage the results in wrapper routines. For example, define two routines, timer_start(event_id) and timer_stop(event_id). In these routines you can store the time for each event and the total number of events, and calculate other useful values such as the mean time per event or the cumulative time for each process. These routines can also associate events with meaningful names so that, for example, you can still remember the reason for a measurement in a few months' time. Timing data should be stored in memory as it is collected, with periodic processing to keep the data structures small. Results should not be printed as they are collected, because this will likely present you with too much information to digest, and the cost of printing adds an overhead. Printing should be done periodically, or even only at the end of a run, before MPI_FINALIZE. A sketch of such wrapper routines is given below.
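The following is a minimal sketch, not a definitive implementation: the routine names timer_start and timer_stop come from the discussion above, but the module name, the fixed maximum number of events, the optional name argument and the reporting routine are illustrative assumptions.

      module simple_timers
         use mpi
         implicit none
         integer, parameter :: max_events = 100        ! assumed fixed upper limit
         character(len=32)  :: event_name(max_events) = ''
         double precision   :: t_start(max_events)    = 0.0d0
         double precision   :: t_total(max_events)    = 0.0d0
         integer            :: n_calls(max_events)    = 0
      contains

         subroutine timer_start(event_id, name)
            integer,          intent(in)           :: event_id
            character(len=*), intent(in), optional :: name
            if (present(name)) event_name(event_id) = name   ! meaningful label
            t_start(event_id) = MPI_WTIME()
         end subroutine timer_start

         subroutine timer_stop(event_id)
            integer, intent(in) :: event_id
            t_total(event_id) = t_total(event_id) + (MPI_WTIME() - t_start(event_id))
            n_calls(event_id) = n_calls(event_id) + 1
         end subroutine timer_stop

         subroutine timer_report()
            ! Print cumulative and mean times per event.  Call this once per
            ! process, near the end of the run, before MPI_FINALIZE.
            integer :: i
            do i = 1, max_events
               if (n_calls(i) > 0) then
                  print '(a32,i10,2f12.4)', event_name(i), n_calls(i), &
                        t_total(i), t_total(i) / n_calls(i)
               end if
            end do
         end subroutine timer_report

      end module simple_timers

A section of code is then timed with calls such as call timer_start(1, 'solver') and call timer_stop(1), with timer_report called once near the end of the run.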

Using CrayPAT for Performance Measurement

CrayPAT is a very useful and easy-to-use tool for benchmarking, and gathering timing results from CrayPAT output is usually straightforward. You may configure your application via pat_build to collect as much or as little timing information as you like, for example for user routines and specific library routines (e.g. MPI, I/O). It is also possible to use the CrayPAT API to time specific sections of code, in a similar way to using your own timing routines as discussed above. Most useful, however, is the information that CrayPAT can provide that is very difficult to collect otherwise, such as hardware counter data and in particular the derived metrics. Furthermore, information on MPI messages, synchronisation and imbalance will help to resolve scaling issues when the bottleneck of the code is communication or synchronisation.
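As a sketch of the CrayPAT API approach (the region identifier and label below are arbitrary; the include file name and argument list should be checked against the CrayPAT documentation for your programming environment):

        ! Mark a user-defined region for CrayPAT; counters and timings are
        ! then reported separately for this region.
        include "pat_apif.h"
        integer :: istat

        call PAT_region_begin(1, "solver", istat)
        ! ... section of code to be measured ...
        call PAT_region_end(1, istat)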

This section does not discuss how to use CrayPAT, only what to look for once you have your results. A guide to using CrayPAT is given in the User Guide to the HECToR Service: Tools (also see the CrayPAT user guide). We will look at the output for two pre-defined CrayPAT hardware counter groups that give useful values for some of the metrics discussed in the What to measure section above.

Hardware group 1: Summary with TLB metrics

When the environment variable PAT_RT_HWPC is set to 1, CrayPAT will report a set of hardware counters that gives a good overview of performance; efficient use of the caches and the TLB (Translation Lookaside Buffer) has a crucial impact on application performance on HECToR. The sample output below shows the four hardware counters used in this group: PAPI_L1_DCM (level 1 data cache misses), PAPI_TLB_DM (data translation lookaside buffer misses), PAPI_L1_DCA (level 1 data cache accesses) and PAPI_FP_OPS (the number of floating point operations).

More useful than these raw data are the derived values lower down the table. For example, HW FP Ops / User time gives the measured FLOPS rate (DP = double precision) and a rating of how this compares to the theoretical peak of 9.2 GFLOPS per core. As a guideline, Cray suggest that anything between 10% and 20% of peak is making good use of the machine; anything above this is very good, while anything below suggests there is room for optimisation.

========================================================================
USER
------------------------------------------------------------------------
 Time%                                        21.8%
 Time                                      2.313276 secs
 Imb.Time                                        -- secs
 Imb.Time%                                       --
 Calls                      0.079M/sec     154995.0 calls
 PAPI_L1_DCM               11.326M/sec     22219097 misses
 PAPI_TLB_DM                0.001M/sec         2230 misses
 PAPI_L1_DCA             1601.846M/sec   3142421662 refs
 PAPI_FP_OPS             3335.765M/sec   6543937895 ops
 User time (approx)         1.962 secs   4512025000 cycles   84.8%Time
 Average Time per Call                     0.000015 sec
 CrayPat Overhead : Time     8.9%
 HW FP Ops / User time   3335.765M/sec   6543937895 ops  36.3%peak(DP)
 HW FP Ops / WCT         2828.861M/sec                        
 Computational intensity     1.45 ops/cycle    2.08 ops/ref  
 MFLOPS (aggregate)     133430.62M/sec
 TLB utilization       1408889.19 refs/miss    2752 avg uses 
 D1 cache hit,miss ratios   99.3% hits         0.7% misses   
 D1 cache utilization (M)  141.43 refs/miss  17.679 avg uses 
========================================================================
    
Computational intensity is calculated as the number of floating point operations per data reference (the table also reports operations per cycle). This value should be around 1 or greater, which implies that the functional units are working on data held in registers. TLB utilization (avg uses) should be at least several hundred (the larger the better); it measures the average number of memory references per TLB miss. The D1 cache hit ratio should be close to 100%. D1 cache utilization (M) should be at least 8 average uses for double precision values and at least 16 for single precision; these figures match the length of a cache line and therefore imply that each value in a cache line is, on average, referenced at least once.
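As a worked example, the derived figures in the table above can be reproduced (to within rounding of the reported times) from the raw counters, assuming a 64-byte cache line holding eight double precision values:

\[
\begin{aligned}
\text{HW FP Ops / User time} &= 6543937895\ \text{ops} \,/\, 1.962\ \text{s} \approx 3.34\ \text{GFLOPS} \approx 36.3\%\ \text{of the 9.2 GFLOPS peak}\\
\text{Computational intensity} &= 6543937895\ \text{ops} \,/\, 4512025000\ \text{cycles} \approx 1.45\ \text{ops/cycle}\\
\text{D1 cache hit ratio} &= 1 - 22219097 / 3142421662 \approx 99.3\%\\
\text{D1 cache utilization} &= 3142421662\ \text{refs} \,/\, 22219097\ \text{misses} \approx 141.4\ \text{refs/miss} \approx 17.7\ \text{average uses}
\end{aligned}
\]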

Hardware group 2: cache (L1 and L2 metrics)

More specific cache information, including details of the level 2 cache, is given by hardware group 2. Of particular interest in the table below are the combined level 1 and level 2 cache hit ratio (D1+D2 cache hit,miss ratio) and the combined level 1 and level 2 utilization. The combined hit ratio is useful because the L2 cache serves as a victim cache for L1; if this value is not close to 100% then your program is not making the best use of the memory hierarchy. A worked example of how these figures are derived is given after the table.

========================================================================
USER
------------------------------------------------------------------------
 Time%                                           29.6%
 Time                                         4.114811 secs
 Imb.Time                                           -- secs
 Imb.Time%                                          --
 Calls                        0.046M/sec      155035.0 calls
 REQUESTS_TO_L2:DATA        137.888M/sec     465761630 req
 DATA_CACHE_REFILLS:
   L2_MODIFIED:L2_OWNED:
   L2_EXCLUSIVE:L2_SHARED   103.646M/sec     350098387 fills
 DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                       33.508M/sec     113183488 fills
 PAPI_L1_DCA               1833.168M/sec    6192117321 refs
 User time (approx)           3.378 secs    7768991175 cycles  82.1%Time
 Average Time per Call                        0.000027 sec
 CrayPat Overhead : Time       5.1%
 D1 cache hit,miss ratio (R)  92.5% hits          7.5% misses
 D1 cache utilization         13.37 refs/refill  1.671 avg uses
 D2 cache hit,miss ratio      75.7% hits         24.3% misses
 D1+D2 cache hit,miss ratio   98.2% hits          1.8% misses  
 D1+D2 cache utilization      54.71 refs/miss    6.839 avg uses 
 System to D1 refill         33.508M/sec     113183488 lines
 System to D1 bandwidth    2045.156MB/sec   7243743250 bytes
 L2 to Dcache bandwidth    6326.061MB/sec  22406296763 bytes
========================================================================
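As with group 1, the derived figures appear to follow from the raw counters (again assuming eight double precision values per cache line):

\[
\begin{aligned}
\text{D1 misses} &= 350098387 + 113183488 = 463281875\ \text{refills}\\
\text{D1 cache hit ratio} &= 1 - 463281875 / 6192117321 \approx 92.5\%\\
\text{D2 cache hit ratio} &= 1 - 113183488 / 465761630 \approx 75.7\%\\
\text{D1+D2 cache hit ratio} &= 1 - 113183488 / 6192117321 \approx 98.2\%\\
\text{D1+D2 cache utilization} &= 6192117321 / 113183488 \approx 54.7\ \text{refs/miss} \approx 6.8\ \text{average uses}
\end{aligned}
\]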
      

Using SCALASCA for Performance Measurement

SCALASCA is an open-source profiling tool designed to monitor the communication patterns of parallel applications. It can therefore identify inter-process inefficiencies such as unnecessary “Wait States” (occurring, for example, as a result of unevenly distributed workloads) or “Late Senders” (where early receives have to wait for sends to be initiated).

Hardware counter information can also be obtained, although SCALASCA does not report derived metrics. For more information, see the SCALASCA website.
