max - avg
gives the amount of time to be saved by balancing load fairly.
In an MPI application some cores may be held up because of time spent
waiting for messages to arrive, or time spent waiting
at barriers.
It is always important to consider whether communications or barriers
are really necessary.
Good practice is to monitor who communicates with whom, the amount of
data transferred and the time
taken for messages to arrive once sent (message latency), and always to time
barriers.
Such information points to where attention should be focused when making
your parallel code fairer.
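The max - avg figure described above can be obtained with two reductions. The following is a minimal sketch, assuming every process times the same section with MPI_WTIME; the routine do_work is a hypothetical placeholder for the code being measured, and all variable names are illustrative.

```fortran
program load_imbalance
  use mpi
  implicit none
  integer :: ierr, my_rank, nprocs
  double precision :: t0, t_local, t_max, t_sum, t_avg

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Time the section of interest on every process
  t0 = MPI_WTIME()
  call do_work(my_rank)
  t_local = MPI_WTIME() - t0

  ! Collect the slowest time and the sum of all times on rank 0
  call MPI_REDUCE(t_local, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                  0, MPI_COMM_WORLD, ierr)
  call MPI_REDUCE(t_local, t_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (my_rank == 0) then
    t_avg = t_sum / dble(nprocs)
    print '(a,f12.6)', 'max time (s):         ', t_max
    print '(a,f12.6)', 'avg time (s):         ', t_avg
    print '(a,f12.6)', 'potential saving (s): ', t_max - t_avg
  end if

  call MPI_FINALIZE(ierr)

contains

  ! Hypothetical workload: deliberately unbalanced, higher ranks do more work
  subroutine do_work(rank_id)
    integer, intent(in) :: rank_id
    integer :: i
    double precision :: x
    x = 0.0d0
    do i = 1, 200000 * (rank_id + 1)
      x = x + sqrt(dble(i))
    end do
    if (x < 0.0d0) print *, x   ! uses x so the loop is not optimised away
  end subroutine do_work

end program load_imbalance
```

Because the timers are local to each process, only the elapsed intervals are compared, not the absolute clock values.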
MPI_WTIME, which is sufficient for most timing measurements. The other option is to use profiling tools such as CrayPAT, Scalasca, TAU or HPCToolkit. All are available on HECToR and can provide valuable information about the performance of your application,
reporting useful metrics related to the efficiency of your communication patterns, wait-synchronization states etc. CrayPAT, Scalasca and TAU take the
approach of instrumenting executables; however, there can be an overhead associated with using these tools, particularly when tracing experiments are performed. Even for HPCToolkit, which performs sampling experiments, the overhead can be up to 5%.
CPU_TIME(REAL(*)).
Wall-clock time may be measured by calling the MPI function MPI_WTIME,
which returns a double precision timing value
(note that MPI_WTIME is not globally synchronised on HECToR).
These timing routines should be placed around the parts of your program
that are required to get the
type of timings discussed in the section above, What to
measure.
For example, it is good practice to measure time taken by solvers,
matrix assembly, I/O, MPI routines
(sends/receives/barriers/collectives).
In each case this may mean timing library routines.
It may not be necessary or practical to measure each pass through a
section of code (in particular, it is
not a good idea to put timing routines in tight loops), so try to target
specific calls.
An example of using the MPI_WTIME function is given below.

    dt1 = MPI_WTIME()
    do i=1,n
       ! perform computation...
    end do
    dt2 = MPI_WTIME()
    ! dt = time spent in loop
    dt = dt2 - dt1

It is always important to take timings more than once before making assumptions about your code because a single timing result may be anomalous. It is also important to be consistent in which value you use (e.g. mean, median, maximum, minimum) for comparisons.

In the case that a lot of different sections of code are being timed it is good practice to manage results in wrapper routines. For example, define two routines
timer_start(event_id) and timer_stop(event_id).
In these routines we can store the time for each event, the total number
of events, and calculate
other useful values like the mean time for an event or the cumulative
time for each process.
These routines can also handle the task of associating events with
meaningful names so that, for example,
we remember the reason for our measurements in a few months' time.
Also, timing data should be stored in memory as it is collected, with
periodic processing to reduce the size
of the data structures.
Results should not be printed as collected because this will likely
present you with too much information
to digest, and the cost of printing adds an overhead.
Printing should be done periodically, or even only at the end of a run,
before MPI_FINALIZE.
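The wrapper routines described above might be sketched as follows. This is only an illustration: the module name timers, the routines timer_init and timer_report, the fixed limit max_events and the event name 'solver' are all assumptions made for the example, not part of any HECToR library.

```fortran
module timers
  use mpi
  implicit none
  private
  public :: timer_init, timer_start, timer_stop, timer_report

  integer, parameter :: max_events = 32           ! assumed upper bound on event ids
  character(len=24)  :: names(max_events)   = ''  ! meaningful name for each event
  double precision   :: t_begin(max_events) = 0.0d0
  double precision   :: t_total(max_events) = 0.0d0
  integer            :: n_calls(max_events) = 0

contains

  ! Associate an event id with a descriptive name
  subroutine timer_init(event_id, name)
    integer, intent(in)          :: event_id
    character(len=*), intent(in) :: name
    names(event_id) = name
  end subroutine timer_init

  ! Record the wall-clock time at the start of an event
  subroutine timer_start(event_id)
    integer, intent(in) :: event_id
    t_begin(event_id) = MPI_WTIME()
  end subroutine timer_start

  ! Accumulate elapsed time and count the event; nothing is printed here
  subroutine timer_stop(event_id)
    integer, intent(in) :: event_id
    t_total(event_id) = t_total(event_id) + (MPI_WTIME() - t_begin(event_id))
    n_calls(event_id) = n_calls(event_id) + 1
  end subroutine timer_stop

  ! Print name, call count, cumulative time and mean time for each used event
  subroutine timer_report()
    integer :: i
    do i = 1, max_events
      if (n_calls(i) > 0) then
        print '(a24,i8,2es14.5)', names(i), n_calls(i), t_total(i), &
                                  t_total(i) / n_calls(i)
      end if
    end do
  end subroutine timer_report

end module timers

program timer_demo
  use mpi
  use timers
  implicit none
  integer :: ierr, step

  call MPI_INIT(ierr)
  call timer_init(1, 'solver')
  do step = 1, 10
    call timer_start(1)
    ! ... call the routine being measured here ...
    call timer_stop(1)
  end do
  call timer_report()        ! print once, at the end, before MPI_FINALIZE
  call MPI_FINALIZE(ierr)
end program timer_demo
```

In this sketch only running totals are kept, so memory use does not grow with the number of timed events, and every process prints its own report. In a real code the report would more likely be reduced across processes or restricted to selected ranks.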
pat_build to collect as much or as little timing information as you like, such as user routines and specific library routines (e.g. MPI,
I/O). It is also possible to use the CrayPAT API to time specific sections of code in a similar way to the discussion above about using your own timing routines. However,
most useful is the information that CrayPAT can provide but which is very difficult to collect otherwise, such as hardware counter information and in particular derived metrics.
Furthermore, information on MPI messages, synchronization and imbalance will help resolve scaling issues when the bottleneck of the code is communication or synchronization.
This section does not discuss how to use CrayPAT, only what to look for
once you have your results.
A guide to using CrayPAT is given in the User Guide to the HECToR Service: Tools
(also see the CrayPAT user guide).
We will look at the output for three pre-defined CrayPAT hardware
counter groups that give useful values
for some of the metrics discussed in the section above, What to measure.
When the environment variable PAT_RT_HWPC is set to 1,
CrayPAT will report hardware
counters that give a good overview of performance, as efficient use of
the cache and the TLB (Translation Lookaside Buffer) has a crucial
impact on the application's performance on HECToR.
The sample output below shows the four hardware counters used in this
group: PAPI_L1_DCM (level 1 data cache misses),
PAPI_TLB_DM (data translation lookaside buffer misses),
PAPI_L1_DCA (level 1 data cache accesses) and
PAPI_FP_OPS (the number of floating point operations).
More useful than these raw data are the derived values lower down the
table (highlighted).
For example, HW FP Ops / User time gives a measure of FLOPS
(DP = double precision) and a rating of how this compares to
the theoretical peak, 9.2 GFLOPS.
As a guideline, Cray suggest that anything between 10% and 20% is making
good use of the machine.
Anything above this is very good, but anything below should be
optimised.
========================================================================
USER
------------------------------------------------------------------------
  Time%                                         21.8%
  Time                                       2.313276 secs
  Imb.Time                                         -- secs
  Imb.Time%                                        --
  Calls                        0.079M/sec    154995.0 calls
  PAPI_L1_DCM                 11.326M/sec    22219097 misses
  PAPI_TLB_DM                  0.001M/sec        2230 misses
  PAPI_L1_DCA               1601.846M/sec  3142421662 refs
  PAPI_FP_OPS               3335.765M/sec  6543937895 ops
  User time (approx)           1.962 secs  4512025000 cycles  84.8%Time
  Average Time per Call                      0.000015 sec
  CrayPat Overhead : Time       8.9%
  HW FP Ops / User time     3335.765M/sec  6543937895 ops  36.3%peak(DP)
  HW FP Ops / WCT           2828.861M/sec
  Computational intensity       1.45 ops/cycle    2.08 ops/ref
  MFLOPS (aggregate)       133430.62M/sec
  TLB utilization         1408889.19 refs/miss    2752 avg uses
  D1 cache hit,miss ratios     99.3% hits         0.7% misses
  D1 cache utilization (M)    141.43 refs/miss  17.679 avg uses
========================================================================

Computational intensity is calculated as the number of operations per data reference. This value should be around 1 or greater, which implies that the functional units are working on data in registers. TLB utilization (avg uses) should be several hundreds (the larger the better). This is a measure of how many memory references resulted in a TLB hit per miss. D1 cache hit ratio should be close to 100%. D1 cache utilization (M) should be at least 8 average uses for double precision values and at least 16 for single precision. These figures match the length of a cache line and therefore imply that each value in a cache line is on average being referenced at least once.
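To make the relationship between the raw counters and the derived values concrete, the short program below recomputes several of them from the numbers in the table above. The formulas are inferred from the table and the accompanying description rather than taken from the CrayPAT documentation, and the constant words_per_line assumes eight double precision values per cache line, as the text implies.

```fortran
program hwpc1_derived
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Raw values copied from the sample output above
  real(dp), parameter :: l1_dcm    = 22219097.0_dp    ! PAPI_L1_DCM (misses)
  real(dp), parameter :: l1_dca    = 3142421662.0_dp  ! PAPI_L1_DCA (refs)
  real(dp), parameter :: fp_ops    = 6543937895.0_dp  ! PAPI_FP_OPS (ops)
  real(dp), parameter :: cycles    = 4512025000.0_dp  ! user cycles
  real(dp), parameter :: user_time = 1.962_dp         ! seconds (rounded in report)

  ! Assumptions for this sketch
  real(dp), parameter :: peak_flops     = 9.2e9_dp    ! theoretical peak quoted in the text
  real(dp), parameter :: words_per_line = 8.0_dp      ! doubles per cache line

  print '(a,f10.2)', 'Computational intensity (ops/cycle): ', fp_ops / cycles
  print '(a,f10.2)', 'Computational intensity (ops/ref):   ', fp_ops / l1_dca
  print '(a,f10.1)', 'Percentage of peak (DP):             ', &
                     100.0_dp * (fp_ops / user_time) / peak_flops
  print '(a,f10.1)', 'D1 cache hit ratio (%):              ', &
                     100.0_dp * (1.0_dp - l1_dcm / l1_dca)
  print '(a,f10.2)', 'D1 cache utilization (refs/miss):    ', l1_dca / l1_dcm
  print '(a,f10.3)', 'D1 cache utilization (avg uses):     ', &
                     (l1_dca / l1_dcm) / words_per_line
end program hwpc1_derived
```

The results should agree closely with the derived values in the table. Small differences are expected where the report prints a rounded input, such as the user time.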
More specific cache information, including details about level 2 cache, is given in hardware group 2. Of specific interest in the table below are the combined level 1 and level 2 cache hit ratio (D1+D2 cache hit,miss ratio) and the combined level 1 and level 2 utilization. The total cache hit ratio is useful because the L2 cache serves as a victim cache for L1. If this value is not close to 100% then your program is not making best use of the memory hierarchy.
========================================================================
USER
------------------------------------------------------------------------
  Time%                                          29.6%
  Time                                        4.114811 secs
  Imb.Time                                          -- secs
  Imb.Time%                                         --
  Calls                         0.046M/sec    155035.0 calls
  REQUESTS_TO_L2:DATA         137.888M/sec   465761630 req
  DATA_CACHE_REFILLS:
    L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED    103.646M/sec   350098387 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:
    ALL                        33.508M/sec   113183488 fills
  PAPI_L1_DCA                1833.168M/sec  6192117321 refs
  User time (approx)            3.378 secs  7768991175 cycles  82.1%Time
  Average Time per Call                       0.000027 sec
  CrayPat Overhead : Time        5.1%
  D1 cache hit,miss ratio (R)   92.5% hits        7.5% misses
  D1 cache utilization          13.37 refs/refill 1.671 avg uses
  D2 cache hit,miss ratio       75.7% hits       24.3% misses
  D1+D2 cache hit,miss ratio    98.2% hits        1.8% misses
  D1+D2 cache utilization       54.71 refs/miss   6.839 avg uses
  System to D1 refill          33.508M/sec   113183488 lines
  System to D1 bandwidth     2045.156MB/sec  7243743250 bytes
  L2 to Dcache bandwidth     6326.061MB/sec 22406296763 bytes
========================================================================
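The cache ratios in this group can be reproduced from the raw counters in the same way. The formulas in the sketch below are inferred from the numbers in the table, not taken from the CrayPAT documentation: level 1 refills are assumed to be the sum of the fills from L2 and the fills from system memory, and a combined D1+D2 miss is assumed to be a fill from system memory. The constant words_per_line again assumes eight double precision values per cache line.

```fortran
program hwpc2_derived
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Raw values copied from the sample output above
  real(dp), parameter :: l2_requests  = 465761630.0_dp   ! REQUESTS_TO_L2:DATA
  real(dp), parameter :: fills_l2     = 350098387.0_dp   ! DATA_CACHE_REFILLS (from L2)
  real(dp), parameter :: fills_system = 113183488.0_dp   ! DATA_CACHE_REFILLS_FROM_SYSTEM
  real(dp), parameter :: l1_dca       = 6192117321.0_dp  ! PAPI_L1_DCA (refs)

  ! Assumption for this sketch: doubles per cache line
  real(dp), parameter :: words_per_line = 8.0_dp

  real(dp) :: refills

  ! Every level 1 refill comes either from L2 or from system memory
  refills = fills_l2 + fills_system

  print '(a,f9.1)', 'D1 hit ratio (%):             ', &
                    100.0_dp * (1.0_dp - refills / l1_dca)
  print '(a,f9.2)', 'D1 utilization (refs/refill): ', l1_dca / refills
  print '(a,f9.1)', 'D2 hit ratio (%):             ', &
                    100.0_dp * (1.0_dp - fills_system / l2_requests)
  print '(a,f9.1)', 'D1+D2 hit ratio (%):          ', &
                    100.0_dp * (1.0_dp - fills_system / l1_dca)
  print '(a,f9.2)', 'D1+D2 utilization (refs/miss):', l1_dca / fills_system
  print '(a,f9.3)', 'D1+D2 utilization (avg uses): ', &
                    (l1_dca / fills_system) / words_per_line
end program hwpc2_derived
```

In this sample the combined hit ratio is high even though the level 1 hit ratio alone is noticeably lower, which illustrates why the text singles out the combined D1+D2 figures.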