The difference max - avg gives the amount of time that could be saved by
balancing the load fairly.
In an MPI application some cores may be held up because of time spent
waiting for messages to arrive, or time spent waiting
at barriers.
It is always important to consider whether communications or barriers
are really necessary.
Good practice is to monitor who communicates with whom, the amount of
data transferred, the time taken for messages to arrive once sent
(message latency), and to always time barriers.
Such information points to where attention should be focused when making
your parallel code fairer.
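As a minimal illustration of the last point (using MPI_WTIME, which is
discussed below; the variable names are only examples), the wait at a
barrier can be timed directly:

! Sketch: measure the time this rank spends waiting at a barrier.
! A long wait on some ranks is a sign of load imbalance.
double precision :: t_bar
integer :: ierr
t_bar = MPI_WTIME()
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
t_bar = MPI_WTIME() - t_bar   ! time spent waiting at the barrier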
One option is to insert your own timing calls, for example the MPI
function MPI_WTIME, which is sufficient for most timing measurements.
The other option is to use profiling tools such as CrayPAT, Scalasca,
TAU or HPCToolkit. All are available on HECToR and can provide valuable
information about the performance of your application, reporting useful
metrics related to the efficiency of your communication patterns,
wait/synchronization states etc. CrayPAT, Scalasca and TAU take the
approach of instrumenting executables; however, there can be an overhead
associated with using these tools, particularly when tracing experiments
are performed. Even for HPCToolkit, which performs sampling experiments,
the overhead can be up to 5%.
CPU time may be measured with the Fortran intrinsic subroutine
CPU_TIME(REAL(*)).
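As a brief illustration (variable names are only examples), CPU_TIME
takes a default real argument and returns the processor time in seconds:

real :: tc1, tc2
call CPU_TIME(tc1)
! ... computation ...
call CPU_TIME(tc2)
print *, 'CPU time (s): ', tc2 - tc1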
Wall-clock time may be measured by calling the MPI function MPI_WTIME,
which
returns a double precision timing value
(note that MPI_WTIME is not globally synchronised on
HECToR).
These timing routines should be placed around the parts of your program
needed to obtain the types of timings discussed in the section above,
What to measure.
For example, it is good practice to measure time taken by solvers,
matrix assembly, I/O, MPI routines
(sends/receives/barriers/collectives).
In each case this may mean timing library routines.
It may not be necessary or practical to measure each pass through a
section of code (in particular, it is
not a good idea to put timing routines in tight loops), so try to target
specific calls.
An example of using the MPI_WTIME function is given below.
! MPI_WTIME is available via "use mpi" (or include 'mpif.h')
double precision :: dt1, dt2, dt

dt1 = MPI_WTIME()
do i = 1, n
   ! perform computation...
end do
dt2 = MPI_WTIME()
dt = dt2 - dt1   ! dt = wall-clock time spent in the loop
It is always important to take timings more than once before making
assumptions about your code
because a single timing result may be anomalous.
It is also important to be consistent in which value you use (e.g. mean,
median, maximum, minimum)
for comparisons.
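One simple pattern (the names ntrials, t_sample and solver_step are
purely illustrative) is to repeat the timed region a few times, keep
every sample, and report whichever statistic you have chosen:

! Sketch: take several timing samples of the same region.
integer, parameter :: ntrials = 5
double precision :: t_sample(ntrials), t0
integer :: k

do k = 1, ntrials
   t0 = MPI_WTIME()
   call solver_step()              ! region being timed (illustrative)
   t_sample(k) = MPI_WTIME() - t0
end do
print *, 'min = ', minval(t_sample), ' mean = ', sum(t_sample)/ntrials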
When a lot of different sections of code are being timed, it is good
practice to manage the results in wrapper routines.
For example, define two routines timer_start(event_id) and timer_stop(event_id).
In these routines we can store the time for each event, the total number
of events, and calculate
other useful values like the mean time for an event or the cumulative
time for each process.
These routines can also handle the task of associating events with
meaningful names so that, for example,
we remember the reason for our measurements in a few months' time.
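A minimal sketch of such a pair of routines is given below; the module,
array sizes and names are purely illustrative, and the timing again uses
MPI_WTIME:

! Sketch of timing wrapper routines (illustrative names; assumes a
! fixed maximum number of events).
module timers
   use mpi
   implicit none
   integer, parameter :: max_events = 100
   double precision   :: t_start(max_events) = 0.0d0  ! start time of current event
   double precision   :: t_total(max_events) = 0.0d0  ! cumulative time per event
   integer            :: n_calls(max_events) = 0      ! number of times each event ran
   character(len=32)  :: event_name(max_events) = ''  ! label set by the caller
contains
   subroutine timer_start(event_id)
      integer, intent(in) :: event_id
      t_start(event_id) = MPI_WTIME()
   end subroutine timer_start

   subroutine timer_stop(event_id)
      integer, intent(in) :: event_id
      t_total(event_id) = t_total(event_id) + (MPI_WTIME() - t_start(event_id))
      n_calls(event_id) = n_calls(event_id) + 1
   end subroutine timer_stop

   ! mean time per call of one event on this process
   double precision function timer_mean(event_id)
      integer, intent(in) :: event_id
      timer_mean = t_total(event_id) / max(1, n_calls(event_id))
   end function timer_mean
end module timers

Calls such as timer_start(1) before the solver and timer_stop(1) after it
then accumulate the timings without cluttering the main code.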
Timing data should be stored in memory as it is collected, with periodic
processing to reduce the size of the data structures.
Results should not be printed as they are collected, because this will
likely present you with too much information to digest, and the cost of
printing adds an overhead.
Printing should be done periodically, or even only at the end of a run,
before MPI_FINALIZE.
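A sketch of such an end-of-run report, which gathers the per-process
totals from the illustrative timers module above with MPI_REDUCE and
prints them once on rank 0, might look like this:

! Sketch: print timing statistics once, just before MPI_FINALIZE.
subroutine timer_report(comm)
   use mpi
   use timers
   implicit none
   integer, intent(in) :: comm
   integer :: rank, nprocs, ierr, i
   double precision :: t_min(max_events), t_sum(max_events), t_max(max_events)

   call MPI_COMM_RANK(comm, rank, ierr)
   call MPI_COMM_SIZE(comm, nprocs, ierr)
   call MPI_REDUCE(t_total, t_min, max_events, MPI_DOUBLE_PRECISION, MPI_MIN, 0, comm, ierr)
   call MPI_REDUCE(t_total, t_sum, max_events, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
   call MPI_REDUCE(t_total, t_max, max_events, MPI_DOUBLE_PRECISION, MPI_MAX, 0, comm, ierr)

   if (rank == 0) then
      do i = 1, max_events
         if (n_calls(i) > 0) then                      ! events used on rank 0
            write(*,'(a32,3f12.4)') event_name(i), t_min(i), t_sum(i)/nprocs, t_max(i)
         end if
      end do
   end if
end subroutine timer_report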
CrayPAT allows the executable to be instrumented with pat_build to
collect as much or as little timing information as you like, such as user
routines and specific library routines (e.g. MPI, I/O). It is also
possible to use the CrayPAT API to time specific sections of code, in a
similar way to the discussion above about using your own timing routines.
However, the most useful information is that which CrayPAT can provide
but which is very difficult to collect otherwise, such as hardware
counter data and in particular the derived metrics.
Furthermore, information on MPI messages, synchronization and imbalance
will help to resolve scaling issues when the bottleneck of the code is
communication or synchronization.
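As a rough sketch of the API approach (check pat_help or the perftools
documentation for the exact interface on your system; the region id and
label here are arbitrary examples):

! Sketch: marking a user-defined region with the CrayPAT API.
include 'pat_apif.h'
integer :: istat

call PAT_region_begin(7, 'matrix_assembly', istat)
! ... code to be measured ...
call PAT_region_end(7, istat)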
This section does not discuss how to use CrayPAT, only what to look for
once you have your results.
A guide to using CrayPAT is given in the User Guide to the HECToR Service: Tools
(also see the
CrayPAT
user guide).
We will look at the output for three pre-defined CrayPAT hardware
counter groups that give useful values
for some of the metrics discussed in the section above What
to measure.
When the environment variable PAT_RT_HWPC is set to 1,
CrayPAT will report hardware
counters that give a good overview of performance, as efficient use of
the cache and the TLB (Translation Lookaside Buffer) has a crucial
impact on the application's performance on HECToR.
The sample output below shows the four hardware counters used in this
group: PAPI_L1_DCM (level 1 data cache misses),
PAPI_TLB_DM (data translation lookaside buffer misses), PAPI_L1_DCA
(level 1 data
cache accesses) and PAPI_FP_OPS (the number of floating
point operations).
More useful than these raw counts are the derived values lower down the
table.
For example, HW FP Ops / User time gives a measure of FLOPS
(DP = double precision) and a rating of how this compares to
the theoretical peak of 9.2 GFLOPS.
As a guideline, Cray suggest that anything between 10% and 20% is making
good use of the machine.
Anything above this is very good, but anything below should be
optimised.
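In the sample output below, for example, the reported rate of 3335.765
MFLOPS against the 9.2 GFLOPS peak gives 3335.765/9200 ≈ 36.3%, which
appears in the table as 36.3%peak(DP).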
========================================================================
USER
------------------------------------------------------------------------
Time% 21.8%
Time 2.313276 secs
Imb.Time -- secs
Imb.Time% --
Calls 0.079M/sec 154995.0 calls
PAPI_L1_DCM 11.326M/sec 22219097 misses
PAPI_TLB_DM 0.001M/sec 2230 misses
PAPI_L1_DCA 1601.846M/sec 3142421662 refs
PAPI_FP_OPS 3335.765M/sec 6543937895 ops
User time (approx) 1.962 secs 4512025000 cycles 84.8%Time
Average Time per Call 0.000015 sec
CrayPat Overhead : Time 8.9%
HW FP Ops / User time 3335.765M/sec 6543937895 ops 36.3%peak(DP)
HW FP Ops / WCT 2828.861M/sec
Computational intensity 1.45 ops/cycle 2.08 ops/ref
MFLOPS (aggregate) 133430.62M/sec
TLB utilization 1408889.19 refs/miss 2752 avg uses
D1 cache hit,miss ratios 99.3% hits 0.7% misses
D1 cache utilization (M) 141.43 refs/miss 17.679 avg uses
========================================================================
Computational intensity is calculated as the number of operations per
data reference.
This value should be around 1 or greater, which implies that the
functional units are working on data
in registers.
TLB utilization (avg uses) should be several hundred or more (the larger
the better); it is a measure of how many memory references result in a
TLB hit for each miss.
D1 cache hit ratio should be close to 100%.
D1 cache utilization (M) should be at least 8 average uses for double
precision values and at least 16 for single precision.
These figures match the number of values that fit in a 64-byte cache line
and therefore imply that each value in a cache line is, on average,
referenced at least once.
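To relate the derived values to the raw counters (assuming 64-byte cache
lines and the default 4 KB page size): 3142421662 level 1 data cache
references over 22219097 misses gives roughly the 141.43 refs/miss shown
above, and dividing by the 8 double precision values per cache line gives
the 17.679 average uses; similarly, 1408889 references per TLB miss
divided by the 512 double precision values per page gives the 2752
average uses.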
More specific cache information, including details about the level 2
cache, is given by hardware group 2. Of specific interest in the table
below are the combined level 1 and level 2 cache hit ratio (D1+D2 cache
hit,miss ratio) and the combined level 1 and level 2 utilization. The
combined cache hit ratio is useful because the L2 cache serves as a
victim cache for L1. If this value is not close to 100% then your program
is not making the best use of the memory hierarchy.
========================================================================
USER
------------------------------------------------------------------------
Time% 29.6%
Time 4.114811 secs
Imb.Time -- secs
Imb.Time% --
Calls 0.046M/sec 155035.0 calls
REQUESTS_TO_L2:DATA 137.888M/sec 465761630 req
DATA_CACHE_REFILLS:
L2_MODIFIED:L2_OWNED:
L2_EXCLUSIVE:L2_SHARED 103.646M/sec 350098387 fills
DATA_CACHE_REFILLS_FROM_SYSTEM:
ALL 33.508M/sec 113183488 fills
PAPI_L1_DCA 1833.168M/sec 6192117321 refs
User time (approx) 3.378 secs 7768991175 cycles 82.1%Time
Average Time per Call 0.000027 sec
CrayPat Overhead : Time 5.1%
D1 cache hit,miss ratio (R) 92.5% hits 7.5% misses
D1 cache utilization 13.37 refs/refill 1.671 avg uses
D2 cache hit,miss ratio 75.7% hits 24.3% misses
D1+D2 cache hit,miss ratio 98.2% hits 1.8% misses
D1+D2 cache utilization 54.71 refs/miss 6.839 avg uses
System to D1 refill 33.508M/sec 113183488 lines
System to D1 bandwidth 2045.156MB/sec 7243743250 bytes
L2 to Dcache bandwidth 6326.061MB/sec 22406296763 bytes
========================================================================
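As a check on how these combined values are derived: the 113183488 cache
line refills that came from memory, set against the 6192117321 level 1
data cache references, give a combined miss rate of
113183488/6192117321 ≈ 1.8%, i.e. the 98.2% D1+D2 hit ratio reported
above.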