Good Practice Guide

Preparing for HECToR Phase IIb (24-core)

Appendix: some micro-benchmarking results

Context

These results are intended to help explain the NUMA nature of an XT6 node. They should not be taken as official benchmarking figures, and the performance of individual codes will be subject to influences beyond the architectural characteristics that these results are intended to illustrate.

Memory Benchmark Results

The test uses 24 OpenMP threads, one running on each core of a node, to perform 24 independent 3000x3000 matrix multiplications.

Thread placement is such that threads 0-5 run on die 0 in processor 0, 6-11 on die 1 in processor 0, 12-17 on die 0 in processor 1 and 18-23 on die 1 in processor 1.

The arrays used to store the matrices are initialised in parallel. Under the first-touch placement policy, the data initialised by a thread is therefore stored in the local memory attached to the die housing the core on which that thread runs.

The test then performs 24 matrix multiplications in parallel, each thread using the data it initialised locally (taking advantage of thread affinity), and reports the time taken by each thread (a sketch of this structure in code follows Figure 2):

Thread ID    Data init. thread ID    Time (s)
   16              16                23.129
   19              19                23.133
   21              21                23.135
    7               7                23.135
   12              12                23.135
   15              15                23.137
   20              20                23.137
   18              18                23.139
    9               9                23.139
   22              22                23.139
    6               6                23.141
   13              13                23.141
   10              10                23.143
    3               3                23.143
    8               8                23.144
   11              11                23.144
   23              23                23.146
   17              17                23.146
   14              14                23.150
    4               4                23.150
    2               2                23.150
    5               5                23.152
    1               1                23.156
    0               0                23.161

Figure 2: 24 matrix multiplications in parallel on data initialised locally.

The test then repeats the 24 matrix multiplications, but this time not all threads work on their original local data. Within each die:

- the three lowest-numbered threads work on their original local data;
- the fourth thread works on the data previously used by the fourth thread in the neighbouring die within the same processor;
- the fifth thread works on the data previously used by the fifth thread in the counterpart die in the neighbouring processor;
- the sixth thread works on the data previously used by the sixth thread in the diagonally opposite die in the neighbouring processor.

Thus the three highest-numbered threads in each die must retrieve data from memory over the 24-bit, 16-bit and 8-bit HT links, respectively (this mapping is sketched in code after Figure 3). The timings are as follows:

Thread ID    Data init. thread ID    Time (s)
   12              12                23.126
    7               7                23.129
   18              18                23.131
   20              20                23.132
   14              14                23.132
   19              19                23.133
   13              13                23.134
    6               6                23.137
    8               8                23.139
    1               1                23.141
    2               2                23.143
    0               0                23.149
    3               9                23.429
    4              16                23.430
   15              21                23.438
   21              15                23.442
   22              10                23.450
   10              22                23.451
    9               3                23.466
   16               4                23.472
   17              11                23.501
    5              23                23.505
   11              17                23.505
   23               5                23.533

Figure 3: 24 matrix multiplications in parallel on data initialised locally (first 12 rows), and on data initialised in a neighbouring die within the same processor, in the counterpart die in the neighbouring processor, or in the diagonally opposite die in the neighbouring processor (remaining rows).

The results show a clear performance penalty, of roughly 0.3-0.4 seconds (around 1.5%), when working on non-local data. As expected, working on data held by the diagonally opposite die shows the worst performance, but the difference between working on data held by a neighbouring die within the same processor and data held by a counterpart die in the neighbouring processor is negligible.

Inter-node Communication Benchmark Results

The test uses 48 MPI processes distributed across two XT6 nodes. It is designed to measure the effect of each node's single shared 6.4GB/s HT link to the interconnect, and to establish whether the hierarchy of HT links within a node influences communication times for messages sent between the two nodes.

Figure 4 shows the setup of the test and labels the dies 0-7. Default process placement is used, which means that MPI rank k runs on core i of die j, where k = 6*j + i (i.e. processes are placed on dies following the numbering indicated in Figure 4). The program sets up a communication pattern such that the processes in ONE pair of corresponding dies in the two nodes ping-pong data to each other, while all other processes perform self-to-self communication (which, with Cray's MPI library, simply involves accessing local memory). The message size is 536MB and the time for 10 consecutive 536MB ping-pongs is recorded.

The program performs an MPI_Isend and, since the message size is large enough that the long messaging protocol is used, the message is held on the send side and only transferred via the interconnect when the matching MPI_Recv is called. The cumulative time for 10 MPI_Recvs is taken (the MPI_Isends are not timed). This ensures a constant amount of contention on each node's HT link to the interconnect in every test, although, since the interconnect is a shared resource, it introduces some variability.

The performance results are given in Figure 5. Each block of red and yellow bars represents a single run of the program, with ping-pong communication between processes in a pair of corresponding dies (except the first block, which is the control case: all processes perform self-to-self communication).

Each red or yellow bar applies to one die: it shows the average time taken to receive a message over the 6 cores in that die, with the error bar giving the standard deviation of the 6 timings. Within each block the 8 dies are shown in increasing order of the numbering indicated in Figure 4, so red bars show the times for processes in node 1 and yellow bars those for processes in node 2. The large white bar in each block except the first represents the average time for the off-node communication in that test. The blocks are placed on the x-axis in decreasing order of the on-node bandwidth available to the off-node communication pattern. For example, in the 3<->7 test data must travel via an 8-bit diagonal HT link in both the source and destination nodes, whereas in the 0<->4 test data is passed straight out through the HT link to the interconnect.


Figure 4: Two nodes with labelled dies connected via the interconnect.


Figure 5: Inter-node communication performance.

The second block (0<->4) shows that off-node communication between processes in dies 0 and 4 costs roughly twice as much as self-to-self, in-memory communication. The other dies are unaffected by the off-node communication in this test.

In the third block (1<->5), even though dies 0 and 4 are engaged only in self-to-self communication, they appear to be affected by the traffic from dies 1 and 5 passing through the attached HT links to the interconnect. The other dies are unaffected by the off-node traffic. The same pattern appears in the final two tests (2<->6 and 3<->7): dies 0 and 4 are affected by off-node traffic even though they perform only self-to-self communication.

The large white bars tend to increase along the x-axis, which suggests that the hierarchy of HT links within a node does affect off-node communication, although the differences between the bars are small in this test.

Back to main guide