GPU Highlights

The K20c cards implement the latest Kepler GK110 architecture and provide several useful capabilities over previous generations. A device thread is now capable of launching new grids of threads (branded `dynamic parallelism'), and up to 32 host processes can now hold contexts on a single device (named `Hyper-Q'). Another important feature is direct access to device memory through the kernel driver (conveniently called `GPUDirect'). The GPU architecture differs significantly from the traditional multi-CPU model.
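As a minimal sketch of dynamic parallelism (assuming a device of compute capability 3.5 such as the K20c and compilation with `nvcc -arch=sm_35 -rdc=true'; the kernel names are illustrative):

```cuda
#include <cstdio>

// Child kernel: launched from the device, not from the host.
__global__ void child(int parent_block)
{
    printf("child thread %d launched by parent block %d\n",
           threadIdx.x, parent_block);
}

// Parent kernel: one thread per block launches a new grid from
// device code, which is the essence of dynamic parallelism on GK110.
__global__ void parent(void)
{
    if (threadIdx.x == 0)            // one child launch per block
        child<<<1, 4>>>(blockIdx.x);
}

int main(void)
{
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();         // wait for parent and child grids
    return 0;
}
```

Device-side launches require relocatable device code and linking against the device runtime, hence the `-rdc=true' flag above.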

One K20c chip carries 13 multiprocessors (SMX), each supplying 192 single precision cores, 64 double precision units, 32 special function units, 32 load/store units and a large, dynamically shared register file. Each core executes instructions in order, without speculative execution (similar to the famed but now discontinued Itanium processors), at a rather low clock frequency; the strength therefore lies only in numbers, and low utilisation translates directly into low performance.
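These per-device figures can be queried at runtime through the CUDA runtime API; a short sketch (the printed values are device dependent; the comments assume a K20c):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("name:              %s\n", prop.name);
    printf("multiprocessors:   %d\n", prop.multiProcessorCount);    // 13 on a K20c
    printf("warp size:         %d\n", prop.warpSize);               // 32
    printf("shared mem/block:  %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("regs per block:    %d\n", prop.regsPerBlock);
    printf("max threads/block: %d\n", prop.maxThreadsPerBlock);     // 1024
    return 0;
}
```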

The software environment abstracts away some of the lower level grouping details. Execution starts by launching a (up to three-dimensional) `grid' of (again up to three-dimensional) blocks of threads; a thread block can contain at most 1024 threads. Each thread has access to private memory on the first level, to a per-block shared memory (configurable as 16, 32 or 48 KB) on the second, and to the global grid memory on the third; latency and transfer slowdown increase in the same order. A grid can contain up to 2^31-1 blocks and their scheduling is handled by the system.

How many blocks/threads run simultaneously depends greatly on the per-thread register and static shared memory allocation, besides the limits on threads and blocks per multiprocessor. It is worth noting that every 32 threads share a scheduler, and grave inefficiency is incurred if there is any divergence in the instructions they execute. Such a bundle is called a `warp', and it is a good idea to allocate block sizes in multiples of it and to keep this granularity in mind generally. Useful tips and best practices may be found in the official documentation and the numerous examples, guides and tutorials published by NVIDIA.
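The hierarchy above can be illustrated by a minimal kernel: the block size is a multiple of the 32-thread warp, each block uses its own shared memory, and the grid covers a global array (the kernel and array names are illustrative, not from a particular code base):

```cuda
#include <cuda_runtime.h>

// Scales an array by 2, one element per thread.
__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the grid
    if (i < n) {                                    // uniform within a warp when
        tile[threadIdx.x] = in[i];                  // n is a multiple of the
        out[i] = 2.0f * tile[threadIdx.x];          // block size, so no divergence
    }
}

int main(void)
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    int block = 256;                       // a multiple of the warp size 32
    int grid  = (n + block - 1) / block;   // enough blocks to cover n elements
    scale<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```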

DP 2013-08-01