AVS/Express currently uses the open-source Paracomp compositing library, initially developed by HP. The library requires dynamic linking (its framework relies on dlopen()) and supports TCP/IP, InfiniBand and Mellanox network layers. It also uses multiple pthreads for networking, control and image operations. The basic compositing method it employs is Scheduled Linear Image Compositing. Although this has proved to be an effective compositing library on render clusters that have InfiniBand or Mellanox networking, its lack of MPI support, coupled with its use of multiple pthreads and dynamic linking, led to our decision to remove the full library from AVS/Express on HECToR. Instead, we have compiled the library to provide just the core image compositing routines. This reduced library can be compiled statically and provides a method of compositing two images together, using either depth testing or alpha blending. This is the fundamental image operation required of any compositor.
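The two-image operation described above can be illustrated with a minimal sketch. It is not the Paracomp API: the function names, the flat list-of-pixels layout and the use of premultiplied RGBA values are assumptions made only for illustration.

```python
def composite_depth(rgba_a, depth_a, rgba_b, depth_b):
    """Depth-test compositing: per pixel, keep the fragment with the smaller depth.

    Hypothetical helper; images are flat lists of RGBA tuples with matching depth lists.
    """
    out_rgba, out_depth = [], []
    for pa, da, pb, db in zip(rgba_a, depth_a, rgba_b, depth_b):
        if da <= db:
            out_rgba.append(pa)
            out_depth.append(da)
        else:
            out_rgba.append(pb)
            out_depth.append(db)
    return out_rgba, out_depth


def composite_over(front, back):
    """Alpha blending with the 'over' operator.

    Assumes premultiplied RGBA pixels and that `front` is nearer the viewer than `back`.
    """
    out = []
    for f, b in zip(front, back):
        transmit = 1.0 - f[3]  # fraction of the back pixel that shows through
        out.append(tuple(fc + transmit * bc for fc, bc in zip(f, b)))
    return out
```

Depth testing is order-independent, so the two images can be passed in either order. The 'over' operator is not commutative, so the caller has to know which image is in front.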
Having removed the Paracomp communication facilities, we have implemented the 2-3 Swap Image Compositing method. This allows all image communication between render processes (which perform the compositing operations) to take place within the Cray MPI layer. The 2-3 Swap method is similar to Binary-Swap Compositing but removes the need for a power-of-2 number of render processes. This is important in AVS/Express due to the use of the MPI proxy process.
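The reason an arbitrary process count can be handled rests on a simple observation: any integer of 2 or more can be written as a sum of twos and threes. The following sketch shows only that grouping arithmetic, not the communication schedule of the 2-3 Swap method itself; the function names are hypothetical.

```python
def is_power_of_two(n):
    """True when n is 1, 2, 4, 8, ... (the counts a plain binary swap expects)."""
    return n > 0 and (n & (n - 1)) == 0


def groups_of_two_and_three(n):
    """Partition ranks 0..n-1 into consecutive groups of two, with one group of three if n is odd."""
    if n < 2:
        raise ValueError("need at least two render processes")
    ranks = list(range(n))
    groups = []
    start = 0
    if n % 2 == 1:
        groups.append(ranks[:3])  # a single group of three absorbs the odd rank
        start = 3
    while start < n:
        groups.append(ranks[start:start + 2])
        start += 2
    return groups
```

For example, seven render processes are not a power of 2, yet they still split cleanly into one group of three and two groups of two.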
Parallel image compositing reduces the time required to blend the rendered images of the sub-domains (recall that the dataset is divided into sub-domains of data) from every render process. Sending full-sized images from all render processes directly to one process for blending (using either depth testing or alpha blending) would introduce a bottleneck at that process. By dividing images at every process in screen-space (i.e., discarding rows of pixels) and exchanging these sub-images, all processes can take part in the blending operation. Eventually every process will have a sub-image that contains the fully composited rendering of the dataset for its region of the screen. The final step is to gather these sub-images at a single process and copy them into the image buffer. This final gather step is much less of a bottleneck because the screen-space sub-images are small at this stage (each sub-image contains 1/N of the total number of pixels in the final image, where N is the number of render processes). All of this communication takes place on the backend nodes. We then have to send the final image through the MPI proxy process to the express process so that it can display it in the user interface.
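The screen-space division and final gather can be sketched in a single-process simulation. This is a simplified direct-send style division, in which each process owns one band of rows and blends that band from every process in one step; it is not the multi-round 2-3 Swap schedule. It uses depth testing only, and all names and the nested-list image layout are assumptions.

```python
def row_bands(height, n):
    """Split `height` rows into n contiguous bands whose sizes differ by at most one row."""
    base, extra = divmod(height, n)
    bands, start = [], 0
    for p in range(n):
        size = base + (1 if p < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands


def composite_parallel(images, depths):
    """Simulate compositing where images[p][row][col] and depths[p][row][col] belong to process p."""
    n = len(images)
    height = len(images[0])
    final = [None] * height
    for owner, (lo, hi) in enumerate(row_bands(height, n)):
        # Process `owner` would receive rows lo..hi-1 from every other process and blend them.
        for r in range(lo, hi):
            row = []
            for c in range(len(images[0][r])):
                nearest = min(range(n), key=lambda p: depths[p][r][c])
                row.append(images[nearest][r][c])
            # Gather step: the owner's finished band is copied into the final image buffer.
            final[r] = row
    return final
```

With a 1024-row image and 8 render processes, `row_bands(1024, 8)` gives every process a band of 128 rows, so each process contributes one eighth of the final pixels to the gather.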