Overcoming computational barriers: the search for gene-gene interactions in colorectal cancer
University of Edinburgh Colon Cancer Genetics Group (CCGG)
EPCC, University of Edinburgh
Image of a normal colon.
National Cancer Registration data indicate that some 35,000 people each year are diagnosed with colorectal cancer (cancer of the large bowel and rectum) and 16,000 die from the disease. Excluding skin cancer, this makes it one of the most common forms of cancer in the country in both men (after prostate and lung cancer) and women (after breast cancer). While the development of effective treatments is clearly important, early identification of patients at risk, together with prevention, is a primary objective of all major cancer agencies and of National Health Service policy.
Armed with first access to an unprecedented set of genomic data in colorectal cancer, the University of Edinburgh Colon Cancer Genetics Group (CCGG) and EPCC teamed up to investigate the relationship between genetic markers and colorectal cancer. Following on from a previous project which examined individual genes, the current study looks at gene-gene interactions (GxG) as a possible contributor to colorectal cancer risk.
The scale of the programme is substantial. It aims to use a significant portion of the largest genotypic data set for large bowel cancer compiled anywhere in the world to date: a unique and extensive set of 560,000 genetic markers, with real data from 1,000 cancer cases and 1,000 matched controls. For every pair of markers, the analysis software calculates the probability that an apparent interaction has arisen by chance. To calculate these probability values, every marker must be compared with every other marker, giving roughly 150 billion comparisons. On a standard PC, this analysis, using the existing software, would have taken about 400 days and required over 3 TB of memory and hard disk space. Clearly, this is not feasible in practice. The way forward was to optimise and parallelise the code, spreading the marker comparisons across multiple processors and so reducing the calculation time.
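As a quick check on that scale, the number of distinct marker pairs is n(n-1)/2. The short Python sketch below evaluates it for 560,000 markers; the variable names are illustrative and are not taken from the project's software.

```python
from math import comb

# Number of genetic markers quoted in the article.
n_markers = 560_000

# Each unordered pair of distinct markers is tested once: n * (n - 1) / 2.
n_pairs = comb(n_markers, 2)

print(f"{n_pairs:,} marker pairs")  # 156,799,720,000 marker pairs
```

The exact figure is a little under 157 billion, consistent with the rounded figure of about 150 billion comparisons.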
This work used three different machines in a complementary fashion. HECToR was chosen for the main analysis because of its large-scale computational capability. A local parallel cluster (with similar processors to HECToR) was used for development. Sorting the result data was not computationally demanding but did require access to large amounts of memory, and hence was well suited to HPCx.
First, serial optimisations were performed, resulting in a threefold speed-up, and the code was modularised and thoroughly tested. Then the code was parallelised using a 2D decomposition to split the marker-pair comparisons into manageable "chunks". The size of a chunk was chosen to fit within the memory available to a single processor. A task farm approach was used to distribute the chunks to the parallel processors on a "first come, first served" basis, which works well because the individual processors do not need to exchange any data during processing. The resulting analysis took approximately 5 hours on HECToR (using 512 cores): a vast improvement on 400 days!
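The decomposition and task farm can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the tile size and the use of Python's `multiprocessing.Pool` in place of the message passing a supercomputer run would use are all assumptions, and the per-chunk work simply counts pairs as a stand-in for the real interaction test.

```python
from multiprocessing import Pool


def make_chunks(n_markers, block):
    """Yield (row_start, row_end, col_start, col_end) tiles of the pair matrix.

    Only tiles on or above the diagonal are produced, because the pair
    (i, j) is the same comparison as (j, i).
    """
    starts = range(0, n_markers, block)
    for r in starts:
        for c in starts:
            if c >= r:
                yield (r, min(r + block, n_markers), c, min(c + block, n_markers))


def count_pairs_in_chunk(chunk):
    """Placeholder for the per-chunk analysis: count pairs (i, j) with i < j."""
    r0, r1, c0, c1 = chunk
    count = 0
    for i in range(r0, r1):
        # On diagonal tiles, skip j <= i so each pair is visited once.
        for j in range(max(c0, i + 1), c1):
            count += 1
    return chunk, count


def run_task_farm(n_markers, block, workers):
    """Hand chunks to idle workers one at a time and sum their results."""
    chunks = list(make_chunks(n_markers, block))
    with Pool(workers) as pool:
        # chunksize=1 means a worker fetches a new chunk whenever it is free,
        # i.e. "first come, first served"; results arrive in completion order.
        results = pool.imap_unordered(count_pairs_in_chunk, chunks, chunksize=1)
        return sum(count for _, count in results)


if __name__ == "__main__":
    n = 200  # tiny illustrative problem size, not the real 560,000 markers
    total = run_task_farm(n, block=64, workers=4)
    assert total == n * (n - 1) // 2  # every pair covered exactly once
    print(total)
```

Because no chunk depends on another, the order in which chunks finish does not matter. For reference, 400 days is 9,600 hours, so a 5-hour run corresponds to a reduction in elapsed time of roughly 1,900 times, achieved through the serial optimisations and the 512 cores combined.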
The resulting 200 GB of output data then needed to be sorted, so that pairs of gene markers could be ranked by how likely they are to interact with each other. As with the analysis itself, sorting such a large amount of data on a single PC was not feasible. A parallel sorting algorithm was identified, developed and run on HPCx to perform this task. The sorted data is now undergoing further study, with results expected in Q4 2009.
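The article does not say which parallel sorting algorithm was used. Purely as an illustration of the general idea, the Python sketch below sorts independent runs in parallel and then merges them; the record layout `(p_value, marker_i, marker_j)`, the function names and the run-then-merge strategy are all assumptions, not the project's method.

```python
import heapq
import random
from multiprocessing import Pool


def split_into_runs(records, n_runs):
    """Split the records into at most n_runs contiguous slices."""
    size = max(1, -(-len(records) // n_runs))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def merge_sorted_runs(sorted_runs):
    """Merge already-sorted runs into one sorted list (k-way merge)."""
    return list(heapq.merge(*sorted_runs))


def parallel_sort(records, workers):
    """Sort each run in its own worker process, then merge the results."""
    runs = split_into_runs(records, workers)
    with Pool(workers) as pool:
        sorted_runs = pool.map(sorted, runs)
    return merge_sorted_runs(sorted_runs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical records: (p_value, marker_i, marker_j); tuples sort by
    # p_value first, so the smallest p-values come out on top.
    records = [(rng.random(), rng.randrange(1000), rng.randrange(1000))
               for _ in range(10_000)]
    ranked = parallel_sort(records, workers=4)
    assert ranked == sorted(records)
    print(ranked[0])
```

In this sketch the final merge runs in a single process and holds all the data in memory, which would not suit 200 GB of output; a production sort would also need to distribute the merge step across processors.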
This project has enabled the exploration of new territory for genetic marker analysis in colorectal cancer, and plans are underway to enhance this study by analysing even larger datasets.
Acknowledgements
The primary research work, including patient recruitment and genetic analysis, is funded by Cancer Research UK, the Medical Research Council (MRC), the Scottish Executive and CORE. EPCC acknowledges the help of the HPCx support team. The CCGG is a University of Edinburgh research group based at the MRC Human Genetics Unit at the Western General Hospital in Edinburgh.