RCC and other UQ staff are working to make it easier for the university’s researchers to use the Amazon cloud.
The project team recently completed an experiment bridging UQ’s data infrastructure in Brisbane with Amazon’s cloud data centre in Sydney.
The project tested the idea of running a high-throughput computing (HTC) workload, in this case a Genome-Wide Association Study (GWAS), on Amazon's Elastic Compute Cloud (EC2) while accessing data collections in QRIScloud at the Polaris Data Centre.
Many UQ researchers use QRIScloud for data storage, sometimes without realising it: UQ's data storage fabric, MeDiCI (Metropolitan Data Caching Infrastructure), seamlessly transfers data from QRIScloud storage in Polaris to a user's device, hiding the data movement from the user entirely.
While not optimal for traditional HPC workloads, cloud computing platforms such as Amazon's are well suited to HTC, in which many independent sequential jobs are executed as quickly as possible. Because this class of application does not require very fast interprocessor networking, platforms such as EC2, whose internal networks have relatively low performance and high latency, can execute the workload much as UQ's HPC clusters do.
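To make the distinction concrete, the minimal Python sketch below (purely illustrative, not the project's code) shows the shape of an HTC workload: many independent jobs that never communicate with one another, which is why interconnect speed has little bearing on throughput.

```python
# Illustrative sketch of a high-throughput workload: many independent
# jobs with no communication between them, so node-to-node network
# latency does not limit throughput. Not the project's actual code.
from concurrent.futures import ProcessPoolExecutor

def run_job(job_id: int) -> str:
    # Stand-in for one sequential analysis task (e.g. one GWAS chunk).
    return f"job {job_id} done"

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Each job runs to completion independently of all the others.
        results = list(pool.map(run_job, range(1000)))
```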
For this experiment, the project team ran a GWAS for Dr Allan McRae of UQ's Institute for Molecular Bioscience.
A GWAS is a hypothesis-free method for identifying associations between regions of the genome and complex traits and diseases.
Dr McRae's study aimed to test how genetic variation alters DNA methylation, an epigenetic modification that controls how genes are expressed. This analysis was performed on data from the Systems Genomics of Parkinson's Disease consortium, which has collected DNA methylation data on about 2,000 individuals.
The results from this analysis are being used to understand the biological pathways through which genetic variation affects disease risk. The association testing was performed using the PLINK software.
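The study's exact PLINK commands are not reproduced here, but a hypothetical sketch of the kind of invocation involved is shown below, with placeholder file names. PLINK's --linear flag runs a linear-regression association test on a quantitative trait, such as the methylation level at one probe.

```python
# Hypothetical sketch: driving PLINK for one quantitative-trait
# association test. File names and paths are illustrative only.
import subprocess

def run_association(probe_pheno: str, out_prefix: str) -> None:
    subprocess.run(
        [
            "plink",
            "--bfile", "genotypes",  # binary genotype fileset (.bed/.bim/.fam)
            "--pheno", probe_pheno,  # quantitative phenotype: one methylation probe
            "--linear",              # linear-regression association test
            "--out", out_prefix,     # writes <out_prefix>.assoc.linear
        ],
        check=True,
    )

run_association("probe_cg000001.txt", "results/cg000001")
```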
RCC staff created a compute cluster in EC2, not unlike the Awoonga cluster in Polaris, using locally developed Ansible scripts. These scripts make it relatively easy to build and configure a remote cluster repeatably, so the system is deployed only when it is required and shut down once the experiment is complete. This matters in a commercial cloud because Amazon charges per second even if the nodes are idle.
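RCC's Ansible scripts are not published here, but the deploy-on-demand pattern they implement can be sketched with Amazon's boto3 Python library; the AMI ID, instance type and node count below are placeholders.

```python
# Hypothetical sketch of the deploy-use-terminate pattern with boto3.
# Placeholders throughout; RCC's actual automation uses Ansible.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # Sydney region

# Launch compute nodes only when the experiment is ready to run.
resp = ec2.run_instances(
    ImageId="ami-XXXXXXXX",      # placeholder cluster-node image
    InstanceType="c5.18xlarge",  # illustrative compute-optimised type
    MinCount=1,
    MaxCount=20,
)
instance_ids = [i["InstanceId"] for i in resp["Instances"]]

# ... build, configure and run the workload here ...

# Terminate immediately afterwards: Amazon bills per second,
# so idle nodes waste money.
ec2.terminate_instances(InstanceIds=instance_ids)
```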
A novel aspect of this project was the way the team “connected” the EC2 resources to QRIScloud’s data collection storage. To do this, they extended MeDiCI and built a special-purpose “virtual” MeDiCI cache node in EC2.
The cache provided local access to data that was actually stored in QRIScloud, fetching the necessary files transparently on demand. Likewise, output files were written back to QRIScloud without the user's involvement. This meant the application did not have to be modified and ran exactly as it would on a UQ cluster.
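MeDiCI's implementation is far more sophisticated, but the on-demand caching idea can be illustrated with a toy Python function: the first access fetches a file from remote storage into a local cache, and later accesses are purely local. MeDiCI does this at the filesystem level, so applications need no changes at all.

```python
# Toy illustration of on-demand caching; not MeDiCI's implementation.
import shutil
from pathlib import Path

CACHE = Path("cache")
CACHE.mkdir(exist_ok=True)

def open_cached(remote: Path):
    local = CACHE / remote.name
    if not local.exists():          # cache miss: fetch on demand
        shutil.copy(remote, local)  # stand-in for the WAN transfer
    return open(local, "rb")        # cache hit: purely local I/O
```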
Another novel aspect of the project was the use of the RCC Nimrod job scheduler. Nimrod, a specialised parametric modelling system, makes it very easy for a user to create and execute high-throughput experiments: the user simply writes a small description of how to run the code and which parameters should be explored, as sketched below.
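Nimrod's own plan-file syntax is not shown here; the following Python sketch merely illustrates the parametric-sweep idea, with hypothetical parameter names, in which every combination of parameter values becomes one independent job.

```python
# Conceptual sketch of a parametric sweep of the kind Nimrod manages.
# Parameter names and ranges are hypothetical, not from the study.
from itertools import product

chromosomes = range(1, 23)      # one job per chromosome...
probe_batches = range(1, 501)   # ...per batch of methylation probes

jobs = [
    f"run_gwas.sh --chrom {c} --batch {b}"
    for c, b in product(chromosomes, probe_batches)
]
print(f"{len(jobs)} independent jobs to schedule")
```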
Nimrod executed the experiment's 500,000 jobs efficiently, spreading the load across the compute nodes and completing the work in three days.
Overall, approximately 60 TB of data were generated by the experiment and sent back to the Polaris Data Centre for long-term storage and post-processing.
In technical terms, the experiment used 750 Xeon Skylake cores for just over three days. System utilisation averaged 85–92 per cent per node, with I/O peaking at about 420,000 input and 25,000 output operations per second (IOPS).
RCC wishes to thank Amazon for their generous donation of computing credits.
If UQ researchers would like to test the system on their research projects, they are invited to contact RCC to discuss the options available: rcc-support@uq.edu.au.