A simple solution to a big data problem

18 Feb 2020
Dr Ati Taherian Fard at her desk at AIBN.

If your data is too big for a high-performance computer, try a different HPC.

That’s what UQ researcher Dr Ati Taherian Fard did, saving herself some valuable research time in the process.

Ati, a Postdoctoral Research Fellow in Associate Professor Jessica Mar’s Group at the Australian Institute for Bioengineering and Nanotechnology (AIBN), visited UQ’s Hacky Hour last year as she had problems running her jobs on RCC-managed HPC Awoonga.

She needed to process a large file, but on Awoonga she had to cut the data into smaller pieces because it exceeded the 1 TB the HPC allows within its limits of up to 30 or 90 days.

At Hacky Hour, a weekly eResearch support meetup for researchers, RCC eResearch Analyst and HPC specialist Dr Marlies Hankel suggested Ati try FlashLite, RCC’s data-intensive HPC.

While Awoonga, a conventional HPC, is great for jobs with relatively low data input/output (I/O) requirements, FlashLite is more suited to applications that directly manipulate large amounts of data and require large main memories.

“Ati can now use the flash drives on FlashLite, which allow more than 4 TB of space, avoid the need to copy data, and enable her to process complete data files,” said Marlies.

Thanks to FlashLite, Ati was able to transfer, pre-process and process her data, and apply different statistical tests to it, in a timely manner and without having to split her data into smaller subsets.

“FlashLite makes data pre-processing and statistical analysis much faster and more efficient,” she said.

Ati, a bioinformatics researcher, is investigating the effect of gene expression variability in stem cells and ageing using single-cell transcriptomics data.

“The main aim of this project is to study the heterogeneity in single cell subpopulations during senescence [biological aging], identify genes that are the key drivers of this process and study their functional implications,” she said.

“Single cell RNA sequence data are usually very large and require a lot of memory to process and analyse. My data set is a numerical matrix with more than 100,000 columns and 50,000 rows to start with. Using FlashLite, I am able to process the whole matrix in one go and compute the output in less than 24 hours.”
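To give a rough sense of scale, the sketch below estimates the memory needed just to hold a matrix of that size. It assumes a dense matrix of 8-byte floating-point values, which is an illustrative assumption and not a detail from Ati's project; single-cell data is often stored in sparse formats, and intermediate copies made during analysis can multiply the requirement. The function name `dense_matrix_bytes` is hypothetical.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_value: int = 8) -> int:
    """Return the bytes needed to hold a dense rows x cols matrix in memory."""
    return rows * cols * bytes_per_value


# Dimensions quoted in the article (both are lower bounds: "more than").
rows, cols = 50_000, 100_000

# Assumption: every value is stored as an 8-byte float (float64).
size_gb = dense_matrix_bytes(rows, cols) / 1e9
print(f"{rows:,} x {cols:,} float64 matrix: about {size_gb:.0f} GB")
```

Under these assumptions a single copy of the matrix comes to about 40 GB before any processing begins, which helps explain why a large-memory node suits this kind of analysis.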

Marlies helped Ati set up her FlashLite account and requested the best allocation for her data. “Overall, she was very informative and helpful in answering my FlashLite-related questions,” said Ati.

Having attended one of RCC’s monthly hands-on HPC workshops before using Awoonga, Ati had already learnt how to run her code and install all its dependencies through shell scripts on UQ’s clusters. She had also gained links to online resources for further self-study.

The research project she is involved in is still in its early stages. Samples were sequenced and the data was received in late 2019. The project team is currently analysing the data.

It is hoped the project will result in a better understanding of the molecular mechanism driving senescence. Identifying the genes driving this process, along with the cell-to-cell differences between young and old at a higher resolution, could enable early detection of age-related diseases and, through early intervention, prolong healthy ageing.

The project is being conducted at AIBN, with Associate Professor Jessica Mar as the Principal Investigator and in collaboration with Professor Ernst Wolvetang’s Group. 

UQ researchers, if you’d like to discuss your HPC needs, please contact the RCC Support Desk.
 

Summary

Researcher:
Dr Ati Taherian Fard
Postdoctoral Research Fellow
AIBN, UQ

Research community:
Genomics, Bioinformatics and Statistics

FlashLite usage:
1 node
24 CPUs
500 GB memory for data analysis.
