Machine Learning
Given the growing influence of Machine Learning (ML) on the research sector, RCC has dedicated significant resources to provide infrastructure and operational expertise to support this important field.
In terms of infrastructure, RCC has provided large, powerful GPU clusters with high-performance storage that are ideal for ML workloads. When purchased, these clusters incorporate a mix of the best available technology to meet UQ’s research goals. Notable features of these clusters relevant to ML include:
Bunya
Bunya provides more ML capability than the previous, now decommissioned, HPC Wiener, and this capability grows as its GPU capacity is expanded. Currently Bunya provides:
The latest NVIDIA H100 and A100 GPUs, which offer very high computational power and memory capacity.
In addition to NVIDIA GPUs, AMD’s ROCm-based GPUs have been added. This allows the use of technology from another major vendor in the GPU space.
CPU-only nodes. In ML, data staging often involves considerable processing that needs only CPU, RAM and storage resources. It is more efficient to run these tasks on dedicated CPU nodes, which frees GPU nodes for tasks that require GPU resources.
Bunya also has an up-to-date software stack in terms of kernel versions, compilers, etc.
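As a rough illustration of the CPU-only data staging mentioned above, the following Python sketch cleans a set of raw text shards in parallel across CPU cores, with no GPU involved. It is a minimal sketch under stated assumptions, not an RCC-provided tool: the function names (normalise_shard, stage_shards, available_cpus), the shard layout and the cleaning step are all hypothetical, and a real pipeline would substitute its own preprocessing.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def available_cpus() -> int:
    """CPUs this process may use; falls back to the machine's total core count."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def normalise_shard(src: str, dst_dir: str) -> str:
    """Hypothetical CPU-only step: strip and lower-case each line of one shard."""
    dst = Path(dst_dir) / Path(src).name
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = line.strip().lower()
            if cleaned:  # drop blank lines
                fout.write(cleaned + "\n")
    return str(dst)


def stage_shards(sources, dst_dir, workers=None):
    """Preprocess every shard and return the staged file paths in input order."""
    os.makedirs(dst_dir, exist_ok=True)
    sources = list(sources)
    if workers is None:
        workers = available_cpus()
    if workers <= 1:
        # Serial fallback: no process pool is created.
        return [normalise_shard(s, dst_dir) for s in sources]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalise_shard, sources, [dst_dir] * len(sources)))


if __name__ == "__main__":
    # Small self-contained demo using temporary files (illustrative data only).
    with tempfile.TemporaryDirectory() as tmp:
        raw_paths = []
        for i in range(4):
            path = os.path.join(tmp, f"shard_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(f"  Sample LINE {i}  \n\n")
            raw_paths.append(path)
        staged = stage_shards(raw_paths, os.path.join(tmp, "staged"))
        print(f"staged {len(staged)} shards using up to {available_cpus()} CPUs")
```

The staged output would then be read by a separate training job on a GPU node, so that GPU time is spent only on work that needs a GPU.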