Machine Learning
Given the growing influence of Machine Learning (ML) on the research sector, RCC has dedicated significant resources to provide infrastructure and operational expertise to support this important field.
In terms of infrastructure, RCC has provided large, powerful GPU clusters with high-performance storage that are ideal for ML workloads. When purchased, these clusters incorporate a mix of the best available technology to meet UQ’s research goals. Notable features relevant to ML on these clusters include:
Wiener
The Wiener cluster is powered by NVIDIA’s Tesla V100 GPUs. It comprises 17 nodes with two GPUs each and 15 nodes with four GPUs each. These GPUs can be combined using sophisticated hardware and system software to provide very large computational capacity for ML tasks.
The software stack consists of various versions of the CUDA toolkit and the Anaconda platform. This gives users the flexibility to deploy the ML frameworks of their choosing.
The interconnection of GPUs is supported by NVIDIA’s high-performance protocol and backed by a high-performance interconnect. This is configured in the cluster’s software stack and is transparent to users. Wiener also utilises a high-performance storage system, which is critical for large ML tasks.
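Because Wiener nodes carry either two or four GPUs, a job may want to confirm how many devices it can see before configuring a framework. The following is a minimal sketch, assuming the `nvidia-smi` command is on the PATH of the GPU node; the helper names `parse_gpu_list` and `visible_gpus` are hypothetical and not part of any RCC tooling.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Turn 'name, memory' CSV lines from nvidia-smi into (name, memory) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first comma only: the first field is the GPU name.
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


def visible_gpus():
    """Return the GPUs nvidia-smi reports, or an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # assumption: no nvidia-smi means no NVIDIA GPUs on this node
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for name, memory in gpus:
        print(f"  {name} ({memory})")
```

Run inside a job, this lists only the GPUs that the driver exposes on that node; on a node without NVIDIA GPUs it simply reports zero devices.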
Bunya
Bunya is RCC’s latest supercomputer and will provide more ML capability than Wiener as its GPU capacity is expanded. Currently, Bunya provides:
The latest NVIDIA H100 and A100 GPUs, which are very powerful in terms of both computational power and memory capacity.
In addition to NVIDIA GPUs, AMD’s ROCm-based GPUs have been added. This allows the use of technology from another major GPU vendor.
CPU-only nodes. In ML, data staging often involves considerable processing that needs only CPU, RAM and storage resources. It is more efficient to run these tasks on dedicated CPU nodes, which frees GPU nodes for tasks that require GPU resources.
Bunya also has an up-to-date software stack in terms of kernel versions, compilers, etc.
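The CPU-only data staging described above can be sketched as a preprocessing step spread over worker processes. This is an illustrative example only, using Python's standard library; the functions `clean_record` and `stage`, and the text-cleaning step itself, are hypothetical stand-ins for whatever preprocessing a real ML pipeline needs.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def clean_record(text):
    """CPU-only preprocessing step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def stage(records, workers=1):
    """Preprocess records, using worker processes when workers > 1."""
    if workers <= 1:
        return [clean_record(r) for r in records]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches records so per-task overhead stays small
        return list(pool.map(clean_record, records, chunksize=256))


if __name__ == "__main__":
    # Hypothetical raw input; real jobs would read from storage instead.
    raw = ["  Hello   World ", "GPU\tNodes\nStay  Free"] * 1000
    staged = stage(raw, workers=os.cpu_count() or 1)
    print(len(staged), staged[0])
```

On a scheduler-managed node, the number of cores allocated to a job may be smaller than `os.cpu_count()` reports, so passing the allocated core count as `workers` is likely the safer choice.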