Abstract
Comet is the San Diego Supercomputer Center’s most recent NSF-funded (National Science Foundation) supercomputer. It is a nationally allocated resource and was the first of the NSF-funded “supercomputers” to support virtualised high-performance computing.
Comet itself has 1944 dual-socket (24-core) compute nodes, 36 x 2GPU/node (K80) and a recently added 72 x 2GPU/node (P100), all connected via InfiniBand (4:1 oversubscription), with 10GbE for administration.
This talk is about what’s underneath the system: software, handling virtual clusters, I/O configuration, networking (both Ethernet and Single Root I/O Virtualization (SR-IOV) InfiniBand), and a nod to achievable performance. We’ll go “into the trenches” and talk about how virtual clusters are presented to the user, what this has required in software development, and where we’ve seen “interesting” performance issues. In particular, we’ll talk about the idea that virtual clusters for HPC should look as much like the physical cluster as possible. This includes retaining disk state for virtualised compute nodes and being able to install a virtual cluster with no pre-existing “magic” system image.
Comet’s InfiniBand network is bridged to a large Ethernet fabric with 72 x 40GbE in eight channel groups. We’ll dive into this networking configuration and some of the issues we’ve had to contend with to overcome the gap between “advertised capability” and “in the field reality”.
We’ll conclude the talk with a perspective on the future of virtualisation in HPC including how full virtualisation and container virtualisation complement each other.
Presentation slides (1 MB, PDF)
About the Speaker
Dr Philip Papadopoulos received his PhD in 1993 from UC Santa Barbara in Electrical Engineering.
He spent five years at Oak Ridge National Laboratory as part of the Parallel Virtual Machine (PVM) development team.
He came to the University of California, San Diego (UCSD) as a research professor in computer science in 1998 and still holds an adjunct appointment.
He is currently the Chief Technology Officer at the San Diego Supercomputer Center (SDSC). He is the architect of the NSF-funded Comet cluster, which supports high-performance virtual clusters.
In addition to duties at SDSC, his research interests revolve around distributed, clustered, and cloud-based systems and how they can be used more effectively in an expanding bandwidth-rich environment.
Dr Papadopoulos is a key investigator for research projects at UCSD, including the National Biomedical Computation Resource (NBCR) and the Pacific Rim Applications and Grid Middleware Assembly (PRAGMA, OCI-1234983).
He is well known for leading the development of the open-source, NSF-funded Rocks Cluster toolkit (OCI-0721623), which has an installed base of 1000s of clusters. Rocks (www.rocksclusters.org) is used for both research and production systems with scalability to 1000s of nodes.
He is also the principal investigator for Prism@UCSD: A Researcher Defined 10 and 40Gbit/s Campus Scale Data Carrier (NSF:ACI-1246396). Prism is deployed at UCSD and serves about 12 data-intensive labs on the UCSD campus.
Dr Papadopoulos occasionally teaches in the computer science department. He is an avid hiker.