RCC at SC22: Our staff report on the world's largest HPC conference

22 Dec 2022
RCC's Sarah Walters at SC22, her first international conference. (Photo: Jake Carroll, RCC.)

Three RCC staff attended the world’s largest supercomputing conference this year — SC22 in Dallas, Texas, from 13–18 November.

RCC Director Professor David Abramson, Chief Technology Officer Jake Carroll, and Research Computing Systems Engineer Sarah Walters each filed reports about their experiences at this year's International Conference for High Performance Computing, Networking, Storage, and Analysis, AKA SC22.

David is a regular SC attendee and has been to about 20 of the conferences. He has also served on SC program and organising committees many times and was Technical Papers Chair in 2021. Jake has attended six SC conferences, starting in 2015, while SC22 was Sarah’s first international conference.

SC22 welcomed 11,830 attendees and featured 361 exhibit booths on the show floor. 

The annual conference was established in 1988 by the Association for Computing Machinery (ACM) and the IEEE [Institute of Electrical and Electronics Engineers] Computer Society.
 


ACM/IEEE Supercomputing back in the race

By Professor David Abramson, RCC Director

Along with a number of colleagues, I have just returned from the 2022 ACM/IEEE Supercomputing conference (SC22). This annual event usually attracts up to 15,000 attendees and is the major annual supercomputing event globally. This year’s attendance figures were more than 11,000, which indicates that the community is well and truly prepared for a post-pandemic lifestyle!

SC has a huge show floor, with booths from all major vendors and many research labs and institutions. Australia has often been represented on the show floor, and this year NCI Australia from Canberra and the Pawsey Supercomputing Research Centre from Perth shared a booth. It was great to chat to fellow Australians and to see us present at this global event.

Last year I was privileged to attend SC21 to accept the annual Ken Kennedy Award and to give a keynote speech. This year, I was equally privileged to see a number of close colleagues and friends recognised for their work.

 

See the SC22 website for details about all award winners at the conference.

There were also a terrific number of invited talks from people working at the leading edge of supercomputing and the applications of supercomputing. I encourage people to take a look at the SC22 website and follow up on things they might have missed.

Professor Jack Dongarra's Turing Award speech at SC22. (Photo: Sarah Walters, RCC.)

 


SC22: “I broke my step count record!”

By Sarah Walters, RCC Research Computing Systems Engineer
 

I was delighted to be invited to join David Abramson and Jake Carroll for SC22 in Dallas, Texas.

This was the first international conference I have attended, so the scale of it was a new experience for me.

More than 11,000 people descended on the Kay Bailey Hutchison Convention Center in Dallas to participate. In the era of COVID, seeing so many people in one place was a bit overwhelming! The global high-performance computing (HPC) community has a huge appetite to get together to share knowledge and meet with their peers, and everywhere you looked there was a flood of people.

There was so much to see at SC22 that I broke my step count record scrambling around trying to cram as much in as possible!

The talk by this year's Turing Award winner Jack Dongarra about his long history in HPC and the development of the benchmarks behind the TOP500 list was both entertaining and inspiring. It's incredible to see how far we have come and the challenges that needed to be resolved along the way. And, of course, there are always new problems! Maria Girone, Chief Technology Officer of CERN openlab, gave an excellent talk on the enormous challenges of managing the stupendous amount of data generated by the Large Hadron Collider.

I really enjoyed the opportunity to see the many problems that users and administrators of high-performance computing and big storage are facing and the solutions that are being developed to tackle these issues.

The issues of portability, reproducibility and consistent-build pipelines have fostered a huge interest in containers. I attended a workshop on the use of containers in HPC, and I'm excited to build some containers for UQ’s new HPC Bunya.

Security is a growing concern in the HPC space. One presentation discussed "Historical DevOps", in which code changes are analysed against known security flaws right down the dependency chain, allowing pre-emptive detection of potential security holes. The Security Workshop on the final day was well attended, and there was vigorous discussion between the panellists and the audience.

Historically, security is something that has been "sprinkled over the top" in the HPC space (a direct quote from Rickey Gregg of the US Department of Defense's HPC Modernization Program during an SC22 panel discussion), but the community is now increasingly looking at how we can assess risk and design security into our systems from the beginning, without impacting the ability of researchers to run their workloads.
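
The "Historical DevOps" idea can be made a little more concrete with a sketch. The following is my own toy illustration, not anything shown at SC22: it walks the packages installed in a Python environment and checks them against a table of known-vulnerable versions. The advisory table here is a hypothetical placeholder; a real pipeline would pull advisories from a vulnerability database and track them against every change in the dependency chain.

```python
# Toy illustration only: scan the installed Python packages on a system
# against a (hypothetical) table of known-vulnerable versions, the way a
# dependency-chain security scan might flag risk before jobs ever run.
from importlib.metadata import distributions

# Placeholder advisory data -- a real tool would query a vulnerability
# database or vendor advisories rather than hard-code entries like this.
KNOWN_BAD = {
    ("examplepkg", "1.2.3"): "EXAMPLE-2022-0001: remote code execution",
}

def scan_environment():
    """Return (package, version, advisory) tuples for any flagged packages."""
    hits = []
    for dist in distributions():
        name = (dist.metadata["Name"] or "").lower()
        advisory = KNOWN_BAD.get((name, dist.version))
        if advisory:
            hits.append((name, dist.version, advisory))
    return hits

if __name__ == "__main__":
    for name, version, advisory in scan_environment():
        print(f"WARNING: {name}=={version} matches {advisory}")
```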

There were so many more talks I was unable to attend, often because there would be two or three or even four relevant talks on at the same time!

There is huge demand for accelerators on HPC systems. You could see it in what the vendors at SC22 presented, in the sheer number of liquid-cooling options on the exhibition floor, and in the talks on how to share GPUs between multiple users (or containers!) to get the maximum use out of them and, conversely, how to run MPI (Message Passing Interface) jobs between GPUs in the way we now run them between CPUs to enable enormous parallel processing.
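
To make that last point concrete, here is a minimal sketch of passing data directly between GPU buffers with MPI. It is my own toy example, not something presented at SC22, and it assumes mpi4py with CuPy arrays on top of a CUDA-aware MPI library, with one GPU visible per rank.

```python
# Toy sketch: GPU-to-GPU message passing with mpi4py and CuPy.
# Assumes a CUDA-aware MPI build and one GPU per MPI rank; run with
# something like: mpirun -np 2 python gpu_mpi_sketch.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank allocates its buffers in GPU memory.
n = 1_000_000
send_buf = cp.full(n, rank, dtype=cp.float32)
recv_buf = cp.empty(n, dtype=cp.float32)

# Exchange buffers between ranks 0 and 1. With a CUDA-aware MPI, mpi4py
# hands the device pointers straight to the library, so the data never
# has to be staged through host memory.
peer = 1 - rank
comm.Sendrecv(send_buf, dest=peer, recvbuf=recv_buf, source=peer)

cp.cuda.Device().synchronize()
print(f"rank {rank} received a buffer filled with {float(recv_buf[0])} from rank {peer}")
```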

The real take-home here is that working with our vendors and keeping abreast of emerging technologies is critical to enabling the world-class research being performed at the University of Queensland. There is a lot to do to keep supporting our users as the data and complexity of research grow.

 


Jake's take: General HPC industry trends at SC22 

By Jake Carroll, RCC Chief Technology Officer

Jake Carroll (right) talking to a vendor on the SC22 show floor. (Photo: Sarah Walters, RCC.)
  • For the first time ever, SC had a dedicated focus on cybersecurity principles in supercomputing. The topic proved so popular that the full-day community workshop on these matters, held on the last day of the conference, was at capacity. Given that popularity and focus, SC will run another full-day cybersecurity workshop next year.
     

  • Storage is the same challenge it has always been, with ever-increasing outputs from scientific instruments and simulation, but the conversation has now shifted to automating how we "throw away" data programmatically and with certainty, so that we can manage storage more sustainably. An entire workshop at SC this year was devoted to the automation and instrumentation of how we detect, understand and deal with raw data we do not need as part of our simulations and science.
     
  • We are now seeing exascale systems commissioned. That brings a new set of benchmark results pointing to a different set of efficiency challenges we must chase down; the focus is now on the time spent in communication inside a system, which is where the inefficiency lies. The big question: how might we fix this?
     
  • AI and machine learning are still at the front of everyone's minds, and we are now seeing proof of life in AI and machine learning being used in very large traditional HPC simulations to shorten research time-to-discovery and improve result accuracy.
     
    • In-situ solvers using AI and machine learning models appear to be gaining much wider acceptance in the community now.
       
  • Differentiated architectures (ARM, RISC-V, x86, GPU-specific) in the same system are becoming easier to work with. Many of the hardware companies are working hard to provide one SDK to rule all architectures, so that code will run seamlessly on a CPU, GPU, APU or other accelerator without the need to recompile for a specific target. Whether this approach gains widespread acceptance is another matter.
     
  • Very large wafer technologies as an alternative to traditional CPUs and GPUs are becoming more realistic. More than one company is now fabricating "wafer scale" processors for artificial intelligence, inference and massive model training, with the promise of being more efficient and faster than traditional GPU approaches.
     
  • Large studies are being conducted into the performance variability of accelerator silicon. What these studies have shown thus far is that highly variable GPUs stay highly variable, despite different cooling methods, placement in racks and the types of jobs run on them, and that the variability sits in the GPU cores, not in the high-bandwidth memory. This confirms to industry that subtle differences in manufacturing quality exist from GPU to GPU, but they are not enough to make a part fail, nor to call it defective. It means we need to consider more carefully how we might schedule jobs to be aware of these subtle differences (a toy sketch of the kind of per-GPU benchmark behind such studies appears after this list).
     
    • This has opened a whole new domain in scheduler design, and in the reactivity and feedback loops through which schedulers operate with a deeper awareness of their environment. Such features and capabilities may end up included in schedulers such as SLURM (the scheduler used by the UQ HPC systems Bunya and Wiener).
       
  • The rise of chiplet-based architecture for CPU/APU/GPU design.  
     
    • This technique facilitates newer approaches to combining GPU, CPU, high-bandwidth memory and high-performance interconnects into a single, easily deployable package.
       
    • It can reduce yield issues and cost in some cases, both of which were becoming a problem for the industry with more traditional monolithic silicon designs.
       
  • This chiplet approach, combined with the ever-increasing transistor count, has the knock-on effect of escalating thermal limits (sometimes called TDP or Thermal Design Point) inside HPC systems, which we must actively account for in our next-generation computing platforms. What we gained from smaller and smaller lithography (down near 4 nm and 3 nm), we lost in how much density we can actually achieve, because of the heat these parts produce.
     
  • Liquid cooling systems are now becoming a very popular way of cooling HPCs and will become a mainstay in the future given the thermal limits mentioned above. 
     
  • Composable systems using technology standards, such as CXL, are getting nearer to market (but are still only in technology demo mode). 
     
    • CXL will facilitate the construction of large namespaces, memory sharing, accelerator aggregation, cache, storage, process transparency and memory heterogeneity as a new industry standard. This may drive down the cost of dense systems by using building blocks of commodity components, creating more flexible platforms for various types of workloads without the premium that custom silicon once commanded to achieve the same thing.
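
Picking up the earlier point about GPU-to-GPU variability, the sketch below is my own toy illustration (not taken from any of the SC22 studies): it times the same matrix multiplication on every visible GPU in a node with CuPy and reports the spread between the fastest and slowest device. A real study would use far more rigorous methodology and production workloads, but the basic idea is the same.

```python
# Toy sketch: measure how much the same compute-bound kernel varies from
# GPU to GPU within one node. Assumes CuPy and at least one NVIDIA GPU.
import time
import cupy as cp

def median_matmul_time(device_id: int, n: int = 4096, repeats: int = 10) -> float:
    """Median wall-clock time (seconds) of an n x n float32 matmul on one GPU."""
    timings = []
    with cp.cuda.Device(device_id):
        a = cp.random.random((n, n), dtype=cp.float32)
        b = cp.random.random((n, n), dtype=cp.float32)
        cp.matmul(a, b)                        # warm-up
        cp.cuda.Device(device_id).synchronize()
        for _ in range(repeats):
            start = time.perf_counter()
            cp.matmul(a, b)
            cp.cuda.Device(device_id).synchronize()
            timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]

if __name__ == "__main__":
    n_gpus = cp.cuda.runtime.getDeviceCount()
    results = {gpu: median_matmul_time(gpu) for gpu in range(n_gpus)}
    for gpu, t in results.items():
        print(f"GPU {gpu}: {t * 1000:.1f} ms per matmul")
    fastest, slowest = min(results.values()), max(results.values())
    print(f"Spread between fastest and slowest GPU: {100 * (slowest / fastest - 1):.1f}%")
```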
