Part 6: Creating a Sustainable Reality for HPC.

Where can you optimise for sustainable HPC?

This is part four of a six-part series on what sustainability means for the High Performance Computing Community.

In the world of innovation, efficiency stands as the cornerstone of progress. When diving into the realm of sustainable outcomes for supercomputing, it makes sense to find every means possible to optimise. While we’ve already explored the potential behind power and cooling efficiency, how you run the system itself matters just as much. So what’s the magic formula? According to Dan Shaw of Alces Flight, all it takes is the “Five I’s”: Identifying, Investigating, Informing, Improving and Iterating. These key components are what can take an HPC service from ordinary to extraordinary.

We were very pleased to host our team member, Dan Shaw, at our Sustainable Reality event at the National Space Centre on September 25, 2023. Here are our highlights from his talk, where he shared insights and practical tips on approaching optimisation in HPC.

Identifying optimisation opportunities means an eye to analysis.

Central to Dan’s work as Head of Engineering is monitoring for issues that produce drag on HPC services. One of the main concerns he sees across his cohort of systems is ‘idle compute’, where compute requests take over nodes but do nothing with them, costing the client both time and money. Monitoring systems appropriately is essential for teasing out this common problem, and is something built into the managed services Alces Flight provides.

“Monitoring for issues like idle compute is something that can be easily packaged into basic managed services plans,” Dan noted. “It is a key reason why we’ve seen a rise in requests for standard service offerings as part of integration.”
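
As a rough illustration of what an idle compute check might look like, here is a minimal Python sketch. It is not how Alces Flight's monitoring works; the `NodeSample` record, the `find_idle_allocated` helper, the node names and the 5% utilisation threshold are all hypothetical, and a real service would draw these figures from its scheduler and monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One monitoring sample for a compute node (hypothetical record layout)."""
    node: str
    allocated: bool   # True if a job currently holds the node
    cpu_util: float   # average CPU utilisation over the sampling window, 0.0 to 1.0


def find_idle_allocated(samples, threshold=0.05):
    """Return the names of nodes that are held by a job but are nearly idle.

    The threshold is an assumed cut-off for illustration, not a recommended value.
    """
    return [s.node for s in samples if s.allocated and s.cpu_util < threshold]


# Made-up sample data for illustration only.
samples = [
    NodeSample("node01", allocated=True, cpu_util=0.92),   # busy, as intended
    NodeSample("node02", allocated=True, cpu_util=0.01),   # held but doing nothing
    NodeSample("node03", allocated=False, cpu_util=0.00),  # free, so not a concern
    NodeSample("node04", allocated=True, cpu_util=0.03),   # held but doing nothing
]

idle_nodes = find_idle_allocated(samples)
print(idle_nodes)
```

The check only flags nodes that are both allocated and quiet, since an unallocated idle node is available to other users and is not the "idle compute" problem described here.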

Investigate options to ensure maximum efficiency.

Where his work in managed services aligns with the team on the ground is in seeking out the right channels for resolution. Using the data at hand, Dan and his team target the areas of concern and come up with potential solutions. Having options available means that when the time comes to engage directly with the customer, they aren’t wasting time validating the issue, as they already have everything they need.

“Having everything ready when looking to inform and improve upon issues is something that is beyond just what we do as engineers,” Dan noted. “It is simply best practice when working in collaboration to bring as much as you can to the table to hopefully bring about quick resolution – and most importantly – greater efficiency.”

Informing and improving on issues to drive sustainability

Once engaged with a client, it is all about making sure that the link to the project causing the efficiency drain, as well as the potential resolutions, is clearly communicated. “Initially you look for resolutions that will give immediate relief,” Dan noted. “So in the case of idle compute, things such as getting resource selection improved, minimising the idle compute, and optimising workflow can be done relatively quickly. But to get to the improvement stage you need to keep good records on basic items such as these so that at the point of hardware or resource refresh you don’t find yourself running right back into these same issues.”

To track a system as it grows and evolves over time, Dan’s team has been integral to the development of the Alces Flight Center toolset. Alces Flight Center serves as a comprehensive suite for maintenance and efficiency checks, as well as for reporting issues or creating requests. Within the system, HPC components are monitored and supported centrally in order to simplify maintenance and enable the client-side team to engage on projects such as overall optimisation of applications, workflows, and specialised system efficiency.

Iterate with the future in mind.

Dan’s engineering work is focused on eliminating the recurring problems of the past and freeing up the client to look towards the future. His work in managed services has shown him that keeping a solid process of monitoring and improvement in place, as well as documenting and strategically engaging with clients and researchers, can allow everyone to get the results they are looking for – with very little idle time.

Get the full picture.

Would you like to see Dan’s full presentation? Check it out below:

In our next post we’ll bring all this knowledge together and talk about how the University of York is touching on each area of sustainable improvement, and where they hope to impact the global HPC community in many positive ways.
