Part Two: The nUCLeus cloud HPC cluster environment
For over five years the Alces Flight crew has been studying cloud adoption within HPC services. In part two we focus on the design of the nUCLeus cloud HPC cluster environment.
Applying project goals to cluster environment architecture.
The CompBioMed Center of Excellence focuses on turning the science fiction of personalised medicine into the science fact of everyday clinical use. With students, researchers, clinicians, and medical staff queuing up to learn more about how computational biomedicine can improve health and wellbeing, the team focused on establishing foundational goals for HPC bioscience education.
With a consortium spanning the UK and EMEA, a desire to bring more people into the field, and a need for a flexible, scalable resource, the team at CompBioMed chose public cloud as their starting point for building up their HPC training.
With a platform and high-level goals decided, the team set about creating a cluster environment architecture that would fit their aims. Here’s how these goals shaped the design of nUCLeus:
Goal One: Keep it complementary.
Thanks to CompBioMed’s partnerships, they have access to some of the most powerful supercomputers in the world. So how would a public cloud architecture complement these resources without favouring one system over the other? In two ways: by designing a platform-agnostic cluster environment, and by not federating the environment to a specific on-premises system.
The Unfederated Agnostic nUCLeus.
The team took the decision to design an architecture that keeps nUCLeus unfederated and platform agnostic. This means nUCLeus is purposefully designed to move across platforms and change with the consortium’s needs. With such a strong set of institutions and partners working alongside CompBioMed, it is important that nUCLeus adapt and change based on who they are working with and which audience they are educating. While nUCLeus’ home is currently in the public cloud, the team constructed its components using the OpenFlightHPC stack. This open-source project, backed by Alces Flight, constructs HPC cluster environments on hardware, hybrid, and cloud systems. By working in open source, CompBioMed avoids lock-in to any specific platform and can deploy the nUCLeus HPC cluster environment wherever it is most beneficial to the consortium and its students.
Goal Two: Consolidate Knowledge
The CompBioMed team chose to target a popular, well-documented HPC application with modest compute requirements as the basis of their foundation coursework in HPC. This application, QIIME2 (and its predecessor, QIIME), had already featured in several onsite training courses, so bringing that knowledge together into one flexible, scalable cloud environment would be ideal for repeating and building on the course over time. The key requirement for the HPC cluster environment was ease of management.
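For readers new to QIIME2, the sketch below shows the flavour of the steps students work through: a minimal Python wrapper around the QIIME2 command line, assuming QIIME2 is installed in a conda environment on the cluster. The file names and truncation length are illustrative, drawn from QIIME2’s public tutorials rather than from the course’s actual materials.

```python
# Minimal sketch of a representative QIIME2 teaching workload.
# Assumes the 'qiime' CLI is on PATH (e.g. via an activated conda
# environment) and that demux.qza already exists in the working
# directory; file names and parameters are placeholders.
import subprocess

def qiime(*args: str) -> None:
    """Invoke the QIIME2 command-line interface, failing loudly on error."""
    subprocess.run(["qiime", *args], check=True)

# Summarise demultiplexed reads, then denoise them with DADA2 --
# a well-documented, modestly sized workload for a teaching cluster.
qiime("demux", "summarize",
      "--i-data", "demux.qza",
      "--o-visualization", "demux.qzv")

qiime("dada2", "denoise-single",
      "--i-demultiplexed-seqs", "demux.qza",
      "--p-trim-left", "0",
      "--p-trunc-len", "120",
      "--o-representative-sequences", "rep-seqs.qza",
      "--o-table", "table.qza",
      "--o-denoising-stats", "denoising-stats.qza")
```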
Alces Flight Center + nUCLeus
The CompBioMed team chose to subscribe to Alces Flight Center as their means of managing knowledge consolidation and the replication of HPC cluster environments. Providing centralised access for the creation, management, and evolution of HPC clusters and environments, Flight Center uses a series of components, or building blocks, to help clients structure their cluster management. For CompBioMed this meant:
- Keeping a small, persistent ‘test nUCLeus’ in place during the consolidation of knowledge.
- Creating and managing tests of nUCLeus using different instance types in order to agree on two ‘minimum viable clusters’ for the students to use in the live course.
- Creating and managing a series of events around workshops where the cluster environment would need to scale to support rapid turnaround of results.
- Maintaining a persistent nUCLeus HPC cluster environment post-course to clean, optimise, and turn around for the next cohort of students.
All of this was tasked and managed through Alces Flight Center, allowing the team to document and improve upon a process which would not only shape the QIIME2 course but would also feed into the structural creation of further foundation classes.
Goal Three: Think Collaboratively
Because this course had partners (UCL, the University of Sheffield, and Alces Flight) working across several remote sites, clear technical goals needed to be established and adhered to from the start. The team centred these goals on the methodology of “Who, How and When.”
Who would be involved
Planning for the course revealed periods where specific team members would require more or less access to the cluster environment itself. Tracking these via Alces Flight Center, the team noted points between testing, student onboarding, the live course, and post-course access where key decisions needed to be handed over to different members of the team. While the cluster environment maintained a consistent project lead, Alces Flight Center managed and documented when and how team members, and eventually the students, accessed and used the cluster environment. This information was then fed into a template from which future team plans could be drawn up and improved over time.
How would it work
Probably the most impactful work done in nUCLeus was establishing a minimum viable cluster for QIIME2. Because the course had previously been hosted on a number of platforms, the team had some idea of the resources needed; now they just needed to optimise. Because public cloud can present an overwhelming number of choices, the team narrowed the field by picking the instance type that most closely matched their last course, then scaled up and down to find the best possible configuration. It is important to note that for this course there would be two student-accessible versions of nUCLeus: the everyday workload version and the time-constrained workshop version (nicknamed ‘nUCLeolus’). Finding the optimum version of each cluster environment (while also testing the course materials) ensured that, come day one, students could focus on learning rather than contending with unintended time lags or workflow problems.
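As a rough illustration of that narrowing-down step, here is a minimal sketch of how measured benchmark times could be compared across candidate configurations to pick a minimum viable cluster. The instance types, node counts, prices, timings, and time budget below are invented placeholders, not CompBioMed’s actual figures.

```python
# Sketch: choose the cheapest configuration that still finishes the
# sample workflow within the course's time budget. All numbers are
# illustrative placeholders.

measured = {
    # (instance type, nodes): benchmark wall time in seconds
    ("c5.xlarge", 4): 1980.0,
    ("c5.2xlarge", 2): 1740.0,
    ("m5.2xlarge", 2): 2400.0,
}

hourly_rate = {  # assumed on-demand $/hour per node
    "c5.xlarge": 0.17,
    "c5.2xlarge": 0.34,
    "m5.2xlarge": 0.38,
}

TIME_BUDGET_S = 35 * 60  # longest run the course schedule can absorb

def run_cost(config):
    """Estimated cost of one benchmark run for a given configuration."""
    (itype, nodes), seconds = config
    return nodes * hourly_rate[itype] * seconds / 3600.0

# 'Minimum viable' here means: fast enough for the schedule, then cheapest.
viable = [c for c in measured.items() if c[1] <= TIME_BUDGET_S]
best = min(viable, key=run_cost)
print(f"minimum viable cluster: {best[0]}, est. run cost ${run_cost(best):.2f}")
```

The same comparison can be re-run as pricing or course materials change, which is what keeps a ‘minimum viable cluster’ honest over time.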
When would resources be needed
With an established team and a minimum viable cluster environment in place, it came down to managing the service itself. The project lead set up a series of events within Alces Flight Center spelling out the technical changes nUCLeus needed as students progressed through the course. By doing so, the team could work proactively on estimating budget over time and identify where slack was needed in how long resources were kept running. As each event passed it was documented and, where needed, the next event was adjusted to take on any lessons learned.
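To illustrate the budgeting side of this, here is a minimal sketch of estimating spend across a series of scheduled events. The event names, node counts, durations, and hourly rate are invented placeholders rather than figures from the actual nUCLeus service plan.

```python
# Sketch: proactive budget estimation across scheduled course events.
HOURLY_RATE = 0.34  # assumed on-demand $/hour per node (placeholder)

events = [
    # (event, nodes, hours the cluster stays up) -- all illustrative
    ("student onboarding", 2, 8),
    ("module: import and demultiplex", 4, 12),
    ("workshop day ('nUCLeolus', scaled up)", 16, 6),
    ("post-course clean-up", 1, 4),
]

total = 0.0
for name, nodes, hours in events:
    estimate = nodes * hours * HOURLY_RATE
    total += estimate
    print(f"{name:<40} ~${estimate:,.2f}")
print(f"{'estimated total':<40} ~${total:,.2f}")
```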
All of this information would then be wrapped up into an even more refined process for the next course.
How the nUCLeus architecture can help you plan your next cloud project.
While your course planning may be different, here are five questions you can ask to help move your objectives forward:
- Does the cluster environment need to mirror an on-premises system?
- Does (or will) the cluster environment need to move across platform?
- How will you manage your cloud cluster environment?
- How will your testing phase be designed and managed?
- Do you know the ‘who, how, and when’ of your cluster environment’s use?
Part Three of the series concludes with the lessons learned, including what nUCLeus is up to now. Read it here. Want to know the background to nUCLeus? Read Part One now. Or, if you would like to start planning your own cloud HPC education project, simply get in touch.