A Guide to Setting Up an HPC Cluster in the Cloud

Introduction

High-Performance Computing (HPC) clusters play a pivotal role in advancing scientific and computational research, powering simulations, data analysis, and complex calculations that traditional computing infrastructure may struggle to handle. As the demand for faster, more scalable, and efficient computing resources continues to grow, researchers and organizations are increasingly turning to cloud platforms to meet these requirements.

In recent years, cloud platforms like AWS (Amazon Web Services), Azure (Microsoft Azure), and GCP (Google Cloud Platform) have emerged as key players in providing on-demand and scalable computing resources. The flexibility and agility offered by these cloud services make them attractive choices for HPC workloads, allowing users to harness immense computational power without the need for massive upfront investments in physical infrastructure.

The trend of migrating HPC workloads to the cloud is gaining momentum. Traditionally, HPC clusters were hosted on-premises or in specialized data centers, posing challenges in terms of scalability, resource management, and cost. Cloud platforms, with their pay-as-you-go models and a rich array of services, address these challenges by offering a dynamic and scalable environment for HPC tasks.

In this blog, we will delve into the intricacies of setting up HPC clusters using the Slurm scheduler, a widely adopted job scheduler and resource manager for HPC environments. Our focus will be on comparing the setup processes and evaluating the pros and cons of deploying Slurm on three major cloud platforms: AWS, Azure, and GCP.

HPC on AWS

AWS supports HPC workloads through AWS ParallelCluster, an AWS-supported, open-source cluster management tool that simplifies deploying and managing HPC clusters on the AWS Cloud. Clusters can be provisioned from the command-line interface (CLI) or from the ParallelCluster User Interface (UI), a graphical interface for managing clusters, so the tool suits both seasoned administrators and users newer to HPC environments. AWS ParallelCluster can also be integrated with an Amazon Cognito user pool for user management.
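As a rough illustration of the CLI path, the sketch below assembles a minimal cluster definition in the shape used by ParallelCluster 3 configuration files, together with the matching pcluster create-cluster command line. Every concrete value here (region, subnet ID, key name, instance types, cluster name, file name) is a placeholder assumed for illustration, and the key names should be checked against the ParallelCluster documentation for the version you install. The real configuration file is written as YAML; JSON is used below only to print the structure without extra dependencies.

```python
import json
import shlex


def build_cluster_config(region, subnet_id, key_name, max_nodes):
    """Return a dict mirroring a minimal ParallelCluster 3 style config.

    All values are illustrative placeholders, not recommendations.
    """
    return {
        "Region": region,
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "t3.medium",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",
                    "ComputeResources": [
                        {
                            "Name": "c5large",
                            "InstanceType": "c5.large",
                            "MinCount": 0,  # scale down to zero when idle
                            "MaxCount": max_nodes,
                        }
                    ],
                    "Networking": {"SubnetIds": [subnet_id]},
                }
            ],
        },
    }


def build_create_command(cluster_name, config_path):
    """Return the argv list that asks the pcluster CLI to create a cluster."""
    return [
        "pcluster", "create-cluster",
        "--cluster-name", cluster_name,
        "--cluster-configuration", config_path,
    ]


if __name__ == "__main__":
    config = build_cluster_config("us-east-1", "subnet-PLACEHOLDER", "my-key", 4)
    print(json.dumps(config, indent=2))
    print(shlex.join(build_create_command("demo-cluster", "cluster-config.yaml")))
```

Keeping the definition as data makes it easy to version-control and to generate variants, for example one queue per instance type.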

HPC on Azure

Azure CycleCloud is the Azure tool dedicated to deploying, managing, and optimizing HPC clusters. It offers both a CLI and a web-based portal. One of its notable strengths is its integration with Slurm, which gives CycleCloud users the job scheduling, resource allocation, and workload management capabilities that Slurm provides. CycleCloud also goes beyond Slurm, allowing users to incorporate other schedulers and custom configurations. It integrates with Azure services such as Azure Virtual Machines, Azure Blob Storage, and Azure networking, and its resource management features are designed to keep compute resources well utilized and minimize idle time.

HPC on Google Cloud Platform

Slurm can be deployed on Google Cloud Platform through several routes: the Cloud HPC Toolkit, Terraform, and the Google Cloud Marketplace.

Deployment Options:

Cloud HPC Toolkit:

The Cloud HPC Toolkit provides a streamlined and user-friendly approach to deploying Slurm on Google Cloud Platform. This toolkit offers a curated set of resources and configurations, ensuring a simplified setup process.

Terraform Integration:

For users who prefer infrastructure as code, the deployment of Slurm on Google Cloud Platform can be orchestrated directly through Terraform. This allows for the automation and version-controlled management of HPC clusters.

Google Cloud Marketplace:

The Google Cloud Marketplace serves as a centralized hub for discovering, deploying, and managing software solutions. Users can easily find and deploy Slurm directly from the Marketplace, streamlining the integration process.

While Google Cloud Platform (GCP) offers versatile deployment options for HPC using Slurm, it's important to note that GCP currently does not provide a UI component dedicated to monitoring these deployments.

The rest of this blog compares the HPC solutions on these cloud providers, focusing on key aspects such as deployment methods, scheduler options, user management, storage integration, cost monitoring, and job submission.

Overview of Slurm Scheduler

The Slurm scheduler is a robust open-source tool designed for managing HPC workloads. At its core, Slurm efficiently allocates resources, schedules jobs, and monitors cluster activity. Key concepts include:

Control Nodes: Central servers managing the entire cluster, hosting the slurmctld daemon.

Compute Nodes: Worker nodes executing jobs, each running the slurmd daemon.

Job Scheduling: Intelligent assignment of jobs to resources based on priority, availability, and policies.

Partitions: Logical subdivisions enabling grouping of nodes based on factors like hardware or user access.

Benefits include scalability, flexibility, and strong community support. In this blog, we'll explore setting up Slurm on AWS, Azure, and GCP, examining how each integrates with Slurm for optimal HPC performance.
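To make the concepts above concrete, here is a minimal sketch of how a job request maps onto them: a batch script names a partition, asks for a number of compute nodes, and leaves placement to the scheduler. The helper render_batch_script and the example values (partition name, node counts, program name) are hypothetical and chosen only for illustration; the #SBATCH directives themselves are standard sbatch options.

```python
def render_batch_script(job_name, partition, nodes, tasks_per_node, time_limit, command):
    """Render a Slurm batch script as a string.

    partition  -- the logical group of nodes the job should run in
    nodes      -- how many compute nodes to request
    time_limit -- wall-clock limit in HH:MM:SS form
    """
    if nodes < 1 or tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID
        "#SBATCH --output=%x-%j.out",
        "",
        # srun launches the tasks on the nodes Slurm allocated to the job
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script("demo", "compute", 2, 4, "00:10:00", "./my_simulation"))
```

The same script should work unchanged on any of the three clouds, provided the partition name matches one defined in that cluster's Slurm configuration.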

Comparison of HPC Cluster Setup on AWS, Azure, and GCP

Comparison	AWS	Azure	GCP
Service/Tool Used	AWS ParallelCluster	Azure CycleCloud	Cloud HPC Toolkit
Scheduler	Slurm or AWS Batch	Slurm and other built-in schedulers; custom schedulers can be added	Slurm only
Deployment Method	CLI or CloudFormation template	Azure Marketplace or ARM template	Google Cloud console only
User Interface Support	ParallelCluster UI	CycleCloud UI	No UI component; users SSH directly into the head node
User Management and AD Integration	Supports Active Directory (AD) integration; a Cognito user pool is used for admin access to the ParallelCluster UI	Supports AD and Azure AD integration through custom projects; also has built-in user management through the CycleCloud UI	No direct user management integration; users must be added through the CLI on the head node
Storage Integration	Supports EFS and EBS integrations	Can attach an NFS file system when creating the cluster	Can mount file systems and NFS servers
Cost Monitoring	The ParallelCluster dashboard is used for cost management, integrated with Cost Explorer through tags	The CycleCloud dashboard provides metrics for cost monitoring	No dedicated dashboard; costs are viewed directly in GCP's Cloud Billing reports
Alerting	CloudWatch alarms	Alerts for specific usage quotas can be set in the CycleCloud UI	GCP budgets and alerts can send notifications at several thresholds
Job Submission	SSH into the head node to submit jobs	SSH into the head node to submit jobs	SSH into the head node to submit jobs
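Since job submission works the same way on all three platforms, it can be sketched once. The example below builds the ssh command that runs sbatch on the head node and extracts the job ID from the "Submitted batch job <id>" line that sbatch prints on success. The user name, host name, and script path are placeholders assumed for illustration, and the sketch only constructs the command rather than executing it.

```python
import re
import shlex

# sbatch reports a successful submission as "Submitted batch job <id>"
_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def build_ssh_sbatch_command(user, head_node, script_path):
    """Return the argv list that submits script_path on the head node via SSH."""
    remote_command = "sbatch " + shlex.quote(script_path)
    return ["ssh", f"{user}@{head_node}", remote_command]


def parse_job_id(sbatch_output):
    """Return the job ID from sbatch output, or None if submission failed."""
    match = _SUBMITTED.search(sbatch_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Placeholder user, host, and script; pass the list to subprocess.run to execute it.
    print(build_ssh_sbatch_command("alice", "head-node.example.com", "job.sh"))
    print(parse_job_id("Submitted batch job 12345\n"))
```

Capturing the job ID this way makes it straightforward to follow up with commands such as squeue or sacct for that job.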

Conclusion

Choosing the right cloud provider and HPC solution depends on specific requirements and preferences. AWS ParallelCluster offers a robust set of features with dedicated UI support and AD integration. Azure CycleCloud provides flexibility with support for multiple schedulers and a user-friendly UI. Google Cloud’s Cloud HPC Toolkit focuses on simplicity, with a direct deployment method via the console.

In the end, the decision should align with your organization’s priorities, budget considerations, and the specific demands of your HPC workloads.