Install and Configure ParallelCluster

The next step is to install AWS ParallelCluster, which we’ll use to create our cluster. A similar process can be followed to install it on your local machine, but for simplicity we’ll use AWS CloudShell.

 pip-3.7 install aws-parallelcluster==2.10.2 --user

In this tutorial we are using AWS ParallelCluster version 2.10.2; outside of this tutorial, it is recommended that you use the most recent version of AWS ParallelCluster.
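
Outside of this tutorial, a newer release can be installed the same way. A minimal sketch of an in-place upgrade (note that major version upgrades can change the CLI and config file format, so check the release notes before upgrading an existing environment):

# Upgrade to the latest AWS ParallelCluster release (this tutorial is written against 2.10.2).
pip-3.7 install --upgrade aws-parallelcluster --user
pcluster version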

Now that we have installed AWS ParallelCluster, we have access to the terminal command line tool “pcluster”. First, let’s confirm that we have the correct version installed.

pcluster version
2.10.2

We’ll go over the basics of how to use the “pcluster” command in this tutorial, but if you need help remembering commands at any time, see the command line usage documentation.

 pcluster -h
 usage: pcluster [-h]
     {create,update,delete,start,stop,status,list,instances,ssh,createami,configure,version,dcv}
                ...

pcluster is the AWS ParallelCluster CLI and permits launching and management
of HPC clusters in the AWS cloud.

[... output trimmed ...]

For a complete user guide to using AWS ParallelCluster, see the AWS ParallelCluster documentation.

AWS ParallelCluster Setup

Next, let’s configure AWS ParallelCluster.

Enter pcluster configure in a terminal window and walk through the configuration menu. This step generates a config file, which we will modify in the next section. See below for the options to select for this particular workshop (type a number or press Enter to accept the default answer).

Please remember to say yes to the automatic VPC creation and automatic subnet selection, with a public head node and private compute nodes (for production you may want to change the network settings to match your security and connectivity needs).

pcluster configure
Allowed values for AWS Region ID:
1. ap-northeast-1
. . .
16. us-west-2
AWS Region ID [us-east-1]: eu-west-1
Allowed values for EC2 Key Pair Name:
1. your-key
EC2 Key Pair Name [your-key]:  # Type the name of the key you generated earlier, e.g. cfd_ireland (see the note after this output if your key is not listed)
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]: 
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
Operating System [alinux2]: alinux2 
Minimum cluster size (instances) [0]: 0
Maximum cluster size (instances) [10]: 10
Head node instance type [t2.micro]: c5n.large
Compute instance type [t2.micro]: c5n.18xlarge
Automate VPC creation? (y/n) [n]: y
1. Head node in a public subnet and compute fleet in a private subnet
2. Head node and compute fleet in the same public subnet
Network Configuration [Head node in a public subnet and compute fleet in a private subnet]: 1
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Creating CloudFormation stack...
Do not leave the terminal until the process has finished
Stack Name: parallelclusternetworking-pubpriv-20201230182104
Status: parallelclusternetworking-pubpriv-20201230182104 - CREATE_COMPLETE      
The stack has been created
Configuration file written to /home/cloudshell-user/.parallelcluster/config
You can edit your configuration file or simply run 'pcluster create -c /home/cloudshell-user/.parallelcluster/config cluster-name' to create your cluster
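
If the key pair you created earlier did not show up in the list of allowed EC2 key pairs, it was most likely created in a different region from the one you selected. A quick check with the AWS CLI (assuming eu-west-1, as used in this walkthrough):

# List the key pairs visible in the region chosen during pcluster configure.
aws ec2 describe-key-pairs --region eu-west-1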

Modify the default AWS ParallelCluster Config File

Upon initial configuration, AWS ParallelCluster places a default configuration file, “config”, into the ~/.parallelcluster directory. This file is a plain text file that is human-readable and can be edited with any standard text editor. For common workloads, the configuration process is a one-time event since multiple clusters can be launched using the same settings. For minor changes, you can modify the config file manually, while for larger changes you should revisit the guided “pcluster configure” process.
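
Before editing, it can be worth keeping a copy of the generated file so it is easy to roll back; a minimal sketch (the .orig name is just a suggestion):

cp ~/.parallelcluster/config ~/.parallelcluster/config.orig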

If you are comfortable with editing files in Linux, then please use your editor of choice, e.g. vi.

 vi ~/.parallelcluster/config

Alternatively, you can download the file from CloudShell, edit it on your own computer, and then upload it again.
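
One way to move the file in and out of AWS CloudShell is to round-trip it through Amazon S3 using the AWS CLI that CloudShell already provides; a minimal sketch (the bucket name is a placeholder):

# Copy the config out so it can be downloaded and edited locally (placeholder bucket name).
aws s3 cp ~/.parallelcluster/config s3://your-bucket-name/parallelcluster/config
# ...after editing, copy it back into CloudShell...
aws s3 cp s3://your-bucket-name/parallelcluster/config ~/.parallelcluster/config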

Your config file should look something like the one below (keep your own VPC and subnet IDs rather than the placeholder values listed below).

[aws]
aws_region_name = eu-west-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = cfd_ireland
scheduler = slurm
master_instance_type = c5n.large
base_os = alinux2
vpc_settings = default
queue_settings = compute

[vpc default]
vpc_id = vpc-xxxxxxxxxxx
master_subnet_id = subnet-xxxxxxxxxxx
compute_subnet_id = subnet-yyyyyyyyyy
use_public_ips = false

[queue compute]
enable_efa = false
enable_efa_gdr = false
compute_resource_settings = default

[compute_resource default]
instance_type = c5n.18xlarge

Please change it to look like the one below, which includes extra settings. We will walk through each of the changes after the file, together with a short sketch of how the two queues are used.

[aws]
aws_region_name = eu-west-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = cfd_ireland
scheduler = slurm
master_instance_type = c5n.large
base_os = alinux2
vpc_settings = default
queue_settings = compute,mesh
s3_read_write_resource = *
dcv_settings = default
fsx_settings = fsxshared

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
deployment_type = PERSISTENT_1
storage_type = SSD
per_unit_storage_throughput = 100
daily_automatic_backup_start_time = 00:00
automatic_backup_retention_days = 30

[dcv default]
enable = master

[vpc default]
vpc_id = vpc-xxxxxxxxxxxxx
master_subnet_id = subnet-xxxxxxxxxxx
compute_subnet_id = subnet-yyyyyyyyyy
use_public_ips = false

[queue compute]
enable_efa = true
placement_group = DYNAMIC
disable_hyperthreading = true
compute_type = ondemand
compute_resource_settings = default

[compute_resource default]
instance_type = c5n.18xlarge
min_count = 0
max_count = 10

[queue mesh]
placement_group = DYNAMIC
disable_hyperthreading = true
compute_type = ondemand
enable_efa = false
compute_resource_settings = defaultmesh

[compute_resource defaultmesh]
instance_type = m5.24xlarge
min_count = 0
max_count = 10
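
With the Slurm scheduler, each [queue ...] section is exposed as a Slurm partition of the same name once the cluster is running. As a preview, here is a minimal sketch of how jobs could later be directed to either queue from the head node (the job script names are hypothetical):

sinfo                                   # should list the 'compute' and 'mesh' partitions
sbatch --partition=compute run_case.sh  # solver job on the network-optimized c5n.18xlarge nodes
sbatch --partition=mesh run_mesh.sh     # meshing job on the higher-memory m5.24xlarge nodes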

So let’s go through the changes:

  • master_instance_type = c5n.large for the head node

    • The c5n.large has 2 vCPU and 5.25GB of RAM which is large enough to allow SSH access and compilation of certain codes but with a low enough cost to be left on all the time.
  • placement_group = DYNAMIC

    • Placement groups ensure that compute nodes are located within close physical proximity to make inter-node communication faster for tightly coupled workloads.
  • disable_hyperthreading = true

    • Disabling hyperthreading improves performance for typical CFD workloads.
  • We will add the settings necessary to set up DCV (a high-performance remote display protocol for remote visualization) on the master instance.

    • Create a new section [dcv default] and add the line enable = master
    • Add dcv_settings = default under the [cluster default] section.
  • s3_read_write_resource = *

    • We will add this under the [cluster default] section to allow access to your S3 buckets from your ParallelCluster environment. Outside of this tutorial you would typically restrict access to specific buckets (for example, s3_read_write_resource = arn:aws:s3:::your-bucket-name/*) rather than allowing all of them via the *, but for this tutorial we will allow access to all of your buckets.
  • [compute_resource defaultmesh], [queue mesh] & queue_settings = compute,mesh

    • We’ve added an additional queue for meshing, where we may want an instance with more RAM that does not need to be network optimized. The m5.24xlarge has 384 GB of RAM compared to the 192 GB of the c5n.18xlarge. You could follow the same process to add further queues or change the instance types.
  • fsx_settings = fsxshared & [fsx fsxshared] (one line under the [cluster default] section and its own new section)

    • An additional storage service is Amazon FSx for Lustre, which is suitable when you need high throughput and a parallel file system. For a single user working with smaller files (<10 GB) the EBS volume will typically provide enough performance, but for larger cases and multiple users Amazon FSx for Lustre may be the better option. Let’s look at the settings we picked:

Amazon FSx for Lustre

Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA). These workloads commonly require data to be presented via a fast and scalable file system interface, and typically have datasets stored on long-term data stores like Amazon S3.

You can read more about FSx for Lustre in the AWS documentation, but let’s examine the choices made for this workshop:

storage_capacity = 1200
deployment_type = PERSISTENT_1
storage_type = SSD
per_unit_storage_throughput = 100
daily_automatic_backup_start_time = 00:00
automatic_backup_retention_days = 30
  • storage_capacity = 1200
    • We’ve chosen 1,200 GiB (1.2 TiB), which is the minimum volume size for this FSx for Lustre type. You could pick a much larger value, e.g. 500 TiB or even higher, but for this workshop we can stick with the smallest size.
  • deployment_type = PERSISTENT_1
    • You can read about the different deployment types in the FSx for Lustre documentation, but we’ve chosen the option where data is replicated, so if a file server fails you won’t lose your data.
  • storage_type = SSD
    • We’ve chosen the SSD option because it provides higher throughput per TiB, which will help when saving larger files.
  • per_unit_storage_throughput = 100
    • For this volume type we can pick 50, 100, or 200 MB/s per TiB, and we’ve chosen 100 as the middle ground. You could pick a higher or lower value depending on your needs.
  • daily_automatic_backup_start_time = 00:00
    • With the persistent option we can schedule a daily backup of the file system. These backups can be used to create future file systems, although the automatic backups are deleted if you delete your file system. See the ‘Clean up resources’ section for creating a manual backup that will persist even if you delete the file system.
  • automatic_backup_retention_days = 30
    • We’ve chosen to retain each automatic backup for a maximum of 30 days. You can use AWS Backup to schedule longer-lived backups; see the AWS Backup documentation for details, and see the sketch below for a quick way to list your existing backups from the command line.
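
As a quick way to confirm those backups are being taken, you can list them from CloudShell with the AWS CLI; a minimal sketch (the file system ID is a placeholder):

# List backups for a specific FSx file system (placeholder ID).
aws fsx describe-backups --filters Name=file-system-id,Values=fs-xxxxxxxxxxxxxxxxx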

Finally, it is possible to create an FSx for Lustre file system outside of the ParallelCluster process and simply attach it to your cluster using the line:

fsx_fs_id = fs-xxxx

Where fs-xxxx is the ID taken from the FSx for Lustre console page. If you choose this option, make sure you also add the security group that the file system was created with to the cluster, using the following line under the [vpc] section:

additional_sg = sg-xxx
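
Putting those two lines in context, here is a minimal sketch of the relevant sections when attaching an existing file system (the IDs are placeholders, and the other [fsx] storage parameters are dropped because the file system already exists):

[fsx fsxshared]
shared_dir = /fsx
fsx_fs_id = fs-xxxxxxxxxxxxxxxxx

[vpc default]
vpc_id = vpc-xxxxxxxxxxx
master_subnet_id = subnet-xxxxxxxxxxx
compute_subnet_id = subnet-yyyyyyyyyy
use_public_ips = false
additional_sg = sg-xxxxxxxxxxx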