Life Sciences

HPC Lens for the AWS Well-Architected Framework

Issue link: https://read.uberflip.com/i/1187300

Amazon Web Services – HPC Lens AWS Well-Architected Framework

Best Practices

Foundations

The cloud is designed to be essentially limitless, so it is the responsibility of AWS to satisfy the requirement for sufficient networking and compute capacity. AWS sets service limits (an upper limit on the number of each resource your team can request) to protect you from accidentally over-provisioning resources.

HPC applications often require a large number of compute instances simultaneously, so the ability to scale horizontally is highly desirable for HPC workloads. However, scaling out may require an increase to the AWS service limits before a large workload is deployed, whether to one large cluster or to many smaller clusters all at once.

HPCREL 1: How do you manage AWS service limits for your accounts?

Service limits often need to be increased from the default values to handle the requirements of a large deployment. You can contact AWS Support to request an increase.

Change Management

Being aware of how change affects a system allows you to plan proactively. Monitoring allows you to quickly identify trends that could lead to capacity issues or SLA breaches. In traditional environments, change-control processes are often manual and must be carefully coordinated with auditing to effectively control who makes changes and when they are made.

Failure Management

In any system of reasonable complexity, failures are expected to occur, so you need to know how to detect them, respond to them, and prevent them from happening again. Failure scenarios can include the failure of a cluster to start up or the failure of a specific workload.

Failure tolerance can be improved in multiple ways. For long-running cases, incorporating regular checkpoints in your code allows you to continue from a partial state in the event of a failure. Checkpointing is a common feature of application-level failure management already built into many HPC applications. The most common approach is for the running application to periodically write
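
As an illustration of the checkpoint-and-resume idea described above, the following is a minimal Python sketch, not a method prescribed by the HPC Lens. It assumes the application state fits in a small JSON file on a file system that survives the failure. The names `save_checkpoint`, `load_checkpoint`, and `run`, and the `fail_at` parameter used to simulate a crash, are hypothetical and chosen only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches disk before the rename
    os.replace(tmp_path, path)  # a reader sees either the old or the new checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    fail_at is a hypothetical hook that simulates a crash at a given step.
    """
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure repeats only the work done since the last checkpoint. The checkpoint interval trades the cost of writing state against the amount of work that must be redone after a failure.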
