Life Sciences

HPC Lens for the AWS Well-Architected Framework

Issue link: https://read.uberflip.com/i/1187300

Contents of this Issue

Navigation

Page 33 of 46

Amazon Web Services – HPC Lens AWS Well-Architected Framework Page 30 out intermediate results. They can be used to diagnose an application error or to restart the case as needed, while only losing the work done between the last checkpoint and the job failure. HPCREL 2: Does your application support checkpointing to recover from failures? Checkpointing is particularly useful on Spot Instances, that is, when using highly cost-effective but pre-emptible nodes that can be interrupted at any time. In addition, some applications may benefit from checkpointing and changing the default Spot interruption behavior (e.g., stopping or hibernating the instance rather than terminating it). Lastly, it is important to consider the durability of the storage option when relying on checkpointing for failure management. Failure tolerance can potentially be improved by deploying to multiple Availability Zones. The low-latency requirements of tightly coupled HPC applications require that each individual case reside within a single placement group and Availability Zone. Alternatively, HTC applications do not have such low-latency requirements and can improve failure management with the ability to deploy to several Availability Zones. Consider the tradeoff between the reliability and cost pillars when making this design decision. Duplication of compute and storage infrastructure (for example, a head node and attached storage) incurs additional cost, and there may be data-transfer charges for moving data to an Availability Zone outside of the origin AWS Region. For non-urgent use cases, it may be preferable to only move into another Availability Zone as part of a disaster recovery (DR) event. Another possibility is to do nothing in response to an Availability Zone interruption and simply wait a short time for it to recover. HPCREL 3: How have you planned for failure tolerance in your architecture?

Articles in this issue

view archives of Life Sciences - HPC Lens for the AWS Well-Architected Framework