Genomics Data Transfer, Analytics, and Machine
Learning using AWS Services AWS Whitepaper
Introduction
When running genomics workloads in the Amazon Web Services (AWS) Cloud, how does an organization
manage cost, optimize workload performance, and move fast with control? How does an organization
secure sensitive information? What resources are available to help meet a team's compliance needs? How
does an organization perform analytics using machine learning?
This paper answers these questions by showing how to build a next-generation sequencing (NGS)
platform from instrument to interpretation using AWS services. We'll provide recommendations and
reference architectures for developing the platform including: 1) transferring genomics data to the AWS
Cloud and establishing data access patterns, 2) running secondary analysis workflows, 3) performing
tertiary analysis with data lakes, and 4) performing tertiary analysis using machine learning.
The genomics market is highly competitive, so a development lifecycle that allows you to move
fast with control is critical. Solutions for three of the reference architectures in this paper are provided
in AWS Solutions Implementations. These solutions leverage continuous delivery (CD), allowing you to
adapt each solution to fit your organization's needs.
Note
To access an AWS Solutions Implementation providing an AWS CloudFormation template
to automate the deployment of the secondary analysis solution in the AWS Cloud, see the
Genomics Secondary Analysis Using AWS Step Functions and AWS Batch Implementation Guide.
To access an AWS Solutions Implementation providing an AWS CloudFormation template to
automate the deployment of the tertiary analysis and data lakes solution in the AWS Cloud,
see the Genomics Tertiary Analysis and Data Lake Using AWS Glue and Amazon Athena
Implementation Guide.
To access an AWS Solutions Implementation providing an AWS CloudFormation template to
automate the deployment of the tertiary analysis and machine learning solution in the AWS
Cloud, see the Genomics Tertiary Analysis and Machine Learning using Amazon SageMaker.
A summary of the services used in this platform is shown in Table 1. You can learn about the compliance
resources available to you in Compliance resources (p. 19).
Table 1 – AWS services for data transfer, secondary analysis, and tertiary analysis
Data Transfer
    Data Access Patterns: AWS DataSync, AWS Storage Gateway for files
    Cost Optimization: AWS DataSync, Amazon S3
Secondary Analysis
    Secondary Analysis: AWS Step Functions, AWS Batch
    Monitor & Alert: Amazon CloudWatch
    DevOps: AWS CodeCommit, AWS CodeBuild, AWS CodePipeline
Tertiary Analysis
    Data Lakes: Amazon Athena, AWS Glue
    Machine Learning: Amazon SageMaker
    DevOps: AWS CodeCommit, AWS CodeBuild, AWS CodePipeline