Whitepaper: Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

Reference architecture

Figure 2: Secondary analysis reference architecture using AWS Step Functions and AWS Batch

Figure 2 shows the reference architecture for the secondary analysis workflow using AWS Step Functions and AWS Batch.

1. A CloudWatch event triggers an AWS Lambda function to start an AWS Step Functions secondary analysis state machine.
2. Step Functions submits jobs to AWS Batch to run in a Spot Instance compute environment. The Docker images for the tools are stored in private Amazon Elastic Container Registry (Amazon ECR) repositories in the customer's account. The job output files, including the Variant Call Format (VCF) and Binary Alignment Map (BAM) files, are written to an Amazon S3 bucket.

AWS Step Functions manages the execution of the AWS Batch jobs for the alignment, variant calling, QC, annotation, and custom processing steps.

Running secondary analysis with AWS Step Functions and AWS Batch starts by triggering a Step Functions state machine, for example with the AWS Command Line Interface (AWS CLI) or an AWS API client. To fully automate the process, you can set up an Amazon CloudWatch rule that runs the state machine when genomic sample FASTQ files are put in an Amazon S3 bucket. The rule can start the state machine execution directly, or the event can be put in an Amazon Simple Queue Service (Amazon SQS) queue that is polled by an AWS Lambda function, which then starts the state machine execution. Using a Lambda function to start the state machine makes it easy to add quality checks or routing logic, for example, to check for missing files or to route a sample to a particular secondary analysis workflow (minimal sketches of these steps follow this section).

You can pass parameters when running a Step Functions state machine and use them for each job in your secondary analysis workflow. Use a naming convention for reading and writing objects in Amazon S3 so that you can compute, up front, the input and output paths for each tool when you start the state machine.

Once Step Functions submits a job to an AWS Batch queue, AWS Batch schedules the job to run on an instance in a compute environment associated with that queue. Compute environments are used in the order they are listed on the queue. The job is scheduled on an instance that matches the resource requirements specified in the job definition, such as the vCPUs and memory required. Using EC2 Spot Instances and allowing AWS Batch to choose the optimal instance type can help optimize compute costs by increasing the pool of Spot Instance types to choose from. To learn more about optimizing secondary analysis compute cost, see Optimizing secondary analysis compute cost (p. 23).
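As noted above, the state machine can be started from an AWS API client and given parameters that every downstream job reuses. The following is a minimal sketch using the Python SDK (boto3); the state machine ARN, bucket names, sample ID, and input keys are hypothetical placeholders illustrating one possible S3 path convention, not values defined by this whitepaper.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine ARN -- substitute your own.
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:111122223333:stateMachine:SecondaryAnalysis"
)

# Compute the input and output S3 paths up front from a simple naming convention,
# so each AWS Batch job in the workflow can derive its own paths from these parameters.
sample_id = "sample-001"
params = {
    "sample_id": sample_id,
    "fastq_1": f"s3://example-genomics-input/{sample_id}/{sample_id}_R1.fastq.gz",
    "fastq_2": f"s3://example-genomics-input/{sample_id}/{sample_id}_R2.fastq.gz",
    "output_prefix": f"s3://example-genomics-results/{sample_id}/",
}

# Start one execution of the secondary analysis state machine with these parameters.
response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    name=f"secondary-analysis-{sample_id}",
    input=json.dumps(params),
)
print(response["executionArn"])
```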
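The next sketch illustrates the automated path: a Lambda function polls an SQS queue that receives the S3 "object created" events, applies a simple quality check (both FASTQ mates present), and then starts the state machine. The file-naming pattern, environment variable, and bucket layout are assumptions for illustration only.

```python
import json
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Hypothetical environment variable holding the state machine ARN.
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]


def handler(event, context):
    """Triggered by SQS messages that wrap S3 object-created events for FASTQ uploads."""
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]

            # Only react to the R1 file so each sample starts exactly one execution.
            if not key.endswith("_R1.fastq.gz"):
                continue

            # Quality check: make sure the matching R2 file has also been uploaded.
            mate_key = key.replace("_R1.fastq.gz", "_R2.fastq.gz")
            try:
                s3.head_object(Bucket=bucket, Key=mate_key)
            except ClientError:
                print(f"Missing mate file {mate_key}; skipping {key} for now")
                continue

            sample_id = key.split("/")[-1].replace("_R1.fastq.gz", "")
            params = {
                "sample_id": sample_id,
                "fastq_1": f"s3://{bucket}/{key}",
                "fastq_2": f"s3://{bucket}/{mate_key}",
                "output_prefix": f"s3://{bucket}/results/{sample_id}/",
            }
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                name=f"secondary-analysis-{sample_id}",
                input=json.dumps(params),
            )
```

Routing logic could be added at the same point, for example by choosing a different state machine ARN based on the key prefix or sample metadata.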
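Finally, the Spot Instance compute environment and the ordered job queue described above can be created with the AWS Batch API. This sketch assumes placeholder subnet, security group, and IAM role values; in the reference architecture the jobs themselves are submitted by Step Functions, so only the environment and queue setup is shown here.

```python
import boto3

batch = boto3.client("batch")

# Managed Spot compute environment; instanceTypes=["optimal"] lets AWS Batch pick
# from a broad pool of instance types, which helps lower Spot compute cost.
batch.create_compute_environment(
    computeEnvironmentName="genomics-spot-ce",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-EXAMPLE"],                      # placeholder
        "securityGroupIds": ["sg-EXAMPLE"],                 # placeholder
        "instanceRole": "ecsInstanceRole",                  # placeholder
    },
)

# Job queue; compute environments are tried in the order listed here, so a Spot
# environment can be listed first with an On-Demand environment as a fallback.
batch.create_job_queue(
    jobQueueName="genomics-queue",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "genomics-spot-ce"},
    ],
)
```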
