Life Sciences

Whitepaper: Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

Issue link: https://read.uberflip.com/i/1358110

Use AWS DataSync to transfer data to Amazon S3: AWS DataSync makes it simple to transfer large amounts of data to Amazon S3 with minimal IT operational burden and optimal data transfer performance. DataSync eliminates or handles common tasks including scripting copy jobs, scheduling and monitoring transfers, validating data, and optimizing network utilization.

If file system-based access to data in Amazon S3 is required, use Amazon FSx or AWS Storage Gateway: Many research organizations use third-party tools, open-source tools, or their own tools to work with their research data. These tools often use file system-based access to data. Consider creating an Amazon Elastic Compute Cloud (Amazon EC2) instance to perform analytics on data in Amazon S3. If your applications require file-based access to Amazon S3, use Amazon FSx to provide a file system that can be mounted on your Amazon EC2 instance. If your applications must run on-premises and require file-based access to Amazon S3, use File Gateway.

Reference architecture

Transferring your organization's genomics data to Amazon S3 using AWS DataSync starts with setting up your sequencing instruments to write data to a common folder on your on-premises storage system. Writing first to on-premises storage enables you to take advantage of the high availability (HA) built into your storage system and stage your data for processing before transferring to the cloud.

Figure 1: Process workflow using a run completion tracker script with AWS DataSync

Figure 1 shows the process workflow using a run completion tracker script with AWS DataSync:

1. A technician loads a genomic sample on a sequencer.

2. The genomic sample is sequenced and written to a landing folder that is stored in a local on-premises storage system.

3. An AWS DataSync sync task is preconfigured to sync the data from the parent directory of the landing folder on on-premises storage to an Amazon S3 bucket.

4. A run completion tracker script starts a DataSync task run to transfer the run data to an Amazon S3 bucket. An inclusion filter can be used when starting a DataSync task run to include only a given run folder. Exclusion filters can be used to exclude files from data transfer. In addition, consider using a zero-byte file as a flag when uploading the data. Technicians can indicate that a run has passed a manual Quality Control (QC) check by placing an empty file in the run's data folder. The run completion tracker script then starts a task run only if this flag file is present.

5. On-premises researchers use existing bioinformatics tools with data in Amazon S3 via NFS or SMB using the File Gateway solution from AWS Storage Gateway.
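The run completion tracker in step 4 could be sketched roughly as follows. This is a minimal illustration, not the whitepaper's script: the flag file name QC_PASSED, the folder name landing, and the helper names ready_runs, build_request, and start_ready_runs are all assumptions made for the example. The only AWS call assumed is the DataSync StartTaskExecution operation (start_task_execution in boto3), with its Includes and Excludes filter parameters. The client is passed in so the logic can be exercised without AWS credentials.

```python
from pathlib import Path

FLAG_NAME = "QC_PASSED"  # hypothetical name for the zero-byte QC flag file


def ready_runs(source_root, landing_name="landing", flag_name=FLAG_NAME):
    """Return run folders in the landing folder that contain the QC flag file."""
    landing = Path(source_root) / landing_name
    return sorted(
        d for d in landing.iterdir()
        if d.is_dir() and (d / flag_name).is_file()
    )


def build_request(task_arn, source_root, run_dir, exclude_patterns=()):
    """Build StartTaskExecution arguments that transfer only one run folder."""
    # Filter patterns are paths relative to the root of the task's source location.
    include = "/" + Path(run_dir).relative_to(source_root).as_posix()
    request = {
        "TaskArn": task_arn,
        "Includes": [{"FilterType": "SIMPLE_PATTERN", "Value": include}],
    }
    if exclude_patterns:
        # Multiple patterns go into a single filter value, separated by "|".
        request["Excludes"] = [
            {"FilterType": "SIMPLE_PATTERN", "Value": "|".join(exclude_patterns)}
        ]
    return request


def start_ready_runs(client, task_arn, source_root, exclude_patterns=()):
    """Start one task run per QC-approved run folder; return the execution ARNs."""
    arns = []
    for run_dir in ready_runs(source_root):
        request = build_request(task_arn, source_root, run_dir, exclude_patterns)
        response = client.start_task_execution(**request)
        arns.append(response["TaskExecutionArn"])
    return arns
```

In practice the client would be created with boto3.client("datasync") and the task ARN would be that of the preconfigured sync task from step 3. The sketch does not record which runs have already been transferred, so a real tracker would also need to keep that state to avoid starting the same run twice.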
