Whitepaper: Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

Transferring genomics data to the Cloud and establishing data access patterns using AWS DataSync and AWS Storage Gateway for files

Transferring genomics data to the AWS Cloud requires preparation to manage data transfer and storage cost, optimize data transfer performance, and manage the data lifecycle.

Recommendations

When you are ready to transfer your organization's genomics data to the AWS Cloud, consider the following recommendations to help optimize the data transfer process.

Use Amazon Simple Storage Service (Amazon S3) for genomics data storage—Sequencers persist genomics data in files, and genomics analysis tools take files as inputs and write files as outputs. This makes Amazon S3 a natural fit for storing genomics data, performing data lake analytics, and managing the data lifecycle.

Use the Amazon S3 Standard-Infrequent Access storage class when transferring genomics data to Amazon S3—Typically, genomics Binary Base Call (BCL) files and Binary Alignment Map (BAM) files are accessed infrequently, perhaps a few times in a month. Storing these files in the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class lowers monthly cost compared to the S3 Standard storage class. Keeping these file types in S3 Standard for 30 days before moving them to S3 Standard-IA is more expensive than moving them to S3 Standard-IA immediately. However, if your organization accesses BCL and BAM files frequently, store those files in S3 Standard. (See the upload sketch following this list.)

Manage the genomics data lifecycle by archiving to a low-cost storage option such as Amazon S3 Glacier Deep Archive—Typically, genomics data is written to Amazon S3 using the S3 Standard-IA storage class and later transitioned to a lower-cost option such as Amazon S3 Glacier Deep Archive for long-term storage. Even if you restore a sizeable amount of data from archival storage to infrequent access in a given month, archiving the data still provides significant cost savings. Perform a storage class analysis and compute your expected cost before making any changes to your lifecycle policies in Amazon S3. (See the lifecycle sketch following this list.) You can learn more about archiving genomics data in Optimizing storage cost and data lifecycle management (p. 22).

Stage genomics data on-premises before uploading it to Amazon S3—To keep genomics sequencers running 24/7, sequencer output files such as BCL files are written to on-premises storage first and then uploaded to the cloud. If there is a network outage, the sequencers can continue to run for as long as local storage is available. Verify that you have enough local storage to meet your organization's disaster recovery plans, including the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Data can always be written to external on-premises storage before being uploaded to the cloud. (See the DataSync sketch following this list.)

Filter genomics sequencing data before uploading it to the cloud—Consider eliminating log and thumbnail files that are not used for cloud-based analytics to minimize transfer cost, transfer time, and storage cost for a given instrument run. (See the filtering sketch following this list.)
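
As an illustration of the S3 Standard-IA recommendation, the following is a minimal boto3 sketch that uploads a BAM file directly into the S3 Standard-IA storage class, so the object never accrues 30 days of S3 Standard charges first. The bucket name and key are placeholders, not values from this whitepaper.

import boto3

s3 = boto3.client("s3")

# Upload directly into S3 Standard-IA so the object never spends
# 30 days in S3 Standard first (bucket and key are placeholders).
s3.upload_file(
    Filename="sample001.bam",
    Bucket="example-genomics-bucket",
    Key="runs/run-2021-05-01/sample001.bam",
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)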
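
To sketch the archiving recommendation, the lifecycle rule below transitions objects under a run prefix to S3 Glacier Deep Archive after 90 days. The bucket name, prefix, and 90-day threshold are illustrative assumptions; derive your own threshold from the storage class analysis described above.

import boto3

s3 = boto3.client("s3")

# Lifecycle rule: transition objects under the runs/ prefix to
# S3 Glacier Deep Archive after 90 days (all values are illustrative).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-genomics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-run-data",
                "Filter": {"Prefix": "runs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)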
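
The staging recommendation pairs naturally with AWS DataSync for moving data from on-premises storage to Amazon S3. The following boto3 sketch assumes you have already created a source location (for example, an NFS share on the sequencer's staging storage) and an S3 destination location; the location ARNs shown are placeholders.

import boto3

datasync = boto3.client("datasync")

# Create a transfer task between two pre-existing DataSync locations
# (the ARNs are placeholders; create the locations beforehand).
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-source",
    DestinationLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-dest",
    Name="sequencer-run-upload",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)

# Start one execution of the task; DataSync handles incremental
# transfer and integrity verification.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])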
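
The filtering recommendation can be implemented as a simple exclusion pass over the instrument run folder before upload. The sketch below assumes Illumina-style subfolder names ("Logs", "Thumbnail_Images"); adjust the exclusion set to match your instrument's output layout.

import os

# Directories that typically hold log and thumbnail files not needed
# for cloud-based analytics (names are assumptions; adjust per instrument).
EXCLUDED_DIRS = {"Logs", "Thumbnail_Images"}

def files_to_upload(run_folder):
    """Yield file paths under run_folder, skipping excluded directories."""
    for root, dirs, files in os.walk(run_folder):
        # Prune excluded directories in place so os.walk skips them.
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for name in files:
            yield os.path.join(root, name)

for path in files_to_upload("/mnt/sequencer/run-2021-05-01"):
    print(path)  # hand these paths off to your upload step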
