For more information on how AWS can help your organization with Genomics visit us at: aws.amazon.com/health/genomics
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Genomics Data Sources
A technician loads a sample on a sequencer. The sample is
sequenced and written to a landing folder on local on-premises
storage. An AWS DataSync task is set up to sync the
data from the local hot folder to a bucket in Amazon Simple
Storage Service (Amazon S3). Because genomics data is
persisted in files by sequencers, while genomics analysis tools
take files as inputs and write files as outputs, Amazon S3 is a
natural fit for genomics data, data lake analytics, and managing
the data storage lifecycle on AWS.
Phenotypic Data Sources
Research scientists and clinical researchers can upload
annotation and clinical data as zip files to Amazon S3 via AWS
Transfer for SFTP.
Data Transfer
AWS DataSync is used to transfer raw genomics data from
on-premises sequencers. AWS Transfer for SFTP can be used
by research scientists to transfer clinical or annotation data to
Amazon S3 buckets. AWS DataSync makes it easier and more
cost-effective to move large amounts of data online between
on-premises storage and AWS storage services like Amazon S3.
AWS DataSync handles common tasks including scripting copy
jobs, scheduling and monitoring transfers, validating data, and
optimizing network utilization.
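
As an illustration of the transfer step described above, the following is a minimal sketch of the request a scheduled AWS DataSync task might use, assuming the source (on-premises landing folder) and destination (Amazon S3) locations have already been registered. The function names, task name, hourly schedule, and ARNs are hypothetical; the parameter names follow the boto3 `datasync` client's `create_task` call.

```python
def build_datasync_task_request(source_location_arn, destination_location_arn,
                                name="sequencer-landing-to-s3",
                                schedule_expression="rate(1 hour)"):
    """Build keyword arguments for boto3's datasync create_task call.

    The task name and schedule are illustrative defaults, not values
    taken from the reference architecture.
    """
    return {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        "Options": {
            # Verify only the files moved in this run.
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
            # Copy only data that differs between source and destination.
            "TransferMode": "CHANGED",
        },
        "Schedule": {"ScheduleExpression": schedule_expression},
    }


def create_datasync_task(request):
    """Submit the request; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3
    return boto3.client("datasync").create_task(**request)["TaskArn"]
```

The builder is kept separate from the API call so the request can be inspected or reviewed before any task is created.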
Storage and Archival
Optimize storage by writing instrument run data to an Amazon
S3 bucket configured for infrequent access. Identify your
Amazon S3 storage access patterns to optimally configure your
bucket lifecycle policy. Use Amazon S3 analytics storage class
analysis to analyze your storage access patterns and update
your lifecycle policies appropriately. For your analysis, use an
observation period of at least 30 days. Amazon S3 Glacier is a
secure, durable, and extremely low-cost storage service for data
archiving. Amazon S3 Glacier offers multiple retrieval options,
ranging from a few minutes to several hours, so you can match
retrieval speed to your specific needs.
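
A bucket lifecycle policy of the kind described above could be sketched as follows. This is an assumption-laden example, not a recommended policy: the `runs/` prefix, the rule ID, and the 30-day and 90-day thresholds are hypothetical and should be replaced with values derived from your own storage class analysis. The dictionary layout follows the boto3 `s3` client's `put_bucket_lifecycle_configuration` call.

```python
def build_lifecycle_configuration(prefix="runs/", ia_after_days=30,
                                  glacier_after_days=90):
    """Build a LifecycleConfiguration dict for instrument run data.

    All defaults are illustrative assumptions.
    """
    # Amazon S3 does not accept lifecycle transitions to STANDARD_IA
    # earlier than 30 days after object creation.
    if ia_after_days < 30:
        raise ValueError("ia_after_days must be at least 30")
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-instrument-run-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the policy; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )
```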
File-Based Data Access to Amazon S3
Researchers on-premises use existing bioinformatics tools with
data in Amazon S3 via NFS or SMB using AWS Storage Gateway
for Files. AWS Storage Gateway enables on-premises access
to virtually unlimited cloud storage, helping simplify storage
management. Many research organizations use third-party tools,
open-source tools, or their own tools to work with their research
data. These tools usually require file system-based access to
data. AWS Storage Gateway offers SMB- or NFS-based access
to data in Amazon S3, with local caching to optimize for data
access cost and performance.
High-Performance Storage for Compute
Researchers can cloud burst from on-premises, or use data
already in Amazon S3, and use Amazon FSx for Lustre as a
super-fast processing tier to maximize performance across
all compute clusters. Amazon FSx for Lustre provides high-
performance storage that can handle compute-intensive
workloads, which helps speed time to insights in genomics
analyses. This service delivers sub-millisecond latencies, up to
hundreds of gigabytes per second of throughput, and millions of
IOPS, and is available as a fully managed service.
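
To illustrate how a Lustre processing tier can be layered over data already in Amazon S3, here is a minimal sketch of a request for a scratch file system linked to a bucket. The bucket name, prefixes, subnet ID, and the choice of the `SCRATCH_2` deployment type are assumptions made for the example; the parameter names follow the boto3 `fsx` client's `create_file_system` call.

```python
def build_lustre_request(bucket, import_prefix, export_prefix, subnet_id,
                         capacity_gib=1200):
    """Build keyword arguments for boto3's fsx create_file_system call.

    Describes a scratch Lustre file system that imports from, and can
    export results back to, the same S3 bucket.
    """
    # SCRATCH_2 accepts 1200 GiB or a multiple of 2400 GiB.
    if capacity_gib != 1200 and (capacity_gib <= 0 or capacity_gib % 2400 != 0):
        raise ValueError("capacity_gib must be 1200 or a multiple of 2400")
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            "DeploymentType": "SCRATCH_2",  # assumed: short-lived burst workloads
            "ImportPath": f"s3://{bucket}/{import_prefix}",
            "ExportPath": f"s3://{bucket}/{export_prefix}",
        },
    }


def create_lustre_file_system(request):
    """Submit the request; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3
    response = boto3.client("fsx").create_file_system(**request)
    return response["FileSystem"]["FileSystemId"]
```

A scratch deployment is used here because the text describes a temporary processing tier; a persistent deployment type may suit longer-running workloads better.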