Life Sciences

Genomics Data Transfer, Analytics, and Machine Learning using AWS Services
AWS Whitepaper

Performing tertiary analysis with data lakes using AWS Glue and Amazon Athena

Genomic tertiary analysis can be performed on data in an Amazon S3 data lake using AWS Glue, Amazon Athena, and Amazon SageMaker Jupyter notebooks.

Recommendations

When building and operating a genomics data lake in AWS, consider the following recommendations to optimize data lake operations, performance, and cost.

Use AWS Glue extract, transform, and load (ETL) jobs, crawlers, and triggers to build your data workflows. AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics. You can create Spark jobs to transform data, Python jobs to perform PHI and virus scans, crawlers to catalog the data, and workflows to orchestrate data ingestion, all within the same service.

Use AWS Glue Python jobs to integrate with external services. Use Python shell jobs in AWS Glue to execute workflow tasks that require callouts to external services, such as running a virus scan or a protected health information (PHI) scan.

Use AWS Glue Spark ETL to transform data. AWS Glue Spark jobs make it easy to run a complex data processing job across a cluster of instances using Apache Spark.

Promote data across S3 buckets; multiple accounts are not necessary. Use different Amazon S3 buckets to implement different access controls and auditing mechanisms as data is promoted through the stages of your data ingestion pipeline, such as quarantine, pre-curated, and curated. Segregating data across accounts is not necessary.

For interactive queries, use Amazon Athena or Amazon Redshift. Query data residing in Amazon S3 using either Amazon Redshift or Athena. Amazon Redshift efficiently queries and retrieves structured and semi-structured data from files in Amazon S3 by leveraging Redshift Spectrum Request Accelerator to improve performance. Athena is a serverless engine for querying data directly in Amazon S3. Users who already have Amazon Redshift can extend their analytical queries to Amazon S3 by pointing to their AWS Glue Data Catalog. Users who are looking for a fast, serverless analytics query engine for data on Amazon S3 can use Athena. Many customers use both services to meet diverse use cases.

For queries that require low latency, such as dashboards, use Amazon Redshift. Amazon Redshift is a large-scale data warehouse solution well suited to big data and to low-latency queries, such as dashboard queries.

Use partitions and the AWS Glue Data Catalog for data changes instead of creating new databases. Use table partitions in an AWS Glue Data Catalog to accommodate multiple versions of a dataset with minimal overhead and operational burden.

Design data access around least privilege and provide data governance using AWS Lake Formation. Limit data lake users to SELECT permissions only. Service accounts used for ETL may have create and update permissions for tables.
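
The recommendation to keep dataset versions as table partitions, and to query them interactively with Athena, can be sketched as follows. This is a minimal illustration, not the whitepaper's implementation: the table name `variants`, the partition column `dataset_version`, the database and bucket names, and all helper function names are hypothetical. It assumes the table already exists in the AWS Glue Data Catalog and is partitioned by `dataset_version`. The SQL builders are plain Python; only `run_in_athena` needs boto3 and AWS credentials.

```python
import re

# Conservative patterns so caller-supplied values cannot break out of the SQL text.
_TABLE = re.compile(r"^[a-z0-9_]+$")
_VERSION = re.compile(r"^[A-Za-z0-9_.\-]+$")


def _checked(value: str, pattern: "re.Pattern[str]") -> str:
    """Return value unchanged, or raise ValueError if it does not match pattern."""
    if not pattern.match(value):
        raise ValueError(f"unsafe value: {value!r}")
    return value


def add_partition_sql(table: str, table_root: str, version: str) -> str:
    """Build Athena DDL that registers one dataset version as a partition.

    Assumption: each version lives under <table_root>/dataset_version=<version>/.
    """
    table = _checked(table, _TABLE)
    version = _checked(version, _VERSION)
    if not table_root.startswith("s3://") or "'" in table_root:
        raise ValueError(f"unsafe table root: {table_root!r}")
    location = f"{table_root.rstrip('/')}/dataset_version={version}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dataset_version = '{version}') "
        f"LOCATION '{location}'"
    )


def select_version_sql(table: str, version: str, limit: int = 10) -> str:
    """Build a query that reads only the requested dataset version."""
    table = _checked(table, _TABLE)
    version = _checked(version, _VERSION)
    return (
        f"SELECT * FROM {table} "
        f"WHERE dataset_version = '{version}' LIMIT {int(limit)}"
    )


def run_in_athena(sql: str, database: str, output_location: str) -> str:
    """Submit a statement to Athena and return its query execution ID."""
    import boto3  # imported lazily so the builders above run without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names; printing the SQL requires no AWS account.
    print(add_partition_sql("variants", "s3://example-curated/variants/", "2024_01"))
    print(select_version_sql("variants", "2024_01"))
```

With this layout, publishing a new version of the dataset amounts to writing files under a new prefix and registering one more partition, while existing queries that filter on an older `dataset_version` value keep working. An AWS Glue crawler could register the partitions instead of the explicit `ALTER TABLE` statement.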
