Whitepaper: Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

- Use a tenant/year/month/day partitioning scheme in Amazon S3 to support multiple data providers. Data producers deliver datasets on a recurring basis that need to be ingested, processed, and made available for query. Partitioning the incoming datasets by tenant/year/month/day allows you to maintain versions of datasets, lifecycle the data over time, and re-ingest older datasets if necessary.

- Use data lifecycle management in Amazon S3 to lifecycle data and restore it if needed. Manage your data lake objects for cost-effective storage throughout their lifecycle. Archive data when it is no longer being used, and consider Amazon S3 Intelligent-Tiering if access patterns are unpredictable.

- Convert your datasets to Parquet format to optimize query performance. Parquet is a compressed, columnar data format optimized for big data queries. Analytics services that support columnar formats only need to read the columns a query accesses, which greatly reduces I/O and speeds up data processing.

- Treat configuration as code for jobs and workflows. Fully automate the building and deployment of ETL jobs and workflows to more easily move your genomics data into production. Automation provides control and a repeatable development process for handling your genomics data.

Reference architecture

Figure 3: Tertiary analysis with data lakes reference architecture

1. A CloudWatch Events rule triggers an ingestion workflow that loads variant or annotation files into the Amazon S3 genomics data lake.

2. A bioinformatician uses a Jupyter notebook to query the data in the data lake using Amazon Athena with the PyAthena Python driver. Queries can also be performed using the Amazon Athena console, the AWS CLI, or an API.

Processing and ingesting data into your Amazon S3 genomics data lake starts with triggering the data ingestion workflows to run in AWS Glue. Workflow runs are triggered through the AWS Glue console, the AWS CLI, or an API call, for example from within a Jupyter notebook.

You can use an AWS Glue ETL job to transform annotation datasets such as ClinVar from TSV format to Parquet format and write the Parquet files to a data lake bucket. You can convert VCF to Parquet in an Apache Spark ETL job by using open-source frameworks such as Hail to read the VCF into a Spark DataFrame and then write the data as Parquet to your data lake bucket.

Use AWS Glue crawlers to crawl the data lake dataset files, infer their schema, and create or update a table in your AWS Glue Data Catalog, making the dataset available for query with Amazon Redshift or Amazon Athena. To run AWS Glue jobs and crawlers in a workflow, use AWS Glue triggers to stitch the workflow together, then start the trigger.

To run queries in Amazon Athena, use the Amazon Athena console, the AWS CLI, or an API. You can also run Athena queries from within Jupyter notebooks using the PyAthena Python package, which can be installed using pip.

Optimizing data lake query cost and performance is an important consideration when working with large amounts of genomics data. To learn more about optimizing data lake query performance with Amazon
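The partitioning and lifecycle practices listed above can be sketched with a few boto3 calls. The following is a minimal illustration, not part of the whitepaper: the bucket name, tenant value, the top-level variants/ prefix, and the Hive-style key=value naming are all assumptions chosen for the example.

```python
import boto3
from datetime import date

s3 = boto3.client("s3")
bucket = "genomics-data-lake"  # hypothetical bucket name
today = date.today()

# Land an incoming annotation file under a tenant/year/month/day prefix
key = (
    f"variants/tenant=labA/year={today.year}/"
    f"month={today.month:02d}/day={today.day:02d}/annotations.tsv"
)
s3.upload_file("annotations.tsv", bucket, key)

# Lifecycle rule: move objects with unpredictable access patterns to
# S3 Intelligent-Tiering after 30 days, expire old versions after a year
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-genomics-data",
                "Filter": {"Prefix": "variants/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
            }
        ]
    },
)
```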
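As described above, an AWS Glue Spark ETL job can convert an annotation dataset such as ClinVar from TSV to Parquet. The sketch below uses the standard Glue PySpark job setup with plain Spark DataFrame reads and writes; the source_path and target_path job parameters are placeholders, not names from the whitepaper.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Standard Glue PySpark job setup; source_path and target_path are
# custom job parameters supplied when the job is started
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the tab-separated annotation file (for example, a ClinVar export)
df = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv(args["source_path"])
)

# Write compressed, columnar Parquet to the data lake bucket
df.write.mode("overwrite").parquet(args["target_path"])
```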
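Workflow runs can be started from the AWS Glue console, the AWS CLI, or an API call such as the boto3 snippet below, which could run inside a Jupyter notebook. The workflow name is a placeholder.

```python
import boto3

glue = boto3.client("glue")

# Start the ingestion workflow (name is hypothetical) and check its status
run_id = glue.start_workflow_run(Name="genomics-vcf-ingestion")["RunId"]
run = glue.get_workflow_run(Name="genomics-vcf-ingestion", RunId=run_id)
print(run["Run"]["Status"])  # e.g. RUNNING, COMPLETED
```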
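The VCF-to-Parquet conversion mentioned above can be outlined with Hail running on the Spark cluster. This is only a sketch under stated assumptions: Hail is installed on the cluster, the input is a block-gzipped GRCh38 VCF, the S3 paths are placeholders, and only the variant-level rows (not per-sample entries) are written out.

```python
import hail as hl

# Initialize Hail on the existing Spark cluster
hl.init()

# Read the block-gzipped VCF into a Hail MatrixTable
mt = hl.import_vcf(
    "s3://<raw-bucket>/tenant=labA/year=2021/month=01/day=15/sample.vcf.bgz",
    reference_genome="GRCh38",
)

# Convert the variant-level rows to a Spark DataFrame and write Parquet
variants_df = mt.rows().to_spark()
variants_df.write.mode("overwrite").parquet(
    "s3://<data-lake-bucket>/variants/tenant=labA/year=2021/month=01/day=15/"
)
```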
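To make the Parquet dataset queryable, the section above describes crawling it with an AWS Glue crawler and using Glue triggers to stitch jobs and crawlers into a workflow. A hedged boto3 sketch follows; the crawler, database, workflow, job names, and the IAM role ARN are all placeholders.

```python
import boto3

glue = boto3.client("glue")

# Crawl the Parquet files and create/update a table in the Glue Data Catalog
glue.create_crawler(
    Name="genomics-variants-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="genomics_data_lake",
    Targets={"S3Targets": [{"Path": "s3://<data-lake-bucket>/variants/"}]},
)

# Run the crawler after the Parquet ETL job succeeds by wiring both
# into the same Glue workflow with a conditional trigger
glue.create_trigger(
    Name="crawl-after-etl",
    WorkflowName="genomics-vcf-ingestion",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "vcf-to-parquet",  # placeholder job name
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"CrawlerName": "genomics-variants-crawler"}],
    StartOnCreation=True,
)
```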
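Finally, as noted above, Athena queries can be run from a Jupyter notebook with the PyAthena package (installed with pip install pyathena). A minimal sketch, with the results bucket, Region, database, and table names as placeholders:

```python
import pandas as pd
from pyathena import connect

# Connect to Athena; query results are staged in the given S3 location
conn = connect(
    s3_staging_dir="s3://<athena-results-bucket>/queries/",
    region_name="us-east-1",
)

# Query the crawled variant table registered in the Glue Data Catalog
df = pd.read_sql("SELECT * FROM genomics_data_lake.variants LIMIT 10", conn)
print(df.head())
```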
