Whitepaper: Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

Issue link: https://read.uberflip.com/i/1358110

Appendix K: Optimizing the performance of data lake queries

For the solution implementation using tertiary analysis and data lakes, we optimized Amazon Athena performance based on the recommendations in Top 10 Performance Tuning Tips for Amazon Athena. We partitioned the variant data on the sample ID field, sorted each sample's variants by location, and converted the data to Apache Parquet format. Partitioning in this way allows cohorts to be built efficiently from sample IDs, and new samples can be ingested into the data lake without recomputing the existing dataset. The annotation data sources are also written in Apache Parquet format to optimize query performance.

If you need to query by location (chromosome, position, reference, alternate, abbreviated CPRA), either create a sample ID to location ID lookup table in a database like Amazon DynamoDB or create a duplicate of the data partitioned on location.
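The layout described above can be illustrated with a small, standard-library-only Python sketch. It is not the solution's actual ingestion code: the record fields (`sample_id`, `chrom`, `pos`, `ref`, `alt`), the `variants/` prefix, the `sample_id=<value>/` folder naming, and the colon-separated CPRA key are all assumptions chosen for illustration. The sketch writes no Parquet files and calls no AWS service. It only shows how records might be grouped per sample, ordered by location, and indexed by CPRA in an in-memory dictionary that stands in for a lookup table such as one kept in Amazon DynamoDB.

```python
from collections import defaultdict

# Hypothetical variant records; the field names are illustrative assumptions.
VARIANTS = [
    {"sample_id": "S2", "chrom": "X", "pos": 500, "ref": "A", "alt": "G"},
    {"sample_id": "S1", "chrom": "2", "pos": 300, "ref": "C", "alt": "T"},
    {"sample_id": "S1", "chrom": "1", "pos": 900, "ref": "G", "alt": "A"},
    {"sample_id": "S2", "chrom": "1", "pos": 900, "ref": "G", "alt": "A"},
    {"sample_id": "S1", "chrom": "1", "pos": 100, "ref": "T", "alt": "C"},
]


def chrom_rank(chrom):
    """Sort numeric chromosomes numerically, then named ones alphabetically."""
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)


def partition_by_sample(variants):
    """Group records by sample ID and sort each group by (chromosome, position)."""
    partitions = defaultdict(list)
    for v in variants:
        partitions[v["sample_id"]].append(v)
    for rows in partitions.values():
        rows.sort(key=lambda v: (chrom_rank(v["chrom"]), v["pos"]))
    return dict(partitions)


def partition_prefix(sample_id, root="variants"):
    """Build an assumed key=value style folder prefix for one sample's data."""
    return f"{root}/sample_id={sample_id}/"


def cpra_key(v):
    """Assumed CPRA key format: chromosome:position:reference:alternate."""
    return f'{v["chrom"]}:{v["pos"]}:{v["ref"]}:{v["alt"]}'


def build_location_index(variants):
    """Map each CPRA key to the sorted sample IDs that carry that variant."""
    index = defaultdict(set)
    for v in variants:
        index[cpra_key(v)].add(v["sample_id"])
    return {key: sorted(samples) for key, samples in index.items()}


partitions = partition_by_sample(VARIANTS)
location_index = build_location_index(VARIANTS)

for sample_id, rows in sorted(partitions.items()):
    print(partition_prefix(sample_id), [cpra_key(v) for v in rows])
```

Under these assumptions, adding a new sample only creates a new per-sample group and leaves existing groups untouched, which mirrors the ingestion property described above. A query by location would first consult `location_index` to find the relevant sample IDs and then read only those partitions.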
