Genomics Data Transfer, Analytics, and Machine
Learning using AWS Services AWS Whitepaper
Appendix K: Optimizing the performance of data lake queries
For the solution implementation using tertiary analysis and data lakes, we optimized Amazon Athena
performance based on the recommendations in Top 10 Performance Tuning Tips for Amazon Athena.
We partitioned the variant data on the sample ID field, sorted each sample's variants by location, and
converted the data to Apache Parquet format. Partitioning in this way lets cohorts be built efficiently
from sample IDs, because a query that filters on sample ID reads only the matching partitions. New samples
can be ingested efficiently into the data lake without recomputing the existing dataset. The annotation
data sources are also written in Apache Parquet format to optimize performance. If you need to query by
location (chromosome, position, reference, alternate, abbreviated CPRA), either create a sample ID to
location ID lookup table in a database like Amazon DynamoDB or create a duplicate of the data partitioned
on location.
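
The layout described above can be illustrated with a minimal Python sketch. It is not the solution's actual code. The bucket URI, the table name, the column names (sample_id, chromosome, position, reference, alternate), and every function name are assumptions made for illustration. The in-memory index only stands in for the lookup table that the text suggests keeping in a database such as Amazon DynamoDB, and it reads that table as a map from a CPRA location to the sample IDs that carry it.

```python
from collections import defaultdict


def partition_path(base_uri, sample_id):
    """Return a Hive-style partition prefix for one sample (hypothetical layout)."""
    return f"{base_uri.rstrip('/')}/sample_id={sample_id}/"


def sort_by_location(variants):
    """Sort one sample's variants by (chromosome, position).

    Chromosome names are compared as plain strings here, so "chr10"
    sorts before "chr2"; a real pipeline may want a karyotype order.
    """
    return sorted(variants, key=lambda v: (v["chromosome"], v["position"]))


def cohort_query(table, sample_ids):
    """Build a SQL string that filters on the partition column.

    Filtering on sample_id lets the query engine skip every partition
    that is not part of the cohort. Single quotes are doubled as a
    basic precaution; this is a sketch, not hardened query building.
    """
    quoted = ", ".join(
        "'" + s.replace("'", "''") + "'" for s in sorted(sample_ids)
    )
    return f"SELECT * FROM {table} WHERE sample_id IN ({quoted})"


def build_location_index(rows):
    """Map each CPRA key to the set of sample IDs that contain it.

    Assumption: this dictionary stands in for an external lookup table.
    """
    index = defaultdict(set)
    for row in rows:
        key = (
            row["chromosome"],
            row["position"],
            row["reference"],
            row["alternate"],
        )
        index[key].add(row["sample_id"])
    return index
```

With such an index, a query by location first resolves the CPRA key to a set of sample IDs and then passes those IDs to cohort_query, so the query still reads only the relevant sample partitions. The alternative named in the text, a second copy of the data partitioned on location, trades extra storage for not needing the lookup step.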