Life Sciences

Data Lakes: How to glean new insights from existing data


than compute capacity, the use of Amazon EMR and S3 frees companies from paying for more compute capacity than they need. AWS customers have realized these benefits when setting up petabyte-scale data lakes that combine in-house sequencing data with publicly-available genomic repositories, annotation information, phenotype data and other resources. This pooling of data creates lakes with the scale to identify rare variants and the context to start understanding the significance of sequencing results.

Companies are also using data lakes at the opposite end of the biopharma value chain, for example to analyze how to improve adherence to drug regimens. Having access to a pooled repository of data allows companies to ask questions about what actions improve adherence. The answer may differ depending on the drug and nature of the adherence problem. But the process of analyzing pooled data to identify problems and potential responses remains constant, as does the impact of improving adherence on health outcomes and company financials.

HOW TO IMPLEMENT DATA LAKES

The barriers to setting up data lakes have come down since early pioneers such as Merck used the approach. Initially, companies had to create the structure, metadata system and governance of their data lakes from scratch. Mistakes at this stage have serious implications. Failure to marry metadata to an effective search function makes it difficult to locate data. This is a common enough problem for the resulting hard-to-search repositories to have their own name: Data swamps.

Data lake solutions from vendors such as AWS have eliminated this danger by ensuring websites point to where datasets are and describe what they contain. These solutions also simplify security and compliance by using the strong safeguards and administrative controls built into cloud services.
The upshot is biopharma companies can now create secure and searchable data lakes in minutes. Companies that are seizing this opportunity recognize it is insufficient to simply generate data across their businesses. They know it is also essential to be smart about how that data is captured, characterized, shaped and given meaning. Data lakes are an enabler of this way of thinking. It is a way of thinking that goes beyond generating data to answer known questions today. Instead, it gathers, organizes and explores data to break new ground, answering questions nobody knew to ask and making discoveries for which nobody knew to look.

For over 10 years, Amazon Web Services has been the world's most comprehensive and broadly adopted cloud platform. AWS offers over 90 fully featured services for compute, storage, networking, database, analytics, application services, deployment, management, developer, mobile, Internet of Things (IoT), Artificial Intelligence (AI), security, hybrid, and enterprise applications, from 42 Availability Zones (AZs) across 16 geographic regions in the U.S., Australia, Brazil, Canada, China, Germany, India, Ireland, Japan, Korea, Singapore, and the UK. AWS services are trusted by millions of active customers around the world – including the fastest growing startups, largest enterprises, and leading biotechnology, pharmaceutical and medical device companies – to power their infrastructure, make them more agile, and lower costs. To learn more about AWS in biotech and pharma, visit
