Data lakes: How to glean new
insights from existing data
than compute capacity, the use of Amazon EMR and
S3 frees companies from paying for more compute
capacity than they need.
AWS customers have realized these benefits when
setting up petabyte-scale data lakes that combine
in-house sequencing data with publicly available
genomic repositories, annotation information,
phenotype data and other resources. This pooling
of data creates lakes with the scale to identify rare
variants and the context to start understanding the
significance of sequencing results.
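
As a rough illustration of why pooling helps, the sketch below merges allele counts from two sources and flags variants whose pooled frequency falls under a cutoff. The variant identifiers, the counts and the 1% cutoff are all hypothetical values chosen for the example, not figures from any AWS customer, and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical per-source counts: variant -> (alt_allele_count, total_alleles).
in_house = {"chr1:12345A>G": (1, 2_000), "chr2:67890C>T": (150, 2_000)}
public_repo = {"chr1:12345A>G": (9, 98_000), "chr2:67890C>T": (7_300, 98_000)}


def pool_counts(*sources):
    """Sum alternate-allele and total-allele counts per variant across sources."""
    pooled = defaultdict(lambda: [0, 0])
    for source in sources:
        for variant, (alt, total) in source.items():
            pooled[variant][0] += alt
            pooled[variant][1] += total
    return {variant: (alt, total) for variant, (alt, total) in pooled.items()}


def rare_variants(pooled, max_frequency=0.01):
    """Return variants whose pooled allele frequency is below the cutoff.

    The 1% default is an assumed cutoff for illustration only.
    """
    return sorted(
        variant
        for variant, (alt, total) in pooled.items()
        if total > 0 and alt / total < max_frequency
    )


pooled = pool_counts(in_house, public_repo)
rare = rare_variants(pooled)
```

With only the in-house source, the first variant rests on a single observation; the pooled counts give a frequency estimate backed by far more samples, which is the point the paragraph above makes about scale.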
Companies are also using data lakes at the opposite
end of the biopharma value chain, for example to
analyze how to improve adherence to drug regimens.
Having access to a pooled repository of data allows
companies to ask questions about what actions
improve adherence.
The answer may differ depending on the drug and
nature of the adherence problem. But the process
of analyzing pooled data to identify problems and
potential responses remains constant, as does the
impact of improving adherence on health outcomes
and company financials.
HOW TO IMPLEMENT DATA LAKES
The barriers to setting up data lakes have come down
since early pioneers such as Merck used the approach.
Initially, companies had to create the structure,
metadata system and governance of their data lakes
from scratch. Mistakes at this stage have serious
implications. Failure to marry metadata to an effective
search function makes it difficult to locate data.
The problem is common enough that the resulting
hard-to-search repositories have their own name:
data swamps.
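
To make the idea of marrying metadata to a search function concrete, here is a minimal sketch of an in-memory catalog that records where each dataset lives and what it contains, and supports keyword search. The class names, fields and storage paths are hypothetical and do not represent any vendor's data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record: where a dataset lives and what it contains."""
    name: str
    location: str  # hypothetical storage path
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """A toy metadata catalog with case-insensitive keyword search."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, keyword):
        """Return entries whose name, description or tags mention the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in tag.lower() for tag in e.tags)
        ]


catalog = Catalog()
catalog.register(DatasetEntry(
    name="inhouse_sequencing_2017",
    location="s3://example-bucket/sequencing/2017/",  # made-up path
    description="In-house whole-genome sequencing runs",
    tags={"genomics", "sequencing"},
))
catalog.register(DatasetEntry(
    name="adherence_claims",
    location="s3://example-bucket/adherence/claims/",  # made-up path
    description="Pharmacy refill records for adherence analysis",
    tags={"adherence", "commercial"},
))

hits = catalog.search("sequencing")
```

A dataset that is stored but never registered in such a catalog cannot be found by search, which is how a lake drifts toward a swamp.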
Data lake solutions from vendors such as AWS have
eliminated this danger by ensuring websites point to
where datasets are stored and describe what they contain.
These solutions also simplify security and compliance
by using the strong safeguards and administrative
controls built into cloud services.
The upshot is that biopharma companies can now create
secure and searchable data lakes in minutes.
Companies that are seizing this opportunity recognize
it is insufficient to simply generate data across their
businesses. They know it is also essential to be smart
about how that data is captured, characterized, shaped
and given meaning.
Data lakes are an enabler of this way of thinking. It is
a way of thinking that goes beyond generating data
to answer known questions today. Instead, it gathers,
organizes and explores data to break new ground,
answering questions nobody knew to ask and making
discoveries for which nobody knew to look.
For over 10 years, Amazon Web Services has been the world's most comprehensive and broadly adopted cloud platform.
AWS offers over 90 fully featured services for compute, storage, networking, database, analytics, application services,
deployment, management, developer, mobile, Internet of Things (IoT), Artificial Intelligence (AI), security, hybrid, and
enterprise applications, from 42 Availability Zones (AZs) across 16 geographic regions in the U.S., Australia, Brazil, Canada,
China, Germany, India, Ireland, Japan, Korea, Singapore, and the UK. AWS services are trusted by millions of active
customers around the world – including the fastest growing startups, largest enterprises, and leading biotechnology,
pharmaceutical and medical device companies – to power their infrastructure, make them more agile, and lower costs. To
learn more about AWS in biotech and pharma, visit https://aws.amazon.com/health/biotech-pharma.