Data lakes: How to glean new
insights from existing data
the platform, the team had a conclusive answer
to why discard rates for one vaccine were higher
than expected.
Merck made the breakthrough by plotting data
from every vaccine batch ever produced at the
plant on a heat map. This revealed patterns that
led Merck to identify fermentation performance
traits that correlated closely with yield. Merck ran 15
billion calculations and 5.5 billion batch-to-batch
comparisons in its search for the answer.
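The kind of trait-to-yield analysis Merck describes can be sketched in miniature. Everything below is illustrative: the trait names, the synthetic data and the simple Pearson-correlation approach are assumptions for the sketch, not Merck's actual pipeline.

```python
import numpy as np

# Illustrative only: synthetic fermentation data for 1,000 vaccine batches.
# Trait names, values and the yield relationship are invented for the sketch.
rng = np.random.default_rng(0)
n_batches = 1000
traits = {
    "ph": rng.normal(7.0, 0.2, n_batches),
    "temperature": rng.normal(37.0, 0.5, n_batches),
    "dissolved_oxygen": rng.normal(30.0, 5.0, n_batches),
}
# Build in a dependence on one trait so that it stands out in the search.
yield_ = 0.8 * traits["dissolved_oxygen"] + rng.normal(0.0, 1.0, n_batches)

# Correlate each fermentation trait with yield across all batches, in the
# spirit of Merck's search for traits that tracked discard rates.
correlations = {
    name: float(np.corrcoef(values, yield_)[0, 1])
    for name, values in traits.items()
}
best_trait = max(correlations, key=lambda name: abs(correlations[name]))
print(best_trait, round(correlations[best_trait], 2))
```

At production scale the same pairwise logic runs across every batch and trait combination, which is how the comparison count climbs into the billions.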
Other companies followed Merck in setting up data
lakes to glean new insights from old data. Amgen
embraced the idea after its process development and
observational research groups asked for capabilities
that were beyond the scale and functions of its existing
data warehouses.
Members of the process development group wanted
to use the growing output of data from Amgen's
labs, production lines and bioreactors to optimize
processes. Similarly, statisticians and epidemiologists
on the observational research team had access to
ever more real-world evidence but lacked tools to
effectively mine the data for insights into the safety,
efficacy and economic value of Amgen's products.
Amgen responded to both situations by centralizing its
existing repositories to create a data lake. The biotech
then added the means for users to automatically spin
up environments with the tools they need to quickly
uncover insights in data housed in the central repository.
BUILDING AND HOSTING DATA
LAKES IN THE CLOUD
The fast expansion in the data available to
manufacturing teams and growing need for biopharma
companies to empirically demonstrate the value
of their products mean process development and
observational research groups are two of the big
beneficiaries of data lakes. They are far from the
only functional areas to adopt the model, though.
The emergence of population-scale sequencing
means data lakes are equally valuable to genomic
researchers.
"We have customers that have put together data
lakes of close to 20 petabytes of mostly sequencing
data," Patrick Combes, Global Technical Leader,
Healthcare and Life Sciences at AWS, said.
The size of these data lakes shows why the cost of
storage is critical to the concept. The whole point of
a data lake is that it houses all of an organization's
data. If concerns about cost force an organization
to start being selective about what goes into the
data lake, the value of the system diminishes as it is
less well equipped to answer unforeseen questions.
When coupled with the fact that data lake technology
needs to be scalable, extensible and flexible, these cost
considerations make the cloud well suited to the
concept. The cloud offers low-cost storage that scales
automatically in line with user needs. On-premises
storage, in contrast, forces companies to estimate
their future needs and pay for more capacity than
they currently use.
The cost effectiveness of a cloud data lake is further
improved by using technologies that handle
compute and storage separately, such as Amazon
EMR for the former and Amazon S3 for the latter.
This allows storage to be scaled independently of
compute capacity. As data lakes use more storage
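The separation of storage and compute described above can be illustrated with a small sketch. A local directory and a thread pool stand in for S3 and EMR here; the point is only that the number of stored objects (the storage tier) and the number of workers (the compute tier) are sized independently of each other.

```python
from concurrent.futures import ThreadPoolExecutor
import csv
import pathlib
import tempfile

# Local directory standing in for an S3 bucket: storage grows simply by
# adding objects, with no change to the compute tier below.
bucket = pathlib.Path(tempfile.mkdtemp())
for batch in range(20):  # 20 "objects" holding per-batch records
    (bucket / f"batch_{batch}.csv").write_text(
        f"batch,yield\n{batch},{90 + batch % 5}\n"
    )

def summarize(path):
    # One unit of compute work: read an object and extract its yield figure.
    with path.open() as f:
        row = next(csv.DictReader(f))
    return float(row["yield"])

# Compute tier sized separately: four workers here, regardless of how many
# objects the "bucket" holds.
with ThreadPoolExecutor(max_workers=4) as pool:
    yields = list(pool.map(summarize, sorted(bucket.glob("*.csv"))))

print(len(yields), sum(yields) / len(yields))  # → 20 92.0
```

In a real deployment the same shape holds: objects accumulate in S3 at storage prices, while an EMR cluster is provisioned, resized or torn down to match the analysis workload.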