Data lakes: How to glean new
insights from existing data
exploratory analyses or use the data in machine learning
and predictive analytics, they need access to the raw form.
Restrictions on the types of data stored in warehouses
are an issue for companies interested in pulling data
from devices and the internet of things, too.
The warehouse model also hives off data into disparate
repositories that may run on different technologies,
making it very difficult to combine them for analysis.
Datasets generated in clinical trials are kept apart
from those gathered by manufacturing, marketing and
other functional areas. Keeping them apart limits their value.
"Historically, because systems were designed for
purpose, data would effectively feed a given intended
use. But the data strategy wasn't such that the
findings from a specific set of signals in a clinical
trial could be correlated very easily across the entire
lifecycle of the drug from conception through how
it's used in market, reimbursed and the messaging
and positioning around it to the provider, payor
and patient communities," Mark Johnston, Global
Director of Healthcare and Life Sciences at Amazon
Web Services (AWS), said.
Recognition of these shortcomings led to the
development of a new pipeline paradigm: Data lakes.
This approach stores data without first putting it
through the extract, transform and load process.
Instead, data generated from across the organization
flows directly into a centralized repository, only
stopping to be tagged to make it easy to find. Services
from vendors including AWS facilitate this process.
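The flow described above, where data lands unchanged and is only tagged on the way in, can be sketched in a few lines. This is a minimal illustration, not any vendor's API: a local directory stands in for the central repository, and the names `ingest`, `find`, the `raw/` and `catalog/` folders, and the sample tags are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake: Path, name: str, payload: bytes, tags: dict) -> Path:
    """Store the payload unchanged and record its tags in a catalog entry."""
    raw_dir = lake / "raw"
    catalog_dir = lake / "catalog"
    raw_dir.mkdir(parents=True, exist_ok=True)
    catalog_dir.mkdir(parents=True, exist_ok=True)

    target = raw_dir / name
    target.write_bytes(payload)  # no extract/transform step: bytes land as-is

    # The tags live beside the data so the file itself is never modified.
    entry = {"path": str(target), "tags": tags}
    (catalog_dir / f"{name}.json").write_text(json.dumps(entry))
    return target


def find(lake: Path, **wanted) -> list:
    """Return paths of datasets whose tags match every requested key/value."""
    hits = []
    for entry_file in sorted((lake / "catalog").glob("*.json")):
        entry = json.loads(entry_file.read_text())
        if all(entry["tags"].get(key) == value for key, value in wanted.items()):
            hits.append(entry["path"])
    return hits


# Hypothetical sample data from two functional areas, in two formats.
lake = Path(tempfile.mkdtemp())
ingest(lake, "trial_signals.csv", b"patient,signal\n1,0.4\n",
       {"source": "clinical", "format": "csv"})
ingest(lake, "line_temps.json", b'{"t": [20.1, 20.3]}',
       {"source": "manufacturing", "format": "json"})

clinical = find(lake, source="clinical")
```

Because the tags are kept separately from the stored bytes, any format can be ingested without the repository needing to understand it first.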
Users can access raw or previously transformed
data. This frees people from having to start from
scratch every time while still giving them the
flexibility to access raw data if the transformed
version is ill-suited to their needs. Importantly,
everything is
kept in one place, making it easy to find the most
appropriate data.
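One way to picture this choice between raw and transformed data is a dataset record that keeps derived versions alongside the untouched original. The sketch below is an assumption-laden illustration: the `Dataset` class, its `derive` and `get` methods, and the sample temperature readings are all hypothetical and do not describe a specific product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Dataset:
    raw: List[dict]  # records exactly as they arrived
    transformed: Dict[str, List[dict]] = field(default_factory=dict)

    def derive(self, label: str, fn: Callable[[List[dict]], List[dict]]) -> None:
        """Save a transformed version next to the raw records, which stay untouched."""
        self.transformed[label] = fn(self.raw)

    def get(self, label: Optional[str] = None) -> List[dict]:
        """Return a saved transformed version if one exists, otherwise the raw records."""
        if label is not None and label in self.transformed:
            return self.transformed[label]
        return self.raw


def mean_by_batch(rows: List[dict]) -> List[dict]:
    """Collapse per-reading rows into one mean temperature per batch."""
    grouped: Dict[str, List[float]] = {}
    for row in rows:
        grouped.setdefault(row["batch"], []).append(row["temp_c"])
    return [{"batch": b, "mean_temp_c": sum(v) / len(v)} for b, v in grouped.items()]


# Hypothetical sample readings.
readings = Dataset(raw=[
    {"batch": "A", "temp_c": 4.1},
    {"batch": "A", "temp_c": 4.9},
    {"batch": "B", "temp_c": 7.5},
])

readings.derive("mean_by_batch", mean_by_batch)
summary = readings.get("mean_by_batch")  # reuse work someone already did
detail = readings.get()                  # fall back to the raw records
```

The design point is that deriving a new version never overwrites the original, so an analyst who finds the summary unsuitable can always return to the full detail.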
"Data lakes allow you to store any type of data in any
format or correlative iteration. When you are looking
for datasets that are relevant to your questions, if the
data is properly catalogued, they are easily found
in a single repository. This single repository model
allows you to format, ingest and analyze these data
sets into appropriately tooled compute environments,"
Dario Rivera, Senior Solution Architect at AWS, said.
THE FIRST WAVE OF DATA LAKE APPLICATIONS
Multiple biopharma companies have shown the value
of bringing disparate datasets into a central repository.
Merck delivered an early validation of the data lake
approach in 2013 when it set out to understand why
it was discarding a higher proportion of certain
vaccines than usual. The affected site had data on
all aspects of the manufacturing workflow, such
as minute-by-minute temperature readings from
across the facility and process control records for
each batch. What it lacked was a way to quickly
analyze these resources in search of an answer to
its discard problem.
That changed when the vaccine team loaded its
data into a platform running on AWS. Within
three months of starting to centralize its data on