Life Sciences

Data Lakes: How to glean new insights from existing data


exploratory analyses or use the data in machine learning and predictive analytics, they need access to the raw form. Restrictions on the types of data stored in warehouses are an issue for companies interested in pulling data from devices and the internet of things, too.

The warehouse model also hives off data into disparate repositories that may run on different technologies, making it very difficult to combine them for analysis. Datasets generated in clinical trials are kept apart from those gathered by manufacturing, marketing and other functional areas. This is limiting their value.

"Historically, because systems were designed for purpose, data would effectively feed a given intended use. But the data strategy wasn't such that the findings from a specific set of signals in a clinical trial could be correlated very easily across the entire lifecycle of the drug from conception through how it's used in market, reimbursed and the messaging and positioning around it to the provider, payor and patient communities," Mark Johnston, Global Director of Healthcare and Life Sciences at Amazon Web Services (AWS), said.

Recognition of these shortcomings led to the development of a new pipeline paradigm: Data lakes. This approach stores data without first putting it through the extract, transform and load process. Instead, data generated from across the organization flows directly into a centralized repository, only stopping to be tagged to make it easy to find. Services from vendors including AWS facilitate this process.

Users can access raw or previously-transformed data. This frees people from having to start from scratch every time but gives them the flexibility to access raw data if the transformed version is ill-suited for their needs. Importantly, everything is kept in one place, making it easy to find the most appropriate data.
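As a rough illustration of the tag-on-ingest idea described above, the following Python sketch stores payloads untouched and records only tags in a catalog. It is a toy in-memory model, not any vendor's service: the `DataLake` class, its `ingest`, `find` and `read` methods, and the file names and tags are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory stand-in for a centralized repository (hypothetical)."""

    objects: dict = field(default_factory=dict)  # key -> raw bytes, stored as received
    catalog: dict = field(default_factory=dict)  # key -> set of descriptive tags

    def ingest(self, key, raw, tags):
        # No extract/transform/load step: the payload is stored unchanged
        # and only tagged so that it can be found later.
        self.objects[key] = raw
        self.catalog[key] = set(tags)

    def find(self, *tags):
        # Return the keys whose tags include every requested tag.
        wanted = set(tags)
        return sorted(k for k, t in self.catalog.items() if wanted <= t)

    def read(self, key):
        return self.objects[key]


# Illustrative (made-up) datasets from different functional areas.
lake = DataLake()
lake.ingest("trial/site-01.csv", b"patient,signal\n001,0.42\n",
            ["clinical-trial", "raw"])
lake.ingest("mfg/batch-17.json", b'{"temp_c": [36.9, 37.1]}',
            ["manufacturing", "raw"])
lake.ingest("mfg/batch-17-summary.csv", b"batch,mean_temp_c\n17,37.0\n",
            ["manufacturing", "transformed"])

raw_manufacturing = lake.find("manufacturing", "raw")
```

In this sketch the raw and previously transformed versions of the manufacturing data sit side by side in one place, and the catalog query decides which one a user picks up.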
"Data lakes allow you to store any type of data in any format or correlative iteration. When you are looking for datasets that are relevant to your questions, if the data is properly catalogued, they are easily found in a single repository. This single repository model allows you to format, ingest and analyze these data sets into appropriately tooled compute environments," Dario Rivera, Senior Solution Architect at AWS, said. THE FIRST WAVE OF DATA LAKE APPLICATIONS Multiple biopharma companies have shown the value of bringing disparate datasets into a central repository. Merck delivered an early validation of the data lake approach in 2013 when it set out to understand why it was discarding a higher proportion of certain vaccines than usual. The affected site had data on all aspects of the manufacturing workflow, such as minute-by-minute temperature readings from across the facility and process control records for each batch. What it lacked was a way to quickly analyze these resources in search of an answer to its discard problem. That changed when the vaccine team loaded its data into a platform r unning on AWS. Within three months of starting to centralize its data on Data lakes allow you to store any type of data in any format
