Life Sciences

Data Lakes: How to glean new insights from existing data

Issue link: https://read.uberflip.com/i/1182522


the platform, the team had a conclusive answer to why discard rates for one vaccine were higher than expected. Merck made the breakthrough by plotting data from every vaccine batch ever produced at the plant on a heat map. This revealed patterns that led Merck to identify fermentation performance traits that correlated closely with yield. Merck ran 15 billion calculations and 5.5 billion batch-to-batch comparisons in its search for the answer.

Other companies followed Merck in setting up data lakes to glean new insights from old data. Amgen embraced the idea after its process development and observational research groups asked for capabilities beyond the scale and functions of its existing data warehouses. Members of the process development group wanted to use the growing output of data from Amgen's labs, production lines and bioreactors to optimize processes. Similarly, statisticians and epidemiologists on the observational research team had access to ever more real-world evidence but lacked tools to effectively mine the data for insights into the safety, efficacy and economic value of Amgen's products. Amgen responded to both situations by centralizing its existing repositories to create a data lake. The biotech then added the means for users to automatically spin up environments with the tools they need to quickly uncover insights in data housed in the central repository.

BUILDING AND HOSTING DATA LAKES IN THE CLOUD

The fast expansion in the data available to manufacturing teams and the growing need for biopharma companies to empirically demonstrate the value of their products mean process development and observational research groups are two of the big beneficiaries of data lakes. They are far from the only functional areas to adopt the model, though. The emergence of population-scale sequencing means data lakes are equally valuable to genomic researchers.
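The kind of batch analysis Merck describes above, screening process traits for correlation with batch yield, can be sketched in Python. The trait names, data and effect sizes below are synthetic placeholders invented for illustration, not Merck's actual variables.

```python
# Hypothetical sketch: rank fermentation traits by how strongly they
# correlate with batch yield. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(42)
n_batches = 200
traits = {
    "temperature": rng.normal(37.0, 0.5, n_batches),
    "ph": rng.normal(7.0, 0.1, n_batches),
    "dissolved_o2": rng.normal(40.0, 5.0, n_batches),
}
# Synthetic yield: made to depend on one trait so the screen finds a signal.
yield_ = 100 + 2.0 * traits["dissolved_o2"] + rng.normal(0, 5, n_batches)

# Pearson correlation of each trait with yield, ranked by absolute value.
corrs = {
    name: float(np.corrcoef(values, yield_)[0, 1])
    for name, values in traits.items()
}
ranked = sorted(corrs, key=lambda name: abs(corrs[name]), reverse=True)
print(ranked[0])  # the trait driving yield should rank highest
```

At production scale this screen would run over every batch and every measured trait, which is where the billions of comparisons in Merck's search come from.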
"We have customers that have put together data lakes of close to 20 petabytes of mostly sequencing data," Patrick Combes, Global Technical Leader, Healthcare and Life Sciences at AWS, said.

The size of these data lakes shows why the cost of storage is critical to the concept. The whole point of a data lake is that it houses all of an organization's data. If concerns about cost force an organization to be selective about what goes into the data lake, the value of the system diminishes, as it is less well equipped to answer unforeseen questions. Coupled with the fact that data lake technology needs to be scalable, extensible and flexible, these cost considerations mean the cloud is well suited to the concept. The cloud offers low-cost storage that scales automatically in line with user needs. On-premises storage, in contrast, forces companies to estimate their future needs and pay for more capacity than they currently use.

The cost effectiveness of a cloud data lake is further improved by using technologies that handle compute and storage separately, such as Amazon EMR for the former and Amazon S3 for the latter. This allows storage to be scaled independently of compute capacity. As data lakes use more storage
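The cost argument for pay-as-you-go cloud storage over pre-provisioned on-premises capacity can be illustrated with a back-of-the-envelope sketch. All prices, capacities and growth figures below are invented assumptions for illustration, not vendor quotes.

```python
# Illustrative comparison: cloud storage billed only on data actually held,
# versus on-premises capacity provisioned (and paid for) up front.
# All figures are assumptions, not real vendor pricing.

CLOUD_PRICE_PER_TB_MONTH = 23.0   # assumed $/TB-month, pay-as-you-go
ONPREM_PROVISIONED_TB = 500       # capacity bought up front for the period
ONPREM_PRICE_PER_TB_MONTH = 15.0  # assumed amortized $/TB-month on all 500 TB

def monthly_costs(start_tb: float, monthly_growth_tb: float, months: int):
    """Return (cloud_total, onprem_total) dollars over the period."""
    cloud = onprem = 0.0
    used = start_tb
    for _ in range(months):
        # Cloud bills only the terabytes currently stored.
        cloud += used * CLOUD_PRICE_PER_TB_MONTH
        # On-premises pays for the full provisioned capacity, used or not.
        onprem += ONPREM_PROVISIONED_TB * ONPREM_PRICE_PER_TB_MONTH
        used += monthly_growth_tb
    return cloud, onprem

cloud, onprem = monthly_costs(start_tb=50, monthly_growth_tb=10, months=36)
print(f"cloud: ${cloud:,.0f}, on-prem: ${onprem:,.0f}")
```

Because the provisioned on-premises capacity is paid for whether or not it is used, the pay-as-you-go model comes out cheaper in this sketch even at a higher per-terabyte price.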
