Data Science In Production Episode 5: Data Lakes for Data Science
Data lakes are a great data science enabler, but only if they are built right. In this episode we explore the concept of data lakes and the various ways of designing one to support data science.
I discussed my idea for layers in a lake. Here is a summary of those layers:
RAW – RAW represents an immutable data store. This layer is designed for data to be landed and never changed. Access to this layer is heavily restricted to those who need to see the data unprocessed and to the applications that move data into the RAW layer. There should be an Admin account for reading and writing to this layer and a read-only Reader account for everyone else who needs access. Membership of the Admin group needs to be locked down to development staff and applications.
BASE – BASE holds the cleaned and processed data, based on the data contracts. This layer could be populated entirely by automation driven by metadata.
CURATED – CURATED is the store for data which has been manually developed, for example the combination of multiple files into a single file.
LAB – LAB is an individual area for development staff. It is an isolated area that a data scientist can use to store data, trained models, notebooks, etc., as required by their current project.
EXTERNAL – EXTERNAL is for the placement of files which are to be distributed to other vendors or applications.
MODEL – This is the layer for serialised models.
LIBRARY – LIBRARY is intended to be part of the data lake and would hold compiled code, templates and metadata for processing.
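To make the layering a little more concrete, here is a minimal Python sketch of how the layers and the RAW rules could be modelled. It is only an illustration under my own assumptions: the folder names, the role names (raw_admin, raw_reader) and the helper functions are hypothetical and are not taken from any particular data lake product.

```python
from enum import Enum
from pathlib import PurePosixPath


class Layer(Enum):
    """The seven layers from the summary above (folder names are assumed)."""
    RAW = "raw"
    BASE = "base"
    CURATED = "curated"
    LAB = "lab"
    EXTERNAL = "external"
    MODEL = "model"
    LIBRARY = "library"


# Hypothetical role names: one admin role that can read and write RAW,
# and one reader role that can only read it.
RAW_ROLES = {
    "raw_admin": {"read", "write"},
    "raw_reader": {"read"},
}


def layer_path(root: str, layer: Layer, *parts: str) -> PurePosixPath:
    """Build a path inside a layer, e.g. /lake/raw/sales/2020.csv."""
    return PurePosixPath(root, layer.value, *parts)


def can_access_raw(role: str, action: str) -> bool:
    """Return True only if the role is explicitly granted the action on RAW."""
    return action in RAW_ROLES.get(role, set())


def land_in_raw(landed: set, path: PurePosixPath) -> set:
    """Record a newly landed file; RAW is immutable, so never overwrite."""
    if path in landed:
        raise FileExistsError(f"{path} already landed; RAW is write-once")
    return landed | {path}
```

For example, layer_path("/lake", Layer.RAW, "sales", "2020.csv") gives /lake/raw/sales/2020.csv, can_access_raw("raw_reader", "write") is False, and calling land_in_raw twice with the same path raises FileExistsError. In a real lake the access rules would be enforced by the storage platform's own permissions rather than by application code like this.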