Data lake or how to not sink
In the previous article, we talked about data warehouse theory: what it is, its pros and cons and why your product needs it. I hope you understood the importance of having a data warehouse for your product, are ready to build it and have all the required resources.
So where to begin? To be honest there are myriad ways to do it and the way depends on different conditions: your data, expertise of your team, budget, etc. I cannot tell you all possible approaches but I will describe the most standard way.
In this article we will talk about a data lake. The building process consists of 3 basic steps.
Step 1. Identify data sources
The first step is to understand what data will be stored in the data warehouse. Define all places where your product’s data lays and the types of this data.
It might be:
- Application databases
- Analytics systems
- Reports from partner products
- CRM systems
- User files
- Other sources
Step 2. Create a data Lake
You need to put all your data from the sources you defined in one place for future processing and moving to a data warehouse. This place is called a Data Lake.
Data Lakes are a storage of raw, unstructured and not processed data from product’s data sources.
What database storage should you use to contain all this data? There are two possible options: analytics database and object storage.
The choice basically depends on these factors:
- Data you have
- Expertise of your team
- Security requirements
But the most important factor is data and its type. Let’s see how we can figure out what data lake you should select.
You can select an analytical database as data lake storage if your data is well structured (e.g it’s from relational databases). Examples are RedShift, Snowflake and BigQuery.
These databases are optimized to efficiently store, quickly retrieve and process the data. We will talk about them in detail in the next article.
Also I want to highlight that an analytics database is more expensive than traditional databases (except Oracle, of course). For example the minimum monthly price for Amazon Redshift is about 200$, and Postgres is about 10$, and your team might not have experience to work with it.
You might ask: what is the difference between application database and data lake in this case? Data lake stores all data your product generates from all sources. It means that data might be duplicated, redundant, rubbish and be not necessary in the near future. And it’s ok because the main rule in the big data world — store everything you can.
In case your data cannot be structured (e.g. data has different types: json reports, music files, pictures, telemetry) you should use object storage as a data lake. For example S3, HDFS, or Google storage.
This type of storage is cheap, easy to use and more flexible because data is just files in different formats (database backups, csv reports, jsons from IoT devices).
Step 3. Setup your data pipeline
So we defined the source of our data and the destination (data lake). How can we move data from former to latter? This is where the data pipeline comes into play.
It is an engineering process that moves data between storages and consists of ETL jobs. ETL is an acronym for Extract, Transform and Load. An ETL job is a process that extracts data from one data source, transforms it somehow and loads into another.
As I said above, a data lake stores unprocessed data so the Transform step can be omitted, you should create an EL job in this case.
There are a bunch of tools that can help you to build your data pipeline. And all these tools can be grouped into four categories based on their infrastructure and supporting organization or vendor:
- Enterprise-grade (IBM DataStage, Fivetran)
- Open-source (Singer, Hadoop)
- Cloud-based (AWS Glue, Google Cloud Dataflow)
- custom ETL tools (in case existing solutions do not fit your requirements)
We will return back to discussion about ETL jobs in the next article.
Don't make a data swamp!
Earlier I said that you should put all your product’s data into a data lake. But there is a danger that might happen with your data lake if any data gets there uncontrollably.
One day you will realize that you don’t understand what the hell is going on in it. There will be so much different data from unknown sources that your data lake turns into a data swamp — a useless and resource-consuming storage quagmire.
You should create a control process in your team/company (name conventions, standards, table of sources and destinations, etc) to avoid data mess in your data lake. This process is called data governance.
Let’s recap what we learned. Steps to build our own data lake:
- Define data sources you have. It will help you to understand your type, format and volume of data and to select data lake storage best fits for you.
- Set up a data lake.
- Set up a data pipeline to move data from sources to your data lake.
And don’t forget to use the data governance process to avoid data lake’s metamorphosis to data swamp. Next time I will talk about the data warehouse stage. Stay tuned!