How (and why) we build marketplace data pipelines — Part 1: Data Warehouses

We live in a world that consists of data. Messages, purchases, reports, telemetry, even your thoughts are data. And rich is that company which understands data’s importance and uses it properly (hi Facebook).

This is the first article of a series about company data pipelines. In this series I will explain:

  • WHY you should build a company data warehouse (👈 we’re here)
  • HOW you should prepare for it
  • HOW you can build it
  • HOW you can use it.

Let's dive into data pipelines ⛲️

Why do you need a data warehouse?

Let’s start from the definition of a data warehouse. It is a database that stores all your company data:

  • product business data,
  • data from CRMs,
  • data from analytics systems (Google Analytics, Amplitude, etc.).

And it is optimized for querying this data as quickly as possible.

[Image: data-pipeline.png]

So why do you need it? There are a couple of reasons.

All data is stored in one place

When all your product data is stored in one place, a data analyst (or a business analyst, or a product manager) can write queries that combine data from different sources.

For example, I want the following report: how many users who have the ‘Reengaged’ flag in our CRM have viewed a particular website section? And I would like to group this cohort in the report by country.

Without a data warehouse I’d have to:

  1. go to the CRM system,
  2. grab users with the flag ‘Reengaged’,
  3. go to the application database and take users’ country,
  4. go to the analytics system to get the users who viewed this section, and then merge the three exports by hand.

Not so easy, right?

But with a data warehouse you just write one query and get the report in seconds, because all the information you need is already in one place.
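To make this concrete, here is a minimal sketch of such a report, using an in-memory SQLite database as a stand-in for the warehouse. The table names (crm_users, app_users, page_views), their columns, the 'pricing' section and the sample rows are all assumptions made up for illustration; a real warehouse will have its own schema.

```python
import sqlite3

# Stand-in for the warehouse. All table names, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_users  (user_id INTEGER, flag TEXT);     -- loaded from the CRM
    CREATE TABLE app_users  (user_id INTEGER, country TEXT);  -- loaded from the application database
    CREATE TABLE page_views (user_id INTEGER, section TEXT);  -- loaded from the analytics system

    INSERT INTO crm_users  VALUES (1, 'Reengaged'), (2, 'Reengaged'), (3, 'New'), (4, 'Reengaged');
    INSERT INTO app_users  VALUES (1, 'DE'), (2, 'US'), (3, 'US'), (4, 'DE');
    INSERT INTO page_views VALUES (1, 'pricing'), (1, 'pricing'), (2, 'pricing'),
                                  (2, 'blog'), (3, 'pricing'), (4, 'pricing');
""")

# One query joins all three sources: 'Reengaged' users who viewed a section, by country.
REPORT_SQL = """
    SELECT a.country, COUNT(DISTINCT c.user_id) AS users
    FROM crm_users c
    JOIN app_users a  ON a.user_id = c.user_id
    JOIN page_views v ON v.user_id = c.user_id
    WHERE c.flag = 'Reengaged' AND v.section = ?
    GROUP BY a.country
    ORDER BY a.country
"""

report = conn.execute(REPORT_SQL, ("pricing",)).fetchall()
print(report)  # [('DE', 2), ('US', 1)]
```

COUNT(DISTINCT ...) matters here: user 1 viewed the section twice but should be counted once.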

Storage separated from the product’s database

Analytics queries that a data analyst runs usually have to process a huge amount of data. Take our previous example: getting all the views of a certain page section joined with user location.

One query might freeze your application for a couple of minutes if the application shares a database with data analysts. That’s why separate data stores are a big advantage in product analytics.

The three main advantages are:

  1. Data warehouse databases are optimized for heavy computations. They process complicated queries far faster than the regular databases used to store application data.
  2. Analytics queries will not freeze your application because they run on a separate database. Obviously, such a freeze would have a negative impact on UX.
  3. Developers, application users and analysts can work with data independently.

Hypothesis testing

A data warehouse will help you test your product hypotheses and see which ones are confirmed and which fail.

A simple example: you think that yellow t-shirts sell better in the summer season. But after running the query you get a result that says ‘The month with the highest yellow t-shirt sales was October’.

This approach teaches you that your ideas are not truths but hypotheses, and only data can confirm or refute them.
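A check like the t-shirt one could look roughly like this. It is only a sketch: the sales table, its columns and the numbers are invented for illustration, and SQLite again stands in for the warehouse.

```python
import sqlite3

# Hypothetical sales table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sold_on TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("yellow t-shirt", "2023-06-14", 40),
        ("yellow t-shirt", "2023-07-02", 55),
        ("yellow t-shirt", "2023-10-09", 70),
        ("yellow t-shirt", "2023-10-21", 35),
        ("blue t-shirt", "2023-07-15", 90),
    ],
)

# Units sold per calendar month for one product, best month first.
monthly = conn.execute(
    """
    SELECT strftime('%m', sold_on) AS month, SUM(quantity) AS units
    FROM sales
    WHERE product = ?
    GROUP BY month
    ORDER BY units DESC
    """,
    ("yellow t-shirt",),
).fetchall()

best_month, best_units = monthly[0]
summer_months = {"06", "07", "08"}  # assumption: summer means June to August
hypothesis_confirmed = best_month in summer_months
print(best_month, best_units, hypothesis_confirmed)  # 10 105 False
```

With this sample data the best month is October, so the "summer sells better" hypothesis is refuted.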

Find insights and predict the future

Insights are knowledge that is not obvious at first glance. For example, you have a product that works with bank loans. Over time, you find that middle-aged people miss repayments more often than other age segments.

It’s an insight for you because you thought that people of this age could repay money easily. Hence you might increase the interest rate for this user segment to avoid losing money.

In general, insights help you constantly improve your product: you keep the strategies that are confirmed by data and change those that are not.

Also, you can predict users’ behavior in the near future using data. For example, if you know how many sales you had over the last 10 Black Fridays, you can estimate how many you will have this year and what preparations you have to make.
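One very simple way to get such an estimate is to fit a straight line through the past values and extend it by one year. The sketch below does that with plain least squares; the function name linear_forecast and the sales figures are made up, and a real forecast would likely need a better model than a straight line.

```python
def linear_forecast(values):
    """Fit a least-squares line through (0, v0), (1, v1), ... and extrapolate one step ahead."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # value of the line at the next position


# Made-up sales counts for the last 10 Black Fridays, oldest first.
black_friday_sales = [120, 135, 150, 160, 185, 200, 210, 240, 250, 270]
forecast = linear_forecast(black_friday_sales)
print(round(forecast))
```

For this sample series the estimate lands a little above last year's value, which is what you would expect from a steadily growing trend.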

Track metrics and indicators

You might have product metrics that you track on a daily/weekly basis. A data warehouse can calculate them for you and notify you if something goes wrong.

For example, you released a new feature and after that your sales decreased compared to last week. An automated check on that metric flags the drop right away, so you can investigate the release.
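A check of this kind can be as small as the sketch below. The function names, the 10% threshold and the numbers are assumptions for illustration; in practice the two values would come from a warehouse query and the message would go to whatever notification channel you use.

```python
def week_over_week_change(current, previous):
    """Relative change versus the previous week, e.g. -0.2 for a 20% drop."""
    if previous == 0:
        raise ValueError("previous week's value must be non-zero")
    return (current - previous) / previous


def check_metric(name, current, previous, max_drop=0.10):
    """Return an alert message if the metric fell by more than max_drop, otherwise None."""
    change = week_over_week_change(current, previous)
    if change < -max_drop:
        return f"ALERT: {name} changed {change:.0%} week over week"
    return None


# Made-up weekly sales: 1000 last week, 800 this week.
alert = check_metric("sales", current=800, previous=1000)
print(alert)  # ALERT: sales changed -20% week over week
```

A small dip inside the threshold, such as 980 against 1000, returns None and stays silent.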

When is building a data warehouse a bad idea?

A data warehouse is a great tool to understand what happens with your product.

But I’d be lying if I said it’s something that every product should have. In this section we’ll outline when building a data warehouse is not a good idea.

First of all, you should understand that a data warehouse is not a cheap solution. You have to pay for data storage (data lake, data warehouse), the compute for ETL jobs, plus additional monitoring of data pipelines.

Also, you will pay data engineers to set up data pipelines and support them. A data pipeline is a workflow that moves data from the product database to the data warehouse, with all the transformations this requires.

And don’t forget about hiring data analysts. Of course, business analysts might do analytics work. But as your product grows you will need a person or team who works only with data: finding insights, observing metrics, etc.
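As a rough illustration of that definition, here is a toy pipeline with the three classic steps: extract, transform, load. Two in-memory SQLite databases stand in for the product database and the warehouse; the users and dim_users tables, their columns and the cleaning rules are all hypothetical.

```python
import sqlite3


def extract(product_db):
    """Read raw user rows from the (hypothetical) product database."""
    return product_db.execute("SELECT id, email, country FROM users").fetchall()


def transform(rows):
    """Drop rows without an email, trim and lowercase emails, uppercase country codes."""
    return [
        (user_id, email.strip().lower(), country.upper())
        for user_id, email, country in rows
        if email
    ]


def load(warehouse, rows):
    """Upsert the cleaned rows into a warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS dim_users "
        "(user_id INTEGER PRIMARY KEY, email TEXT, country TEXT)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO dim_users VALUES (?, ?, ?)", rows)
    warehouse.commit()


# Made-up source data.
product_db = sqlite3.connect(":memory:")
product_db.execute("CREATE TABLE users (id INTEGER, email TEXT, country TEXT)")
product_db.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, " Ann@Example.com ", "de"), (2, None, "us"), (3, "bob@example.com", "us")],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(product_db)))
loaded = warehouse.execute("SELECT * FROM dim_users ORDER BY user_id").fetchall()
print(loaded)  # [(1, 'ann@example.com', 'DE'), (3, 'bob@example.com', 'US')]
```

Real pipelines add scheduling, incremental loads and monitoring on top of this, which is where most of the cost mentioned above comes from.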

A data warehouse is an expensive instrument that does not fit every product. It requires a lot of resources and only medium or large products can afford to build it — and stand to benefit enough from it to justify the costs.

Take products that are in the MVP stage, for example. The main goal of an MVP is to release as quickly as possible and validate market demand. A data warehouse is overengineering in this case.

There is not much data at this stage, and it’s better to spend money on a marketing campaign and use free analytics tools to assess growth metrics.


Summary

A data warehouse is a solution that every medium or large product should have. It helps your team keep all your product data in one place, fetch it quickly, find insights, make predictions and much more.

But it’s not free and you should be ready to spend money and people’s working hours to build and keep it working.

Stay tuned, part 2 — How to prepare your data lake — is coming soon 🚰
