
Warehouse or Lakehouse? What to build and what actually matters?

A bit of a ramble of thoughts on the trends I've noticed in data engineering

Atharva Inamdar
5 min read · May 4, 2024

This article is a bit of a dump of my thoughts on some of the trends I have noticed in the technologies and infrastructure patterns that have evolved over the last 10 years or more. A lot of this is my opinion, and it was triggered by a single question in a recent interview: given the reporting needs of a business, can you tell us whether you would choose to implement a data warehouse or a lakehouse, and why?

For a split second I did think about answering with “Microsoft Excel”, having always laughed at how many companies store some critical data in Excel and report on it with pivot tables. Then I realised I needed a job.

In the Beginning

Let’s start with the data warehouse. This is now termed traditional technology. It refers to a single RDBMS optimised for a dimensional data model; in cloud services, think of Amazon Redshift or MS SQL Server. Essentially a database (singleton or distributed). A lakehouse is simply a warehouse built on a data lake instead of a database, the main motivation being to use the same data lake storage for the warehouse model as well as for raw and other non-relational data. Otherwise, pretty much nothing is conceptually different.

Now what makes a tool behave like a data warehouse? A data model that is designed for reporting and BI needs: a dimensional data model. The model is designed to minimise joins between entities so that a user can pivot on any dimension and get results quickly. It focuses on pre-computing the aggregations so that reads are fast. Make note of a couple of things here: precompute aggregations and fast reads.
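To make “precompute aggregations” concrete, here is a minimal PySpark sketch of rolling a fact table up against a date dimension into a summary table. The table and column names (fact_sales, dim_date, agg_sales_monthly) are hypothetical, purely for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dim-model-sketch").getOrCreate()

# Hypothetical star-schema tables; the names and columns are made up.
fact_sales = spark.table("fact_sales")   # order_id, date_key, store_key, amount
dim_date = spark.table("dim_date")       # date_key, year, month

# Pre-compute the aggregate once so BI reads don't scan the raw fact table.
monthly_sales = (
    fact_sales
    .join(dim_date, "date_key")
    .groupBy("year", "month", "store_key")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("order_id").alias("order_count"),
    )
)

monthly_sales.write.mode("overwrite").saveAsTable("agg_sales_monthly")
```

Reports then hit agg_sales_monthly directly, which is exactly the fast-read behaviour the dimensional model is after.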

Architecture Evolution

Let’s take a step back and look at the architecture of these technologies. Initially, warehouses were built using regular relational databases such as SQL Server, MySQL, DB2, etc. Then, as the need to store more and more data was realised, they were scaled vertically (more CPU, more memory on a single instance). This had two bottlenecks: increased costs, and a physical limit on single-server capacity. So distributed computing was adopted, popularised by Hadoop. A cluster of servers with local storage held all the data, and distributed compute processed shards/partitions of the data on each server before collating a single answer back to the user. The split of the storage and compute layers had started. This was still during the days of on-premises servers.

The cloud commoditised and simplified infrastructure for end users. This paradigm was transposed directly into the cloud with even larger Hadoop clusters. Soon the cloud introduced blob storage that was “infinitely” scalable. This allowed storage to be moved from the servers to blob stores, accessed much like network storage, which let compute and storage scale independently. We didn’t need new servers in the cluster just to store more data; the same number of servers became compute-only, with storage scaled separately. This allowed two things to happen: combining hot and cold storage without needing to load/archive data frequently, and decoupling data formats from compute engines.

Cloud blob stores became data lakes. Since the storage was separated from the compute engines, it needed its own governance, metadata, and control; this sprinkling of governance features is what turned a blob store into a data lake. Compute diversified much as databases had: Spark, Flink, Snowflake, Trino, etc. Combine the two, for example Spark + S3, and you essentially have a database engine. But storage sitting further from compute didn’t align with the original idea of colocating compute with data to speed up processing. So now the vast storage needs to be brought to the compute: welcome caching and warm compute pools. An ever-running small pool of compute is relatively cheap and allows “instant” responsive access. It also provides a place to cache frequently accessed data, making access to the lake/blob store look fast. So there is slightly tighter integration between storage and compute again. I don’t think it will fully reverse, the benefits of loosely coupling compute and storage are too high, but there are valid cases for many, many businesses to stick with traditional data warehouse technologies based on databases.
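As a rough sketch of the “Spark + S3 is essentially a database engine” point, assuming a Spark session that already has the S3A connector and credentials configured, and an entirely made-up bucket path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query-sketch").getOrCreate()

# The compute engine is Spark; the "database files" are Parquet objects in a
# blob store. The bucket and prefix below are invented for illustration.
events = spark.read.parquet("s3a://my-data-lake/events/")

events.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").show()
```

Swap the engine for Trino or Flink pointed at the same bucket and the storage layer doesn’t change, which is the whole decoupling argument in one picture.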

Compute

Distributed computing is difficult. Running Hadoop, Spark, Flink, or Trino at large scale is difficult. As users of these technologies, we want to forget about the cluster and focus on writing the analytics code only. Very few engineers are interested in running and managing clusters day to day. This change in focus is leading all the cloud providers to create “serverless” or fully managed services for these technologies.

Storage

On the storage side, cloud infrastructure isn’t the only change. New data formats have emerged: Parquet, ORC, Delta Lake, Hudi, and Iceberg, to name the major ones. If you read their feature lists, they are very similar. In fact, Parquet is used by Delta Lake, Hudi, and Iceberg, which add functionality like ACID transactions, updates, indexing, and partitioning. Does this sound familiar? We’re back to designing and creating a database, with one difference: a decoupled, distributed architecture. To me this means more headache in running these technologies; it is added complexity. Previously, we engineers just needed to think about table design and specifying the right partitions. Now we need to think about configuring the data format and indexing, still the partitioning on blob stores, and additionally the compatibility of these formats with the compute engines. Thankfully, if you choose one of the major formats, it will be supported by the major compute engines. Engineers now have to understand the “internals” of the database to use the latest tools.
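To illustrate, here is a hedged sketch using Delta Lake as the table format. It assumes the delta-spark package is installed and the session is configured with Delta’s extension and catalog; the bucket path and data are invented. Under the hood the table is still Parquet, with a transaction log layered on top providing the ACID commits and updates mentioned above.

```python
from pyspark.sql import SparkSession

# Assumes delta-spark is installed; these two configs are how Delta Lake is
# normally wired into a Spark session.
spark = (
    SparkSession.builder
    .appName("table-format-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, "2024-05-01", 9.99), (2, "2024-05-01", 4.50)],
    ["order_id", "order_date", "amount"],
)

# Still Parquet files on the blob store, plus a _delta_log directory that adds
# transactions, schema enforcement and time travel. The path is hypothetical.
(orders.write.format("delta")
       .mode("overwrite")
       .partitionBy("order_date")
       .save("s3a://my-data-lake/orders_delta/"))

spark.read.format("delta").load("s3a://my-data-lake/orders_delta/").show()
```

Note how much of that snippet is configuration rather than analytics, which is the “understand the internals” complexity I’m grumbling about.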

The Benefits

So what has the industry achieved in evolving this way? Scalability and flexibility. These technologies can handle multiple orders of magnitude more data than before. They can handle a variety of data: not just tables, but hierarchical, graph, and audio/video data. Flexibility comes mostly in the compute layer: you can use a variety of server sizes and scale them up and down, vertically or horizontally. Additionally, the data can be processed using different engines, i.e. Spark, Trino, Flink, etc. This does require rewriting code, but it means different users can choose the engine that suits their needs while processing the same data format.
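A small illustration of that engine choice, assuming the pre-aggregated table from earlier has been exported as plain Parquet to a made-up local path: the same files can be read by Spark for distributed jobs or by pandas (with pyarrow installed) for quick local analysis, with no export or conversion step in between.

```python
import pandas as pd
from pyspark.sql import SparkSession

path = "/data/lake/agg_sales_monthly/"   # hypothetical Parquet directory

# Distributed engine: Spark scans the same Parquet files across a cluster.
spark = SparkSession.builder.appName("engine-choice-sketch").getOrCreate()
spark.read.parquet(path).groupBy("year").count().show()

# Single-node engine: pandas reads the very same files for quick exploration.
print(pd.read_parquet(path).head())
```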

My Answer

Back to the interview question on data warehouses and lakehouses: when to choose each? Well, “it depends”. One thing is for sure: in either case, you need to model the data correctly for your requirements. These technologies don’t let you get away with a crappy model and crappy data quality. Understand the current needs, guesstimate the future growth, balance the engineering effort and cost, and make a choice.

The Real Answer

An Excel workbook with a pivot table (or two), stored on an SMB file share running on a desktop under someone’s desk in the office. (:
