Water bodies of “Data”: A metaphor taken too far

Tapas Das
Nov 1, 2022

Pic Courtesy: https://www.pexels.com/

In 2006, British mathematician Clive Humby coined the phrase “Data is the new oil”. The analogy has held up well: data now powers entire industries and holds tremendous value in business decision-making.

However, lately the “data-as-water” analogy has taken over from the “data-as-oil” analogy in a big way. From data lakes to data streams, the world of big data is “awash” (pun intended!) in water metaphors.

Don’t get me wrong. The “data-as-water” analogy makes sense for the most part. Like water, data is a resource that can be stored in static reservoirs (a.k.a. databases or data stores), or allowed to flow from place to place (via ETL or ELT processes). Data can also trigger various actions, just as a river turns a mill wheel or spins a turbine in a dam.

But lately, with the emergence of so many “data bodies of water”, one can’t help but realize that this metaphor has been taken a little too far. Many of us have heard of data lakes by now — but what about “data lake houses” or “data ponds” or “data puddles”? It’s hard to tell which of these terms refer to something substantive, and which are mirages.

So, let’s deep-dive (pun unintended!) into the 8 most common “data bodies of water”, and explore more about these metaphors.

Data Lake👍

The term that started it all. A data lake is a centralized repository that can store all structured and unstructured data at any scale. We can store data as-is, without having to structure it first, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

The lake may exist purely for storage, or it may include a computational layer capable of performing analysis on the data it contains. Either way, the lake metaphor is apt. A data lake’s nearly infinite storage capacity means it can absorb a constant flow of data without filling up or overflowing, just like a real lake fed by a river.
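To make the “store as-is, structure later” idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the sample records, the field names and the helper functions `read_with_schema` and `total_by_user` are all hypothetical, and a real data lake would keep such records in files on object storage rather than in a list.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "ts": "2022-11-01"}',
    '{"user": "b", "amount": 7}',
    '{"user": "a", "note": "free-form text, no amount"}',
]


def read_with_schema(lines):
    """Apply structure at read time: keep the fields we need, coerce types."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": record.get("user"),
            # Missing amounts default to 0; strings and ints become floats.
            "amount": float(record.get("amount", 0)),
        }


def total_by_user(lines):
    """A tiny 'analytics' step run on top of the structured view."""
    totals = {}
    for row in read_with_schema(lines):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


print(total_by_user(RAW_LINES))
```

The raw lines are never rewritten. A different analysis could read the same lines with a different set of fields, which is roughly what separates a lake from a store that enforces one structure on the way in.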

Data Lakehouse👎

When data lakes were relatively new, they were used for storage exclusively. To perform analysis, we needed to copy the relevant data to a separate structure that usually ran on specialized hardware, called a data warehouse.

A data lakehouse is a data solution concept that combines elements of the data warehouse with those of the data lake. Data lakehouses implement data warehouses’ data structures and management features (e.g. concurrent reading and writing of data, schema support, etc.) for data lakes, which are typically more cost-effective for data storage. This eliminates the need to copy and transfer data.

While this does reflect a shift in methodology, ultimately “data lakehouse” is a marketing gimmick. Adding a distributed computational environment doesn’t really change what the data lake is — it just means there are new standards and software to access those datasets.

Data Swamp👍

Put simply, this is what happens when a data lake goes wrong: inadequate data governance, a lack of metadata management, and vast stores of irrelevant data that someone collected without a real plan for using them.

Some data swamps can be cleaned up by using data curation and data governance to organize the data sets. However, organizations are beginning to realize that all the time and effort spent building massive data lakes can be a fruitless exercise if data governance and management are not given due importance from the start.

Data Stream👍

Of late, we use the term “streaming” so much that it’s easy to forget it’s a water metaphor, too. A data stream is a continuous flow of data with no defined beginning or end.

While the term is most often used to describe flows of raw data, such as clickstream data or sensor data from IoT devices, cleaned and processed data can be transmitted in a stream, too. Unlike static data sitting in a data lake, streaming data must be processed or stored sequentially, record by record, as it comes in.
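The record-by-record processing described above can be sketched with a Python generator. This is a toy example, not a real streaming framework: `sensor_readings` is a hypothetical stand-in for an unbounded source such as an IoT feed, and the values are made up.

```python
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[float]:
    """Hypothetical stand-in for an unbounded source such as an IoT feed."""
    for value in [21.0, 21.5, 22.0, 30.0, 22.5]:
        yield value


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Emit the average so far after each record, keeping only two numbers."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# Each record is handled as it arrives; the full stream is never held in memory.
averages = list(running_average(sensor_readings()))
print(averages)
```

The point of the sketch is that `running_average` never needs to see the whole data set. It keeps a count and a total, which is why the same approach still works when the source never ends.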

Data River👎

Some data experts argue that a river is a better metaphor for modern data storage than a lake. A lake is generally static, whereas the flow of real-time data through modern organizations is dynamic, triggering various actions as it flows.

But given that we already have the term data stream to describe data in motion, this new term is largely unnecessary.

Data Ponds and Puddles👎

These are what we call pools of data that are smaller or more specialized than a general-purpose data lake.

Data puddles are built with big data technology but intended for a specialized use case or one team, while a data pond is essentially a disorganized data lake, created by either pooling together several data puddles or by offloading data from a data warehouse onto a new platform.

This is one of those points where attempts to keep up the water metaphor dry up. There’s no need to call an Excel spreadsheet that’s not integrated with a data lake a pond, a puddle or anything else, especially when no one agrees on the exact terminology.

Delta Lake👍

Created by Databricks and donated to the Linux Foundation, Delta Lake is an open source project that re-engineers how a data lake works. Where a plain data lake typically only lets us write data in an immutable fashion, Delta Lake lets us update and delete individual records in the data lake. It also offers additional benefits such as reliability, security and performance, for both streaming and batch operations.
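One way to picture how record-level deletes can sit on top of files that are never edited in place is a copy-on-write table with a log of “add” and “remove” actions. The sketch below is a toy model of that general idea, written under my own assumptions. It is not the Delta Lake API or file format, and `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """Toy copy-on-write table: data files are never edited in place."""

    def __init__(self):
        self.files = {}    # file name -> tuple of records (treated as immutable)
        self.log = []      # ordered actions: ("add" | "remove", file name)
        self._next_id = 0

    def _write_file(self, records):
        """Write a brand-new file and record an 'add' action in the log."""
        name = f"part-{self._next_id}"
        self._next_id += 1
        self.files[name] = tuple(records)
        self.log.append(("add", name))
        return name

    def append(self, records):
        self._write_file(records)

    def _live_files(self):
        """Replay the log to find which files make up the current table."""
        live = []
        for action, name in self.log:
            if action == "add":
                live.append(name)
            else:
                live.remove(name)
        return live

    def read(self):
        return [r for name in self._live_files() for r in self.files[name]]

    def delete_where(self, predicate):
        """Delete records by rewriting only the files that contain a match."""
        for name in self._live_files():
            records = self.files[name]
            kept = [r for r in records if not predicate(r)]
            if len(kept) != len(records):
                self.log.append(("remove", name))
                if kept:
                    self._write_file(kept)


table = ToyTable()
table.append([{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}])
table.append([{"id": 3, "city": "Goa"}])
table.delete_where(lambda r: r["id"] == 2)
print(sorted(r["id"] for r in table.read()))
```

In this toy model the old file is still present after the delete; it is simply no longer part of the current table. An update would follow the same pattern: remove the affected file from the log and add a rewritten copy.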

Data Glaciers and Icebergs👍

Many tech companies put their own twist on the water metaphor by naming their products after types of ice.

Apache Iceberg is an open table format for large analytic datasets (similar to Delta Lake), and Amazon S3 Glacier is a storage class for long-term cold data storage (get it?).

Unlike some of the other terms on this list, these names are quite clever — and the products they describe are actually useful, too.

In Conclusion

As far as we’ve stretched the “data-as-water” metaphor already, it may still have further to go. There are many water-related words that haven’t been absorbed into the tech lexicon yet, and plenty of data-related phenomena that still need naming.

Below are a few more suggestions, which tech organizations will hopefully pick up to name the next generation of data products.

  • Data Droplets
  • Data Waterfall
  • Data Sea
  • Data Wetlands

But these are in the future. For now, the above list is a comprehensive summary of all the data bodies of water you need to know — and a few that you don’t — to navigate the tricky waters of data-related conversations.
