Data lakes and swamps, oh my

I was lamenting to my friend and fellow MVP Shamir Charania (blog | Twitter) that I didn’t have a topic for this week’s blog post, so he and his colleague suggested I write about data lakes, and specifically Azure Data Lake.

What is a data lake?

This is what Wikipedia says:

A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

A data swamp is a deteriorated data lake either inaccessible to its intended users or providing little value.

In my opinion the Wikipedia definition has too many words, so let’s rewrite it:

A data lake is a repository of enterprise data stored in its original format. This may take the form of one or more of the following:

structured data from relational databases (rows and columns);

semi-structured data (CSV, log files, XML, JSON);

unstructured data (emails, documents, PDFs); and

binary data (images, audio, video).

(I thought the term “data swamp” was a joke, but it’s 2018 and nothing shocks me anymore.)

If that definition of a data lake sounds like a file system, I’d agree. If it sounds like SharePoint, I’m not going to argue either.

However, the main premise of a data lake is a single point of access for all of an organization’s data, which can be effectively managed and maintained. To differentiate “data lake” from “file system”, then, we need to talk about scale. Data lakes are measured in petabytes of data.

Whoa, what’s a petabyte?

For dinosaurs like me who still think in binary, a petabyte (more precisely called a pebibyte) is 1,024 terabytes (tebibytes), or 1,125,899,906,842,624 bytes (yes, that’s 16 digits).

In the metric system, a petabyte is 1,000 terabytes, or 1,000,000,000,000,000 bytes.

No matter which counting system we use, a petabyte is roughly one million billion bytes. That’s a lot of data.
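To make the arithmetic concrete, here is a small Python sketch comparing the two counting systems. The names PEBIBYTE, PETABYTE and describe are my own, chosen purely for illustration.

```python
# Binary versus decimal petabyte, using the figures quoted above.
PEBIBYTE = 2 ** 50    # 1,024 tebibytes, the "binary" petabyte
PETABYTE = 10 ** 15   # 1,000 terabytes, the metric petabyte


def describe(n_bytes: int) -> str:
    """Format a byte count with thousands separators and its digit count."""
    return f"{n_bytes:,} bytes ({len(str(n_bytes))} digits)"


print("Binary: ", describe(PEBIBYTE))
print("Decimal:", describe(PETABYTE))

# How much larger the binary unit is than the decimal one, as a percentage.
difference_pct = (PEBIBYTE - PETABYTE) / PETABYTE * 100
print(f"The binary unit is {difference_pct:.1f}% larger")
```

Both values have 16 digits, and the binary one works out about 12.6% larger, which is why the distinction starts to matter at this scale.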

Who, what, how?

Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis these organizations handle all sorts of structured and unstructured data.

Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools, and even created new technologies, to manage data of this magnitude in a field called big data.

The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us: if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.

Big data is where we hear about processes like Google’s MapReduce. The Apache Software Foundation created its own open-source implementation of MapReduce called Hadoop. Later, Apache Spark was developed to solve some of the limitations inherent in the MapReduce cluster computing paradigm.
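If you haven’t met MapReduce before, the idea fits in a few lines. What follows is a toy, single-process sketch in plain Python of the classic word count example. It is not Hadoop’s or Google’s API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["data lake data swamp", "data lake"])
print(counts)
```

The map step looks at each document on its own and the reduce step looks at each key on its own, so in principle both can be farmed out to as many workers as there are documents or keys.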

Hadoop and other big data technologies can be thought of as a collection of tools and languages that make analysis and processing of these data lakes more manageable. Some of these tools you’ve already heard of, like JavaScript, Python, R, .NET and Java. Others (like U-SQL) are specific to big data.

What is Azure Data Lake?

From a high level of abstraction, we can think of Azure Data Lake as an infinitely large hard drive. It leverages the resilience, reliability and security of Azure Storage you already know and love. Then, using Hadoop and other toolsets in the Azure environment, data can be queried, manipulated and analysed in the same way we might do it on-premises, but leveraging the massive parallel processing of cloud computing combined with virtually limitless storage.

Note: Microsoft is not the only player in this space. Other cloud vendors like Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer roughly equivalent services for roughly equivalent prices.

Our new definition

With all of that taken into consideration, here is my new definition for “data lake”:

A data lake is a single repository for all enterprise data, in its natural format, which can be effectively managed and maintained using a number of big data technologies.

Share your big data story with me on Twitter at @bornsql.

Photo by Kyle Johnson on Unsplash.