Vol #7 | What is a Cloud Data Lake
Hello to all my readers! I hope you all are having a great weekend.
Today, I will discuss data lakes that are implemented using cloud technologies.
A cloud data lake is the backbone of a modern data ecosystem. It is the primary storage for all your data and all your workloads - from AI to BI, everything!
What is a cloud data lake?
A data lake is a repository where you can store all your data, including structured, semi-structured, and unstructured data.
When such a data lake is created using cloud object storage services, it is called a cloud data lake.
For example, a data lake implemented using AWS S3, Azure ADLS, or Google Cloud Storage can be considered a cloud data lake.
Before the emergence of the cloud, data lakes were implemented using Hadoop technologies like HDFS (Hadoop Distributed File System).
Why do we need a cloud data lake?
A cloud data lake has several benefits, as listed below:
A central data repository to store all kinds of data
Highly durable and available, as it is built on cloud object storage
Easily integrates with other cloud computing services to access and query data (see the query sketch after this list)
Highly secure, with built-in encryption functions and services to encrypt data at rest
Cheaper compared to data warehouse platforms
Supports all workloads - BI reporting, ad hoc queries, AI/ML workloads, analysis, etc.
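To make that integration point a little more concrete, here is a minimal sketch that runs a SQL query over files stored in an S3 data lake using Amazon Athena via boto3. The bucket, database, table, and region names are hypothetical, and the table is assumed to already be registered in a catalogue such as AWS Glue.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical names: an existing Glue database/table over files in the lake,
# and an S3 location where Athena can write its query results.
DATABASE = "sales_lake"
QUERY = "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
OUTPUT = "s3://my-data-lake-athena-results/"

# Submit the query; Athena reads the data directly from S3.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row returned is the column header).
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```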
How to implement a cloud data lake?
A cloud data lake can be implemented using any cloud object storage service.
For example, let's see how to implement a cloud data lake on AWS.
Here are the steps to build a data lake using AWS S3.
Create S3 buckets for storing your data (see the setup sketch after these steps).
Create folders (or separate buckets) for storing the various data layers, like Raw/Curated/Business.
Ingest data from sources ("as is" data) into your raw (Bronze) layer (see the data-flow sketch after these steps).
Perform data quality checks and additional business transformations before storing data in the curated (Silver) layer.
Perform the required business-specific aggregations and modelling to store data in your Business (Gold) layer.
Apply the required encryption and access policies to these buckets to secure the data.
Apply lifecycle policies to archive data based on time and relevance.
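Here is the setup sketch for the steps above that create the buckets and layers, enforce encryption and access controls, and add lifecycle rules. It is a minimal example using Python and boto3; the bucket name, region, prefixes, and retention period are placeholders for illustration, not a production-ready configuration.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")
BUCKET = "my-company-data-lake"  # hypothetical bucket name

# Create the bucket (outside us-east-1 a LocationConstraint is required).
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Create prefixes ("folders") for the Raw/Curated/Business data layers.
for layer in ("raw/", "curated/", "business/"):
    s3.put_object(Bucket=BUCKET, Key=layer)

# Enforce default encryption at rest and block all public access.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Lifecycle rule: archive raw data to Glacier after 90 days (placeholder period).
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-after-90-days",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```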
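And here is the data-flow sketch: it reads an "as is" CSV file from the raw (Bronze) layer, applies simple quality checks before writing Parquet to the curated (Silver) layer, and then aggregates into the Business (Gold) layer. The file paths, column names, and checks are made up for illustration; a real pipeline would typically run on a dedicated engine such as AWS Glue or Spark.

```python
# Requires pandas, plus s3fs and pyarrow for reading/writing S3 paths.
import pandas as pd

BUCKET = "my-company-data-lake"  # hypothetical, matches the setup sketch above

# Ingest "as is" data from a source file landed in the raw (Bronze) layer.
raw = pd.read_csv(f"s3://{BUCKET}/raw/orders/2024-06-01.csv")

# Basic data quality checks and cleanup before the curated (Silver) layer.
curated = raw.dropna(subset=["order_id", "amount"])   # drop incomplete rows
curated = curated[curated["amount"] > 0]               # drop invalid amounts
curated["order_date"] = pd.to_datetime(curated["order_date"])
curated.to_parquet(f"s3://{BUCKET}/curated/orders/2024-06-01.parquet", index=False)

# Business-specific aggregation for the Business (Gold) layer.
revenue_by_region = (
    curated.groupby("region", as_index=False)["amount"].sum()
           .rename(columns={"amount": "total_revenue"})
)
revenue_by_region.to_parquet(
    f"s3://{BUCKET}/business/revenue_by_region/2024-06-01.parquet", index=False
)
```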
What are the current trends?
In the last decade, data ecosystems have evolved significantly. Enterprises first built data warehouses, and then data lakes, in their own data centres. With the rise of the cloud, many enterprises started implementing cloud data lakes.
Databricks (a data and AI company) offers a platform to implement a Lakehouse, which is getting a lot of attention from many enterprises.
In its simplest form, a Lakehouse means creating a data lake using the modern data stack and leveraging it to support AI to BI use cases.
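As one concrete illustration (the article does not prescribe a specific technology), a Lakehouse typically stores tables in an open, transactional format on object storage so that both BI and AI workloads can read the same data. Below is a minimal sketch assuming the open-source Delta Lake format with PySpark; the data and the local path are made up, and on a real lake the path would point to cloud object storage such as an S3 or ADLS location.

```python
# Requires the pyspark and delta-spark packages.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small table in the Delta format; on a real lake this path would be
# an object-storage location (an S3 or ADLS URI) rather than a local directory.
orders = spark.createDataFrame(
    [(1, "EU", 120.0), (2, "US", 80.0)], ["order_id", "region", "amount"]
)
orders.write.format("delta").mode("overwrite").save("/tmp/lake/curated/orders")

# The same table can now be read back with ACID guarantees for BI or ML jobs.
spark.read.format("delta").load("/tmp/lake/curated/orders").show()
```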
We will have to wait and see if Lakehouse becomes the default approach for implementing a central data repository for any organisation.
That's it for today. I hope this article helps you understand cloud data lakes and their benefits.
Thanks, and see you all next week with another article about "Cloud & Data."