Data Warehouse with Snowflake 101 for Data Engineering
Learn how big companies use Data Warehouse to make data-driven decisions
You use applications like Instagram, Facebook, and Netflix. You even know that these companies collect your data. Everyone knows about it; everyone talks about it. These companies collect the data to make better decisions, understand customers, and improve the business process.
But do you know how these things happen from a technological standpoint?
In this blog, I will give you a complete understanding of it.
The growth of data is massive and exponential. As for the recent report, 90% of the worst data was generated in the last 2 years, and this report is just 2 years old.
If you look at the bigger picture, you use your phones and computers daily. The amount of data that is getting generated and processed daily is massive.
This data generally gets stored on relational databases like PostgreSQL or as a text file on some file-based storage system.
But, in the end, all this data gets stored in one place. It is something called a Data Warehouse.
Data warehouses are the system specially built for analytical workloads. So when you want to store a huge volume of data and read them in bulk, you can easily do that using a data warehouse.
You can also find answers to questions such as:
How much time did this user spend on this page today compared to yesterday?
How did we do on this particular product sale this year compared to last year's?
These questions can be easily answered within seconds using a data warehouse.
So, when companies like Google and Facebook started growing, they started collecting massive volumes of data and wanted to process this data and find valuable insights from it.
Those were the days when very few people had access to the internet so the amount of data that used to get generated was little, and it was easy to process them using traditional technologies.
But as more and more people started getting access to the internet, the data began growing at an exponential rate.
At this time, the traditional data warehouse technologies couldn’t handle the huge volume and the speed at which the data was generated.
They had the architecture like this:
This is the Shard Disk Architecture where you have one disc and multiple users trying to connect via some network.
The other architecture was like this
Where you have shared databases, but to run the query, you have to use the distributed query across different nodes.
In a nutshell, these technologies couldn’t handle the generated data's speed and size.
Traditional data warehouse systems started to struggle.
They also required significant time and resources to scale.
Database performance was the big issue.
If you wanted to process a huge volume of data, it might take days or even weeks.
And more than that, the cost of managing all of these things was expensive.
All of these data warehouse technologies only supported limited data types
So if you wanted to store the data, you had to make your data into a structured format. So we have something called ETL (Extract, Transform, Load), where you extract data from multiple places, do some transformation, and then load the structured data onto the data warehouse.
So you have to do this entire processing before loading your data onto the data warehouse.
So all of these technologies couldn't handle the new age of data, and we needed something modern. And this is where the Snowflake database comes into the picture.
Snowflake is a new type of data warehouse available entirely on the cloud. The cloud means you are using someone else's computer.
So, in traditional data warehouses, you had to buy your hardware, make sure everything scales properly, and update the software. Even if you rent the server from someone else, you have to manage most of the things by yourself.
But the concept of the data cloud changed the entire game. Now, you don't have to worry about all of these things.
You just need to focus on your business side and make sure how you process your data.
What's cool about Snowflake is how it processes and stores your data. It keeps your data storage and your computer layer separate so that businesses can store more and more data and also efficiently process this data.
This is the architecture of Snowflake.
At the bottom: we have the data storage layer where all the data gets stored.
Middle: Compute layer where we can allocate resources to process our data and run queries.
On Top: Cloud service layer where you can access different features like authentication, and security, and manage the overall infrastructure.
For example, if you have multiple teams working within the organization, you can create something called virtual warehouses where you can allocate different sizes of CPUs and RAM for different teams.
And based on the requirements, they will only use the allocated resources.
This way, you will not face the performance challenges. And even if you need more resources, it will scale all of the systems.
This makes Snowflake one of the most powerful database technologies available.
There are so many new things that Snowflake has added based on the problems they see in the market.
If you want to store structured or unstructured data, you can easily do that.
If you want to query data that is stored at some different location, you can also create tables on top of it and start querying it. You don't have to move your data from that location to your location. Y
You don't require the ETL part. You can directly load your data onto Snowflake and then do the transformation using Python code or SQL code
One of my favorite features on Snowflake is something called a Snowpipe.
So, if you want to create the data pipeline based on some event, let's say your data is coming onto Amazon S3. So, whenever any new file gets uploaded, the Snowpipe will get triggered, and it will directly store your data in the Snowflake table.
What Snowflake does is that they work with large enterprises and try to understand what the real problem is. And based on that, they try to solve these problems by building the right features. And they have hundreds of features that you can explore.
End-to-end example to understand Snowflake in action
Whenever you want to learn anything, the first step is not to find courses, resources, or books. The first step is to go to the website and create your account.
Go to the Snowflake Page and Create Your FREE Account
They provide three trials for 30 days, and you get $400 worth of credit. All you have to do is fill in your information then they will ask you to choose the Snow Edition.
You have three options: Standard, Enterprise, and Business Critical.
We will go with the Business Critical version because that provides most of the features that we want.
You can choose your cloud provider. So, if you're working in a company and they have their existing infrastructure on a particular cloud, you can do that.
I'll just go with AWS. And region, you can select the nearest region as per your location.
Once you do that, you will get the activation mail in your email. So, you can go to your registered email, and here you will find all of the information. Just click on the "Click Activate." add your Username and Password, and you’ll see this
Worksheet: It's where you write and run SQL queries.
Streamlist: Shows what's happening in your Snowflake account in real-time.
Dashboard: Makes data easy to understand with visual charts and graphs.
Apps: Ready-made tools to do more with your data.
Marketplace: Place to find and install extra data tools and services.
Click on Worksheet - > Create new SQL Worksheet
On the left side, you’ll see Databases/Schema/Tables
On Top Right, you can select Data Warehouse Compute run query
On “No Database Selected” about the query editor, you can select the specific database to work with
Select the right schema and run your first SELECT query
This will scan the entire table and give you the final output.
It has around 15 million rows available. You can also get more understanding about the query, like how much time it took. If you want to debug that, you can just click on this query ID, and you will get the information about each and everything that this query did in the backend.
Running simple aggregation query
SELECT SUM(L_quantity) AS total_quantity_sold FROM
"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF100"."LINEITEM"
Running simple JOIN
SELECT c.c_name, o.o_orderkey, o.o_orderdate, o.o_totalprice
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF100"."ORDERS"
JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF100"."CUSTOMER" c ON 0.0_custkey = c.c_custkey;
Joining two tables based on common keys, the order table and the customer table, and then selecting these four columns.
Creating Warehouse
Go to Admin → Warehouses and Click on “+ Warehouse” at top right
Here you can create your custom warehouse as per your requirements
Give name, Select Type and Size
Advance options to Auto Resume and Auto Suspend warehouse as per the need
Snowflake has many different features:
Snowpipe: Automatically ingests data into Snowflake from various sources like files or streams, ensuring continuous data loading without manual intervention.
Time Travel: Allows you to query data at specific points in time, facilitating historical analysis and debugging by accessing past versions of your data.
Restore: This enables you to restore your entire database or specific objects to a previous state, providing data recovery and rollback capabilities in case of accidental changes or data corruption.
Data Sharing: Allows secure and controlled sharing of live data across different Snowflake accounts, enabling collaboration and data monetization without data movement.
Materialized Views: Improve query performance by pre-computing and storing the results of frequently used queries, reducing computation time for repeated analyses.
Automatic Clustering: Optimizes data organization and storage by automatically clustering data based on usage patterns, improving query performance and reducing costs.
Resource Monitors: Helps manage and allocate resources by setting limits and priorities for different workloads, ensuring optimal performance and cost efficiency.
Zero-Copy Cloning: Quickly create lightweight clones of databases or tables without copying data, saving time and storage costs for development, testing, and analytics purposes.
This was a quick overview of Snowflake for Data Engineering, how to get started, and run basic queries.
If you are interested in learning Snowflake in detail then I have a specialized course: Data Warehouse for Data Engineering with Snowflake
Building strong SQL and Database Fundamentals
A systematic approach to solving any SQL problem
Building Data-Model and Writing Optimized SQL queries
Writing advanced SQL queries
Understand SQL for Data Engineering with Python
Data Extraction, Transformation, and Loading using SQL
Learn everything by building projects (4 Projects)
2 End-To-End Projects here
Thank you for reading :) If you learned something, hit that LIKE Button!