Micro-partitions & Data Clustering
Traditional data warehouses rely on static partitioning of large tables to achieve acceptable performance and enable better scaling. In these systems, a partition is a unit of management that is manipulated independently using specialized DDL and syntax; however, static partitioning has a number of well-known limitations, such as maintenance overhead and data skew, which can result in disproportionately sized partitions.
In contrast, the Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, along with additional significant benefits.
What are Micro-partitions?
All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed). Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allow extremely granular pruning of very large tables, which can consist of millions, or even hundreds of millions, of micro-partitions.
Snowflake stores metadata about all rows stored in a micro-partition, including:
- The range of values for each of the columns in the micro-partition.
- The number of distinct values.
- Additional properties used for both optimization and efficient query processing.
Benefits of Micro-partitioning
- In contrast to traditional static partitioning, Snowflake micro-partitions are derived automatically; they don’t need to be explicitly defined up-front or maintained by users.
- As the name suggests, micro-partitions are small in size (50 to 500 MB, before compression), which enables extremely efficient DML and fine-grained pruning for faster queries.
- Micro-partitions can overlap in their range of values, which, combined with their uniformly small size, helps prevent skew.
- Columns are stored independently within micro-partitions, often referred to as columnar storage. This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
- Columns are also compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
Impact of Micro-partitions
DML
All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of the underlying micro-partition metadata to facilitate and simplify table maintenance. For example, some operations, such as deleting all rows from a table, are metadata-only operations.
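For instance, deleting every row in a table can be resolved entirely from metadata, without scanning any data. A minimal sketch, using a hypothetical sales table (the table and column names here are illustrative only):

```sql
-- Hypothetical table used only for illustration.
CREATE OR REPLACE TABLE sales (
    sale_date DATE,
    amount    NUMBER(10, 2)
);

-- Deleting all rows can be satisfied from micro-partition metadata alone;
-- no micro-partitions need to be scanned.
DELETE FROM sales;

-- TRUNCATE similarly removes all rows as a metadata operation.
TRUNCATE TABLE sales;
```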
Query Pruning
Snowflake uses the metadata it maintains about micro-partitions to prune unneeded micro-partitions, and unneeded columns within them, at query run-time. For example, assume a large table contains one year of historical data with date and hour columns. Assuming uniform distribution of the data, a query targeting a particular hour would ideally scan 1/8760th of the micro-partitions in the table, and then scan only the portion of those micro-partitions that contains the data for the hour column; Snowflake uses columnar scanning of partitions, so an entire partition is not scanned if a query filters by only one column.
In other words, the closer the ratio of scanned micro-partitions and columnar data is to the ratio of the data actually selected, the more efficient the pruning performed on the table.
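As a sketch of what this looks like in practice, consider a query against a hypothetical events table with event_date and event_hour columns (all names assumed for illustration). Snowflake can skip every micro-partition whose metadata shows that its range of values cannot match the filter, and within the surviving micro-partitions it reads only the referenced columns:

```sql
-- Metadata pruning: micro-partitions whose min/max ranges for
-- event_date and event_hour cannot match the filter are skipped.
-- Columnar pruning: only event_type, event_date, and event_hour are read.
SELECT event_type,
       COUNT(*) AS event_count
FROM events
WHERE event_date = '2024-06-01'
  AND event_hour = 9
GROUP BY event_type;
```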
Data Clustering
In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying, significantly accelerating queries that reference these columns.

For example, consider a table of 24 rows stored across 4 micro-partitions, with the rows divided equally between the micro-partitions. Within each micro-partition, the data is sorted and stored by column, which enables Snowflake to perform the following actions for queries on the table:
- First, prune micro-partitions that are not needed for the query.
- Then, prune by column within the remaining micro-partitions.
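A sketch of such a query, assuming the example table is named t1 and includes order_date, id, and name columns among those stored (all names hypothetical):

```sql
-- Step 1: micro-partitions whose order_date range cannot contain
-- '2024-11-02' are pruned without being scanned.
-- Step 2: within the remaining micro-partitions, only the
-- order_date, id, and name columns are read.
SELECT id, name
FROM t1
WHERE order_date = '2024-11-02';
```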
Clustering Information Maintained for Micro-partitions
Snowflake maintains clustering metadata for the micro-partitions in a table, including:
- The total number of micro-partitions that comprise the table.
- The number of micro-partitions containing values that overlap with each other (in a specified subset of table columns).
- The depth of the overlapping micro-partitions.
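These statistics can be inspected with the SYSTEM$CLUSTERING_INFORMATION function, which returns them as JSON for a specified table and set of columns. A sketch (t1, order_date, and id are hypothetical names):

```sql
-- Returns a JSON object including, among other fields, the total number
-- of micro-partitions, the average number of overlapping micro-partitions,
-- and the average overlap depth for the specified columns.
SELECT SYSTEM$CLUSTERING_INFORMATION('t1', '(order_date, id)');
```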
Clustering Depth
The clustering depth for a populated table measures the average depth (1 or greater) of the overlapping micro-partitions for specified columns in a table. The smaller the average depth, the better clustered the table is with regard to the specified columns.
Clustering depth can be used for a variety of purposes, including:
- Monitoring the clustering “health” of a large table, particularly over time as DML is performed on the table.
- Determining whether a large table would benefit from explicitly defining a clustering key.
A table with no micro-partitions (i.e. an unpopulated/empty table) has a clustering depth of 0.
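Clustering depth itself can be retrieved with the SYSTEM$CLUSTERING_DEPTH function, for example to track it over time as DML runs against the table. A sketch, again using the hypothetical t1:

```sql
-- Average overlap depth of the micro-partitions containing order_date values.
-- A value close to 1 indicates a well-clustered table for this column;
-- larger values indicate increasing overlap.
SELECT SYSTEM$CLUSTERING_DEPTH('t1', '(order_date)');
```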
Clustering Depth Illustrated
The following diagram provides a conceptual example of a table consisting of five micro-partitions with values ranging from A to Z, and illustrates how overlap affects clustering depth:
[Diagram: example of clustering depth]
- At the beginning, the ranges of values in all the micro-partitions overlap.
- As the number of overlapping micro-partitions decreases, the overlap depth decreases.
- When there is no overlap in the range of values across all micro-partitions, the micro-partitions are considered to be in a constant state (i.e. they cannot be improved by clustering).
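When overlap depth stays high for columns that queries frequently filter on, one option (noted above) is to define a clustering key explicitly so that Snowflake maintains the ordering of those columns over time. A sketch, with the same hypothetical names:

```sql
-- Define a clustering key on the columns most commonly used in filters.
ALTER TABLE t1 CLUSTER BY (order_date, id);

-- The clustering key can later be dropped if it no longer pays off.
ALTER TABLE t1 DROP CLUSTERING KEY;
```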