Automated World

Tuesday, February 14, 2023

Mom: A Difficult Word, but Easy to Pronounce

Hi bloggers, I know this may not be a great post to read, but I know you will all relate to this emotion. When you are around your Mom, you feel secure. It’s like everything is under control. But when you become one, OH MY GOD! Your life turns upside down.

You cannot think straight. You are always puzzled. You are always clueless. Being a Mom is really difficult. I know it’s a beautiful emotion. It’s a gift, blah blah.

But has anybody thought about how she feels? Why does she become a multitasker? Why is she not looking good today?

The moment she gets to know that she is pregnant, she becomes a Mom. Her life changes. Now everything becomes about the baby. Every now and then she rolls her hand over her tummy, checking and asking, “Is everything alright?” “Is all good in there?”

Every new Mom or Mom-to-be is always worried: if I eat something bad, maybe it will hurt my baby. If I do something wrong or take a wrong step, it will hurt my baby. So you are constantly worried and terrified for no reason. Whenever she goes for an ultrasound, she prays to God. She just wants her baby to be safe and healthy.

I know all the Moms in the world are doing the same thing. They forget about themselves. They forget to check that their nails are not trimmed and the nail paint is gone. Their hair looks messy.

I know we have heard so many poems about her, and about how she compromises on everything for her family. But sometimes you don’t want appreciation; you just want one shoulder to share the pressure with. That person can be anyone: maybe your husband, your mother, your mother-in-law, or any other family member. That pressure should be shared. It’s difficult to handle everything single-handedly.

Enough about the word. Let me tell you about myself. I am a Mom. Surprised! Hehe.

I am the mom of a 1-year-old boy. Let me tell you, although I am very new to this, in this one year I have gained a ton of experience of being a Mom. I am a software engineer as well, and a beautiful wife. I can say I am beautiful; at least when I got married I was beautiful, as per my husband. I just checked with him, and he is fine with this comment, so I think we are good to go.
So, as I said, I have a ton of experience. I know how to be an angry mom, happy mom, super energetic mom, multitasker mom, terrified mom, and super lazy mom, in fact a super bad mom. When you become one, every day a new tag gets added in front of your name. But after everything, after every pain, I am blessed to have become a Mom. When I hold him, I feel lucky, and everything becomes justified and worth it.

Thursday, December 9, 2021

Day 4: Virtual Warehouse (Part 1)

A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. A warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform the following operations in a Snowflake session:

Executing SQL SELECT statements that require compute resources (e.g. retrieving rows from tables and views).
Performing DML operations, such as:

Updating rows in tables (DELETE , INSERT , UPDATE).
Loading data into tables (COPY INTO <table>).
Unloading data from tables (COPY INTO <location>).

Overview of Virtual Warehouse:

Warehouses are required for queries, as well as all DML operations, including loading data into tables. A warehouse is defined by its size, as well as the other properties that can be set to help control and automate warehouse activity. Warehouses can be started and stopped at any time. They can also be resized at any time, even while running, to accommodate the need for more or less compute resources, based on the type of operations being performed by the warehouse.

Impact on Credit Usage and Billing

There is a doubling of credit usage as you increase in size to the next larger warehouse size for each full hour that the warehouse runs; however, note that Snowflake utilizes per-second billing (with a 60-second minimum each time the warehouse starts) so warehouses are billed only for the credits they actually consume.

Note: For a multi-cluster warehouse, the number of credits billed is calculated based on the multi-cluster warehouse size and the number of warehouses that run within the time period. For example, if a 3X-Large multi-cluster warehouse runs 1 warehouse for one full hour and then runs 2 warehouses for the next full hour, the total number of credits billed would be 192 (i.e. 64 + 128).
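The billing rules above can be checked with a little arithmetic. Below is a rough Python sketch, not an official calculator: the helper name billed_credits is made up for illustration, the hourly rates are the standard doubling series starting at 1 credit per hour for X-Small, and the clusters argument stands for the number of warehouses running inside a multi-cluster warehouse.

```python
# Credits per hour for one warehouse of each size; each step up doubles the rate.
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}

MINIMUM_BILLED_SECONDS = 60  # charged each time the warehouse starts


def billed_credits(size, running_seconds, clusters=1):
    """Estimate credits for one continuous run (hypothetical helper)."""
    seconds = max(running_seconds, MINIMUM_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * clusters * seconds / 3600


# The example from the note: 1 warehouse for an hour, then 2 for the next hour.
first_hour = billed_credits("3X-Large", 3600, clusters=1)
second_hour = billed_credits("3X-Large", 3600, clusters=2)
total = first_hour + second_hour
print(total)  # 192.0
```

Because of the 60-second minimum, a run of 10 seconds is billed the same as a run of 60 seconds in this sketch.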

Impact on Data Loading

Increasing the size of a warehouse does not always improve data loading performance. Data loading performance is influenced more by the number of files being loaded (and the size of each file) than the size of the warehouse.

Tip: Unless you are bulk loading a large number of files concurrently (i.e. hundreds or thousands of files), a smaller warehouse (Small, Medium, Large) is generally sufficient. Using a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in any performance increase.

Impact on Query Processing

The size of a warehouse can impact the amount of time required to execute queries submitted to the warehouse, particularly for larger, more complex queries. In general, query performance scales with warehouse size because larger warehouses have more compute resources available to process queries.

If queries processed by a warehouse are running slowly, you can always resize the warehouse to provision more compute resources. The additional resources do not impact any queries that are already running, but once they are fully provisioned they become available for use by any queries that are queued or newly submitted.

Auto-suspension and Auto-resumption

A warehouse can be set to automatically resume or suspend, based on activity:

By default, auto-suspend is enabled. Snowflake automatically suspends the warehouse if it is inactive for the specified period of time.
By default, auto-resume is enabled. Snowflake automatically resumes the warehouse when any statement that requires a warehouse is submitted and the warehouse is the current warehouse for the session.

What is a Multi-cluster Warehouse?

By default, the size of a virtual warehouse determines the compute resources available to the warehouse for executing queries. Each warehouse is a set of compute resources. As queries are submitted to a warehouse, the warehouse allocates resources to each query and begins executing the queries. If sufficient resources are not available to execute all the queries submitted to the warehouse, Snowflake queues the additional queries until the necessary resources become available.

With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, additional warehouses to make a larger pool of compute resources available. A multi-cluster warehouse is defined by specifying the following properties:

Maximum number of warehouses, greater than 1 (up to 10).
Minimum number of warehouses, equal to or less than the maximum (up to 10).

Additionally, multi-cluster warehouses support all the same properties and actions as single warehouses, including:

Specifying a warehouse size.
Resizing a warehouse at any time.
Auto-suspending a running warehouse due to inactivity; note that this does not apply to individual warehouses, but rather the entire multi-cluster warehouse.
Auto-resuming a suspended warehouse when new queries are submitted.

Maximized vs. Auto-scale

You can choose to run a multi-cluster warehouse in either of the following modes:

Maximized

This mode is enabled by specifying the same value for both maximum and minimum number of warehouses (note that the specified value must be larger than 1). In this mode, when the multi-cluster warehouse is started, Snowflake starts all the warehouses so that maximum resources are available while the multi-cluster warehouse is running.

Auto-scale

This mode is enabled by specifying different values for maximum and minimum number of warehouses. In this mode, Snowflake starts and stops warehouses as needed to dynamically manage the load on the multi-cluster warehouse:

As the number of concurrent user sessions and/or queries for the multi-cluster warehouse increases, and queries start to queue due to insufficient resources, Snowflake automatically starts additional warehouses, up to the maximum number defined for the multi-cluster warehouse. Similarly, as the load on the multi-cluster warehouse decreases, Snowflake automatically shuts down warehouses to reduce the number of running warehouses.
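To make the two modes concrete, here is a small Python sketch. The function names are my own, and the 1 to 10 limit is simply the one stated in this post. The generated statement uses the Snowflake warehouse parameters MIN_CLUSTER_COUNT, MAX_CLUSTER_COUNT, AUTO_SUSPEND and AUTO_RESUME; the warehouse name and the values are only placeholders.

```python
def multi_cluster_mode(min_clusters, max_clusters):
    """Classify a warehouse from its minimum and maximum number of warehouses."""
    # Limits as described in this post: minimum <= maximum, both up to 10.
    if not 1 <= min_clusters <= max_clusters <= 10:
        raise ValueError("expected 1 <= min_clusters <= max_clusters <= 10")
    if max_clusters == 1:
        return "Single-cluster"
    return "Maximized" if min_clusters == max_clusters else "Auto-scale"


def create_warehouse_sql(name, size, min_clusters, max_clusters,
                         auto_suspend_seconds=300):
    """Build a CREATE WAREHOUSE statement (illustrative values only)."""
    multi_cluster_mode(min_clusters, max_clusters)  # validate the limits
    return (
        f"CREATE WAREHOUSE {name} WITH "
        f"WAREHOUSE_SIZE = '{size}' "
        f"MIN_CLUSTER_COUNT = {min_clusters} "
        f"MAX_CLUSTER_COUNT = {max_clusters} "
        f"AUTO_SUSPEND = {auto_suspend_seconds} "
        f"AUTO_RESUME = TRUE"
    )


print(multi_cluster_mode(3, 3))  # Maximized
print(multi_cluster_mode(1, 3))  # Auto-scale
statement = create_warehouse_sql("demo_wh", "MEDIUM", 1, 3)
print(statement)
```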

Benefits of Multi-cluster Warehouses

With a standard, single-cluster warehouse, if your user/query load increases to the point where you need more compute resources:

You must either increase the size of the warehouse or start additional warehouses and explicitly redirect the additional users/queries to these warehouses.
Then, when the resources are no longer needed, to conserve credits, you must manually downsize the larger warehouse or suspend the additional warehouses.

In contrast, a multi-cluster warehouse enables larger numbers of users to connect to the same size warehouse. In addition:

In Auto-scale mode, a multi-cluster warehouse eliminates the need for resizing the warehouse or starting and stopping additional warehouses to handle fluctuating workloads. Snowflake automatically starts and stops additional warehouses as needed.
In Maximized mode, you can control the capacity of the multi-cluster warehouse by increasing or decreasing the number of warehouses as needed.

Day 3: Snowflake's Storage Layer

Micro-partitions & Data Clustering

Traditional data warehouses rely on static partitioning of large tables to achieve acceptable performance and enable better scaling. In these systems, a partition is a unit of management that is manipulated independently using specialized DDL and syntax; however, static partitioning has a number of well-known limitations, such as maintenance overhead and data skew, which can result in disproportionately-sized partitions.

In contrast to a traditional data warehouse, the Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, as well as providing additional significant benefits.

What are Micro-partitions?

All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed). Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allows for extremely granular pruning of very large tables, which can be comprised of millions, or even hundreds of millions, of micro-partitions.

Snowflake stores metadata about all rows stored in a micro-partition, including:

The range of values for each of the columns in the micro-partition.
The number of distinct values.
Additional properties used for both optimization and efficient query processing.

Benefits of Micro-partitioning

In contrast to traditional static partitioning, Snowflake micro-partitions are derived automatically; they don’t need to be explicitly defined up-front or maintained by users.
As the name suggests, micro-partitions are small in size (50 to 500 MB, before compression), which enables extremely efficient DML and fine-grained pruning for faster queries.
Micro-partitions can overlap in their range of values, which, combined with their uniformly small size, helps prevent skew.
Columns are stored independently within micro-partitions, often referred to as columnar storage. This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
Columns are also compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.

Impact of Micro-partitions:

DML

All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of the underlying micro-partition metadata to facilitate and simplify table maintenance. For example, some operations, such as deleting all rows from a table, are metadata-only operations.

Query Pruning

The micro-partition metadata also lets Snowflake prune micro-partitions, and columns within them, at query run-time. For example, assume a large table contains one year of historical data with date and hour columns. Assuming uniform distribution of the data, a query targeting a particular hour would ideally scan 1/8760th of the micro-partitions in the table (one year is 365 × 24 = 8760 hours) and then only scan the portion of the micro-partitions that contain the data for the hour column; Snowflake uses columnar scanning of partitions so that an entire partition is not scanned if a query only filters by one column.

In other words, the closer the ratio of scanned micro-partitions and columnar data is to the ratio of actual data selected, the more efficient is the pruning performed on the table.

Data Clustering

In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying, significantly accelerating the performance of queries that reference these columns.

For example, consider a table of 24 rows stored across 4 micro-partitions, with the rows divided equally between the micro-partitions. Within each micro-partition, the data is sorted and stored by column, which enables Snowflake to perform the following actions for queries on the table:

First, prune micro-partitions that are not needed for the query.

Then, prune by column within the remaining micro-partitions.
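The two pruning steps can be sketched in Python. This is only a toy model: the MicroPartition class, the scan function, and the 24 sample rows are made up for illustration, and Snowflake's real storage format is not visible to users.

```python
from dataclasses import dataclass, field


@dataclass
class MicroPartition:
    """Toy micro-partition: each column is stored separately (columnar)."""
    columns: dict                      # column name -> list of values
    ranges: dict = field(init=False)   # column name -> (min, max) metadata

    def __post_init__(self):
        # Metadata is recorded once, when the micro-partition is created.
        self.ranges = {name: (min(v), max(v)) for name, v in self.columns.items()}


def scan(partitions, filter_column, wanted, select_column):
    """Return (matching values, partitions read) for an equality filter."""
    results, partitions_read = [], 0
    for part in partitions:
        low, high = part.ranges[filter_column]
        if not low <= wanted <= high:
            continue  # step 1: pruned using metadata only, no data is read
        partitions_read += 1
        # Step 2: read only the two columns the query references.
        filter_values = part.columns[filter_column]
        selected = part.columns[select_column]
        results += [s for f, s in zip(filter_values, selected) if f == wanted]
    return results, partitions_read


# 24 hypothetical rows spread evenly over 4 micro-partitions (6 rows each).
rows = [(day, f"name{day}", day * 10) for day in range(1, 25)]
partitions = [
    MicroPartition({
        "day": [r[0] for r in chunk],
        "name": [r[1] for r in chunk],
        "amount": [r[2] for r in chunk],
    })
    for chunk in (rows[i:i + 6] for i in range(0, 24, 6))
]

names, read = scan(partitions, "day", 8, "name")
print(names, read)  # ['name8'] 1
```

Only one of the four micro-partitions is read, and within it the "amount" column is never touched.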

Clustering Information Maintained for Micro-partitions

Snowflake maintains clustering metadata for the micro-partitions in a table, including:

The total number of micro-partitions that comprise the table.
The number of micro-partitions containing values that overlap with each other (in a specified subset of table columns).
The depth of the overlapping micro-partitions.

Clustering Depth

The clustering depth for a populated table measures the average depth (1 or greater) of the overlapping micro-partitions for specified columns in a table. The smaller the average depth, the better clustered the table is with regard to the specified columns.

Clustering depth can be used for a variety of purposes, including:

Monitoring the clustering “health” of a large table, particularly over time as DML is performed on the table.
Determining whether a large table would benefit from explicitly defining a clustering key.

A table with no micro-partitions (i.e. an unpopulated/empty table) has a clustering depth of 0.

Clustering Depth Illustrated

As a conceptual example, consider a table consisting of five micro-partitions with values ranging from A to Z. Overlap affects clustering depth as follows:

At the beginning, the range of values in all the micro-partitions overlap.
As the number of overlapping micro-partitions decreases, the overlap depth decreases.
When there is no overlap in the range of values across all micro-partitions, the micro-partitions are considered to be in a constant state (i.e. they cannot be improved by clustering).
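To get a feel for how overlap drives depth, here is a simplified Python sketch. It is not Snowflake's actual calculation (I am not aware of a published formula); it just counts, for each micro-partition, how many micro-partitions (including itself) share part of its value range, and averages those counts. The function names and sample ranges are invented for illustration.

```python
def overlaps(a, b):
    """True when two (min, max) value ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def average_overlap_depth(ranges):
    """Average, over all partitions, of how many partitions overlap each one."""
    if not ranges:
        return 0  # an empty table has a clustering depth of 0
    depths = [sum(overlaps(r, other) for other in ranges) for r in ranges]
    return sum(depths) / len(depths)


# Five micro-partitions that all span A to Z: everything overlaps.
all_overlapping = [("A", "Z")] * 5
# Five micro-partitions with disjoint ranges: nothing left to improve.
no_overlap = [("A", "E"), ("F", "J"), ("K", "O"), ("P", "T"), ("U", "Z")]

print(average_overlap_depth(all_overlapping))  # 5.0
print(average_overlap_depth(no_overlap))       # 1.0
print(average_overlap_depth([]))               # 0
```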

Day 2: Snowflake Architecture

Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.

Snowflake’s unique architecture consists of three key layers:

Database Storage
Query Processing
Cloud Services

Database Storage

When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage.

Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.

Query Processing

Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.

Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.

Cloud Services

The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.

Services managed in this layer include:

Authentication
Infrastructure management
Metadata management
Query parsing and optimization
Access control

Understanding Snowflake Data Transfer Billing

Snowflake is provided as Software-as-a-Service (SaaS) that runs completely on cloud infrastructure. This means that all three layers of Snowflake’s architecture (storage, compute, and cloud services) are deployed and managed entirely on a selected cloud platform.

Tuesday, December 7, 2021

Day 1: Snowflake Overview

Snowflake’s Data Cloud is powered by an advanced data platform provided as Software-as-a-Service (SaaS). Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings.

The Snowflake data platform is not built on any existing database technology or “big data” software platforms such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud. To the user, Snowflake provides all of the functionality of an enterprise analytic database, along with many additional special features and unique capabilities.

Snowflake is a true SaaS offering. More specifically:

There is no hardware (virtual or physical) to select, install, configure, or manage.
There is virtually no software to install, configure, or manage.
Ongoing maintenance, management, upgrades, and tuning are handled by Snowflake.
Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than optional command line clients, drivers, and connectors), run in public cloud infrastructures.
Snowflake uses virtual compute instances for its compute needs and a storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).
Snowflake is not a packaged software offering that can be installed by a user. Snowflake manages all aspects of software installation and updates.

Six Features That Make Snowflake A Different Cloud Data Warehouse:

Snowflake's Cloud Data Platform is one of the go-to tools for companies looking to upgrade to a modern architecture. Clients commonly ask us about Snowflake and which features make it stand out from other cloud data warehouse solutions, such as Amazon Redshift or Azure Synapse.

Cloud Provider Agnostic

Snowflake is a cloud-agnostic solution. It is a managed data warehouse solution that is available on all three major cloud providers (AWS, Azure, and GCP) while retaining the same end-user experience. Customers can easily fit Snowflake into their current cloud architecture and have options to deploy in regions that make sense for the business.

Scalability

Snowflake's multi-cluster shared data architecture separates compute and storage resources. This strategy gives users the ability to scale up resources when they need large amounts of data to be loaded faster, and to scale back down when the process is finished, without any interruption to service. Customers can start with an extra-small virtual warehouse and scale up and down as needed.

To ensure minimal administration, Snowflake has implemented auto-scaling and auto-suspend features. Auto-scaling enables Snowflake to automatically start and stop clusters during unpredictable, resource-intensive processing. Auto-suspend, on the other hand, stops the virtual warehouse when its clusters have been sitting idle for a defined period. These two features provide flexibility, performance optimization, and cost management.

Concurrency and Workload Separation

In a traditional data warehouse solution, users and processes compete for resources, resulting in concurrency issues; hence the need to run ETL/ELT jobs in the middle of the night when no one is running reports. With Snowflake’s multi-cluster architecture, concurrency is no longer an issue. One of the key benefits of this architecture is separating out workloads so that each runs against its own compute cluster, called a virtual warehouse. Queries from one virtual warehouse will never affect queries from another. Dedicating virtual warehouses to users and applications makes it possible to run ETL/ELT processing, data analysis operations, and reports without competing for resources.

Near-Zero Administration

Snowflake is delivered as a Data Warehouse as a Service (DWaaS). It enables companies to set up and manage a solution without significant involvement from DBA or IT teams. It does not require software to be installed or hardware to be commissioned. With features such as resizing the virtual warehouse and auto-scaling the number of clusters, the days of managing server sizes and clusters are gone. Since Snowflake does not use indexes, there is no need to tune the database or index the tables. Software updates are handled by Snowflake, and new features and patches are deployed with zero downtime.

Semi-Structured Data

The rise of NoSQL database solutions came from a need to handle semi-structured data, typically in JSON format. To parse JSON, data pipelines needed to be developed to extract attributes and combine those attributes with structured data. Snowflake’s architecture allows the storage of structured and semi-structured data in the same destination by utilizing a schema-on-read data type called VARIANT. The VARIANT data type can store both structured and semi-structured data. As data gets loaded, Snowflake automatically parses the data, extracts the attributes, and stores them in a columnar format, eliminating the need for data extraction pipelines.

Security

From the way users access Snowflake to how data is stored, Snowflake has a wide array of security features. You can manage network policies by whitelisting IP addresses to restrict access to your account. Snowflake supports various authentication methods, including two-factor authentication and SSO through federated authentication. Access to objects in the account is controlled through a hybrid model of discretionary access control (each object has an owner who grants access to the object) and role-based access control (privileges are assigned to roles, which are then assigned to users). This hybrid approach provides a significant amount of control and flexibility. All data is automatically encrypted using AES-256 strong encryption and is encrypted in transit as well as at rest.
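The hybrid access model is easier to see in a few lines of Python. This is a toy model, not Snowflake code: the AccessControl class, its methods, and the sample names are all hypothetical, and it assumes that the owner of an object is itself a role.

```python
class AccessControl:
    """Toy model of discretionary plus role-based access control."""

    def __init__(self):
        self.owner = {}            # object -> owning role
        self.role_privileges = {}  # role -> set of (privilege, object)
        self.user_roles = {}       # user -> set of roles

    def create_object(self, obj, owner_role):
        self.owner[obj] = owner_role
        self.role_privileges.setdefault(owner_role, set()).add(("OWNERSHIP", obj))

    def grant_privilege(self, granting_role, privilege, obj, to_role):
        # Discretionary part: only the owner may grant access to the object.
        if self.owner.get(obj) != granting_role:
            raise PermissionError(f"{granting_role} does not own {obj}")
        self.role_privileges.setdefault(to_role, set()).add((privilege, obj))

    def grant_role(self, role, user):
        # Role-based part: users receive privileges only through roles.
        self.user_roles.setdefault(user, set()).add(role)

    def can(self, user, privilege, obj):
        for role in self.user_roles.get(user, ()):
            granted = self.role_privileges.get(role, set())
            if (privilege, obj) in granted or ("OWNERSHIP", obj) in granted:
                return True
        return False


acl = AccessControl()
acl.create_object("sales.orders", owner_role="data_owner")
acl.grant_privilege("data_owner", "SELECT", "sales.orders", to_role="analyst")
acl.grant_role("analyst", "alice")

print(acl.can("alice", "SELECT", "sales.orders"))  # True
print(acl.can("alice", "INSERT", "sales.orders"))  # False
print(acl.can("bob", "SELECT", "sales.orders"))    # False
```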

These are not the only reasons why Snowflake is different. There are other features that stand out; however, these are the ones we have seen our clients benefit from the most. Snowflake should be considered as a solution for any business migrating to a cloud data warehouse. One Six Solutions is a Snowflake partner and has implemented Snowflake Cloud Data Platform solutions for clients looking for a modern data architecture platform.

Friday, April 5, 2019

How to use Apache Drill

Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed to support high-performance analysis of semi-structured data. It provides the familiarity and ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments.

Why Drill

Top Reasons to Use Drill:

Get started in minutes: It takes just a few minutes to get started with Drill.
Schema-free JSON model: No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.
Query complex, semi-structured data in-situ: Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution.
Leverage standard BI tools: Drill works with standard BI tools. You can use your existing tools, such as Tableau.
Access multiple data sources: You can connect Drill out-of-the-box to file systems (local or distributed, such as S3 and HDFS), HBase, and Hive.
High performance: Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant.

High-Level Architecture

At the core of Apache Drill is the "Drillbit" service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client.

A Drillbit service can be installed and run on all of the required nodes in a Hadoop cluster to form a distributed cluster environment. When a Drillbit runs on each data node in the cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes. Drill uses ZooKeeper to maintain cluster membership and health-check information.

Though Drill works in a Hadoop cluster environment, Drill is not tied to Hadoop and can run in any distributed cluster environment. The only pre-requisite for Drill is ZooKeeper.

When you submit a Drill query, a client or an application sends the query in the form of an SQL statement to a Drillbit in the Drill cluster. A Drillbit is the process running on each active Drill node that coordinates, plans, and executes queries, as well as distributes query work across the cluster to maximize data locality.

The following image represents the communication between clients, applications, and Drillbits:

The Drillbit that receives the query from a client or application becomes the Foreman for the query and drives the entire query. A parser in the Foreman parses the SQL, applying custom rules to convert specific SQL operators into a specific logical operator syntax that Drill understands. This collection of logical operators forms a logical plan. The logical plan describes the work required to generate the query results and defines which data sources and operations to apply.

The Foreman sends the logical plan into a cost-based optimizer to optimize the order of SQL operators in a statement and read the logical plan. The optimizer applies various types of rules to rearrange operators and functions into an optimal plan. The optimizer converts the logical plan into a physical plan that describes how to execute the query.

A parallelizer in the Foreman transforms the physical plan into multiple phases, called major and minor fragments. These fragments create a multi-level execution tree that rewrites the query and executes it in parallel against the configured data sources, sending the results back to the client or application.

A major fragment is a concept that represents a phase of the query execution. A phase can consist of one or multiple operations that Drill must perform to execute the query. Drill assigns each major fragment a MajorFragmentID.

For example, to perform a hash aggregation of two files, Drill may create a plan with two major phases (major fragments) where the first phase is dedicated to scanning the two files and the second phase is dedicated to the aggregation of the data.

Drill uses an exchange operator to separate major fragments.

Major fragments do not actually perform any query tasks. Each major fragment is divided into one or multiple minor fragments (described below) that actually execute the operations required to complete the query and return results back to the client.

Each major fragment is parallelized into minor fragments. A minor fragment is a logical unit of work that runs inside a thread. A logical unit of work in Drill is also referred to as a slice. The execution plan that Drill creates is composed of minor fragments. Drill assigns each minor fragment a MinorFragmentID.

The parallelizer in the Foreman creates one or more minor fragments from a major fragment at execution time, by breaking a major fragment into as many minor fragments as it can usefully run at the same time on the cluster.

Drill executes each minor fragment in its own thread as quickly as possible based on its upstream data requirements. Drill schedules the minor fragments on nodes with data locality. Otherwise, Drill schedules them in a round-robin fashion on the existing, available Drillbits.

Minor fragments can run as root, intermediate, or leaf fragments. An execution tree contains only one root fragment. Data flows downstream from the leaf fragments to the root fragment.

The root fragment runs in the Foreman and receives incoming queries, reads metadata from tables, rewrites the queries and routes them to the next level in the serving tree. The other fragments become intermediate or leaf fragments.

Intermediate fragments start work when data is available or fed to them from other fragments. They perform operations on the data and then send the data downstream. They also pass the aggregated results to the root fragment, which performs further aggregation and provides the query results to the client or application.

The leaf fragments scan tables in parallel and communicate with the storage layer or access data on local disk. The leaf fragments pass partial results to the intermediate fragments, which perform parallel operations on intermediate results.
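The flow from leaf to root fragments resembles a tree-shaped aggregation. Here is a small Python sketch of that idea, computing an average: the three functions and the sample data are invented for illustration, and everything runs in one process rather than across Drillbits.

```python
def leaf_fragment(chunk):
    """Scan one slice of the data and return a partial (sum, count)."""
    return sum(chunk), len(chunk)


def intermediate_fragment(partials):
    """Combine partial results coming from upstream fragments."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def root_fragment(partials):
    """Perform the final aggregation and produce the query result."""
    total, count = intermediate_fragment(partials)
    return total / count


data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # four leaf scans
leaf_results = [leaf_fragment(c) for c in chunks]
intermediate_results = [
    intermediate_fragment(leaf_results[:2]),
    intermediate_fragment(leaf_results[2:]),
]
average = root_fragment(intermediate_results)
print(average)  # 50.5
```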

Drill only plans queries that have concurrent running fragments. For example, if 20 available slices exist in the cluster, Drill plans a query that runs no more than 20 minor fragments in a particular major fragment. Drill is optimistic and assumes that it can complete all of the work in parallel. All minor fragments for a particular major fragment start at the same time based on their upstream data dependency.
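The planning and scheduling rules above can be sketched roughly as follows. This is not Drill's parallelizer; the function names, the work_units idea, and the Drillbit names are assumptions made for the example.

```python
from itertools import cycle


def plan_minor_fragments(work_units, available_slices):
    """Number of minor fragments for one major fragment, capped by free slices."""
    return max(1, min(work_units, available_slices))


def assign_fragments(fragment_count, drillbits, local_drillbits=None):
    """Map each minor fragment ID to a Drillbit.

    local_drillbits maps a fragment ID to the Drillbit that holds its data;
    fragments without a local Drillbit are placed round-robin.
    """
    local_drillbits = local_drillbits or {}
    fallback = cycle(drillbits)
    return {
        fragment_id: local_drillbits.get(fragment_id) or next(fallback)
        for fragment_id in range(fragment_count)
    }


# 50 units of work but only 20 free slices: at most 20 minor fragments.
print(plan_minor_fragments(50, 20))  # 20

placement = assign_fragments(4, ["bit-a", "bit-b"], local_drillbits={1: "bit-b"})
print(placement)  # {0: 'bit-a', 1: 'bit-b', 2: 'bit-b', 3: 'bit-a'}
```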

The following image represents components within each Drillbit:

The following list describes the key components of a Drillbit:

RPC endpoint: Drill exposes a low overhead protobuf-based RPC protocol to communicate with the clients.
SQL parser: Drill uses Calcite, the open source SQL parser framework, to parse incoming queries. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query.
Storage plugin interface: Drill serves as a query layer on top of several data sources. In the context of Hadoop, Drill provides storage plugins for distributed files and HBase. Drill also integrates with Hive using a storage plugin.

Drill Session:

You can use a JDBC connection string to connect to SQLLine when Drill is installed in embedded mode or distributed mode, as shown in the following examples:

Embedded mode: ./sqlline -u jdbc:drill:drillbit=local
Distributed mode: ./sqlline -u jdbc:drill:zk=cento23,centos24,centos26:2181
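The two connection strings follow a simple pattern, which can be sketched in Python as below. The helper name drill_jdbc_url and its parameters are my own invention for illustration; only the URL formats themselves come from the examples above.

```python
from typing import Optional, Sequence


def drill_jdbc_url(zk_hosts: Optional[Sequence[str]] = None, zk_port: int = 2181) -> str:
    """Build a Drill JDBC connection string (hypothetical helper).

    With no ZooKeeper hosts we assume embedded mode; otherwise we assume
    distributed mode, where the port follows the last host in the list.
    """
    if not zk_hosts:
        return "jdbc:drill:drillbit=local"
    return "jdbc:drill:zk=" + ",".join(zk_hosts) + ":" + str(zk_port)


if __name__ == "__main__":
    print(drill_jdbc_url())
    print(drill_jdbc_url(["cento23", "centos24", "centos26"]))
```

You would pass the resulting string to sqlline with the -u option, exactly as in the examples.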

To start a Drill session, open PuTTY and type "sqlline".

You can then run simple SQL queries such as "show databases", which lists all the databases.

Workspaces:
You can create your own workspace in Drill. A workspace is simply a directory in which you can create your views and tables. You can define one or more workspaces in a storage plugin configuration.

Attribute: "workspaces" . . . "location"

Example: "location": "/Users/johndoe/mydata"

VIEW:

The CREATE VIEW command creates a virtual structure for the result set of a stored query. A view can combine data from multiple underlying data sources and provide the illusion that all of the data is from one source. You can use views to protect sensitive data, for data aggregation, and to hide data complexity from users. You can create Drill views from files in your local and distributed file systems, from Hive and HBase tables, and from existing views or any other available storage plugin data sources.

The CREATE VIEW command supports the following syntax:

CREATE [OR REPLACE] VIEW [workspace.]view_name [ (column_name [, ...]) ] AS query;

Parameters

workspace: The location where you want the view to exist. By default, the view is created in the current workspace.
view_name: The name that you give the view. The view must have a unique name. It cannot have the same name as any other view or table in the workspace.
column_name: Optional list of column names in the view. If you do not supply column names, they are derived from the query.
query: A SELECT statement that defines the columns and rows in the view.
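To see how the parameters above fit together, here is a small Python sketch that assembles a CREATE VIEW statement from its parts. The function create_view_sql is a hypothetical helper written for this post, not part of Drill, and it does no escaping or validation of the names you pass in.

```python
from typing import Optional, Sequence


def create_view_sql(
    view_name: str,
    query: str,
    workspace: Optional[str] = None,
    columns: Optional[Sequence[str]] = None,
    or_replace: bool = False,
) -> str:
    """Assemble a CREATE VIEW statement following the syntax shown above."""
    parts = ["CREATE"]
    if or_replace:
        parts.append("OR REPLACE")
    parts.append("VIEW")
    # Prefix the view name with the workspace only when one is given.
    target = workspace + "." + view_name if workspace else view_name
    if columns:
        target += " (" + ", ".join(columns) + ")"
    parts.append(target)
    parts.append("AS")
    # Drop any trailing semicolon so the statement ends with exactly one.
    parts.append(query.strip().rstrip(";").rstrip())
    return " ".join(parts) + ";"


if __name__ == "__main__":
    print(create_view_sql("dummy", "select * from hive.`default`.employee",
                          workspace="dfs.supply_view"))
```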

The following example shows a writable workspace, as defined within the storage plugin, that points to the /a/b/DrillView directory of the file system:

"workspaces": {
  "supply_view": {
    "location": "/a/b/DrillView",
    "writable": true,
    "defaultInputFormat": null
  }
}

Drill stores the view definition in JSON format with the name that you specify when you run the CREATE VIEW command, suffixed by .view.drill. For example, if you create a view named myview, Drill stores the view in the designated workspace as myview.view.drill.
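The naming rule is easy to express in code. This is a minimal Python sketch with a made-up helper name, view_file_path, that joins the workspace location and the view name with the .view.drill suffix described above.

```python
import posixpath


def view_file_path(workspace_location: str, view_name: str) -> str:
    """Return where the view definition file is expected to live.

    Hypothetical helper: it only applies the <view_name>.view.drill naming
    rule and does not check that the file actually exists.
    """
    return posixpath.join(workspace_location, view_name + ".view.drill")


if __name__ == "__main__":
    print(view_file_path("/a/b/DrillView", "myview"))
```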

For example, I have created a view named dummy in my workspace dfs.supply_view over a Hive table named employee.

Select the workspace with the command: use dfs.supply_view;
Create the view with the command: create view dummy as select * from hive.`default`.employee;

Note 1: You have to escape default with backticks because it is a reserved word in Drill.

You can check your view file in the Linux file system by going to the workspace directory that you provided in the configuration file.

Note 2: For HBase and binary tables you have to use the CONVERT_FROM function.

For example:

create view dfs.supply_view.mydrill_bang as
SELECT CONVERT_FROM(row_key, 'UTF8') AS name,
       CONVERT_FROM(customer.addr.city, 'UTF8') AS city,
       CONVERT_FROM(customer.addr.state, 'UTF8') AS state,
       CONVERT_FROM(customer.`order`.numb, 'UTF8') AS numb,
       CONVERT_FROM(customer.`order`.`date`, 'UTF8') AS `date`
FROM customer
WHERE CONVERT_FROM(customer.addr.city, 'UTF8')='bengaluru';
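Why is CONVERT_FROM needed? HBase stores values as raw bytes, so they have to be decoded before you can compare them with a string. The Python sketch below is only an analogy for what the view above does: the sample rows are invented for this example, and the decoding is done with Python's own bytes.decode, not with Drill.

```python
from typing import Dict, List


def decode_row(raw: Dict[str, bytes]) -> Dict[str, str]:
    """Decode every byte value as UTF-8, similar in spirit to CONVERT_FROM(x, 'UTF8')."""
    return {column: value.decode("utf-8") for column, value in raw.items()}


def customers_in_city(rows: List[Dict[str, bytes]], city: str) -> List[Dict[str, str]]:
    """Decode all rows, then keep the ones whose city matches."""
    decoded = [decode_row(row) for row in rows]
    return [row for row in decoded if row["city"] == city]


if __name__ == "__main__":
    # Invented sample data standing in for an HBase table.
    sample_rows = [
        {"name": b"asha", "city": b"bengaluru", "state": b"KA"},
        {"name": b"ravi", "city": b"pune", "state": b"MH"},
    ]
    print(customers_in_city(sample_rows, "bengaluru"))
```

Without the decoding step, comparing the raw bytes with the text 'bengaluru' would not match, which is the same reason the WHERE clause wraps the column in CONVERT_FROM.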

WEB DRILL

You can run queries in the Drill Web UI as well. The Drill Web UI is one of several client interfaces that you can use to access Drill.

Accessing the Web UI

To access the Drill Web UI, enter the URL appropriate for your Drill configuration. The following list describes the URLs for various Drill configurations:

http://<IP address or host name>:8047
Use this URL when HTTPS support is disabled (the default).
https://<IP address or host name>:8047
Use this URL when HTTPS support is enabled.
http://localhost:8047
Use this URL when running Drill in embedded mode (./drill-embedded).
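The URL choice above comes down to the scheme and the host. Here is a minimal Python sketch; the helper name drill_web_ui_url and its parameters are assumptions made for this example, while port 8047 and the http/https rule come from the list above.

```python
def drill_web_ui_url(host: str = "localhost", https_enabled: bool = False, port: int = 8047) -> str:
    """Build the Drill Web UI address (hypothetical helper).

    The default host of localhost corresponds to embedded mode.
    """
    scheme = "https" if https_enabled else "http"
    return scheme + "://" + host + ":" + str(port)


if __name__ == "__main__":
    print(drill_web_ui_url())
    print(drill_web_ui_url("centos24", https_enabled=True))
```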

When you open the Drill Web UI, it looks like this.

Now click on Query, and a window will open like this.

Now write the query and submit it, and you will see the results like this.

You can also check the performance of your query by going to the profiles. A profile is a summary of metrics collected for each query that Drill executes. Query profiles provide information that you can use to monitor and analyze query performance. When Drill executes a query, Drill writes the profile of each query to disk, which is either the local filesystem or a distributed file system, such as HDFS.

You can view query profiles in the Profiles tab of the Drill Web UI. When you select the Profiles tab, you see a list of the last 100 queries that ran or are currently running in the cluster.

You must click on a query to see its profile.

The profile holds all the information for the query, such as the physical plan and the visualized plan. It also contains details such as the elapsed time spent on Hive, the total number of fragments, and the total cost.

By reading all this, you can optimise your query and rewrite it in an optimised way.

That's all.

Thanks for reading!

Bye!