Friday, April 5, 2019

How to use Apache Drill

Drill is an open-source Apache SQL query engine for Big Data exploration. Drill is designed to support high-performance analysis of semi-structured data, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments.

Why Drill
Top reasons to use Drill:
  1. Get started in minutes: It takes just a few minutes to get started with Drill.
  2. Schema-free JSON model: No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.
  3. Query complex, semi-structured data in-situ: Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. There is no need to flatten or transform the data prior to or during query execution.
  4. Leverage standard BI tools: Drill works with standard BI tools, so you can keep using your existing tools, such as Tableau.
  5. Access multiple data sources: You can connect Drill out-of-the-box to file systems (local or distributed, such as S3 and HDFS), HBase, and Hive.
  6. High performance: Drill is designed from the ground up for high throughput and low latency. It doesn't use a general-purpose execution engine like MapReduce, Tez, or Spark. As a result, Drill is both flexible (schema-free JSON model) and performant.
High-Level Architecture
At the core of Apache Drill is the "Drillbit" service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client.
A Drillbit service can be installed and run on all of the required nodes in a Hadoop cluster to form a distributed cluster environment. When a Drillbit runs on each data node in the cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes. Drill uses ZooKeeper to maintain cluster membership and health-check information.
Though Drill works in a Hadoop cluster environment, Drill is not tied to Hadoop and can run in any distributed cluster environment. The only pre-requisite for Drill is ZooKeeper.
When you submit a Drill query, a client or an application sends the query in the form of an SQL statement to a Drillbit in the Drill cluster. A Drillbit is the process running on each active Drill node that coordinates, plans, and executes queries, as well as distributes query work across the cluster to maximize data locality.
The following image represents the communication between clients, applications, and Drillbits:


The Drillbit that receives the query from a client or application becomes the Foreman for the query and drives the entire query. A parser in the Foreman parses the SQL, applying custom rules to convert specific SQL operators into a specific logical operator syntax that Drill understands. This collection of logical operators forms a logical plan. The logical plan describes the work required to generate the query results and defines which data sources and operations to apply.
The Foreman sends the logical plan into a cost-based optimizer to optimize the order of SQL operators in a statement and read the logical plan. The optimizer applies various types of rules to rearrange operators and functions into an optimal plan. The optimizer converts the logical plan into a physical plan that describes how to execute the query.


A parallelizer in the Foreman transforms the physical plan into multiple phases, called major and minor fragments. These fragments create a multi-level execution tree that rewrites the query and executes it in parallel against the configured data sources, sending the results back to the client or application.
A major fragment is a concept that represents a phase of the query execution. A phase can consist of one or multiple operations that Drill must perform to execute the query. Drill assigns each major fragment a MajorFragmentID.
For example, to perform a hash aggregation of two files, Drill may create a plan with two major phases (major fragments) where the first phase is dedicated to scanning the two files and the second phase is dedicated to the aggregation of the data.

Drill uses an exchange operator to separate major fragments. 
Major fragments do not actually perform any query tasks. Each major fragment is divided into one or multiple minor fragments (discussed in the next section) that actually execute the operations required to complete the query and return results back to the client.
Each major fragment is parallelized into minor fragments. A minor fragment is a logical unit of work that runs inside a thread. A logical unit of work in Drill is also referred to as a slice. The execution plan that Drill creates is composed of minor fragments. Drill assigns each minor fragment a MinorFragmentID.
The parallelizer in the Foreman creates one or more minor fragments from a major fragment at execution time, by breaking a major fragment into as many minor fragments as it can usefully run at the same time on the cluster.
Drill executes each minor fragment in its own thread as quickly as possible based on its upstream data requirements. Drill schedules the minor fragments on nodes with data locality. Otherwise, Drill schedules them in a round-robin fashion on the existing, available Drillbits.
Minor fragments can run as root, intermediate, or leaf fragments. An execution tree contains only one root fragment. Data flows downstream from the leaf fragments to the root fragment.
The root fragment runs in the Foreman and receives incoming queries, reads metadata from tables, rewrites the queries and routes them to the next level in the serving tree. The other fragments become intermediate or leaf fragments.
Intermediate fragments start work when data is available or fed to them from other fragments. They perform operations on the data and then send the data downstream. They also pass the aggregated results to the root fragment, which performs further aggregation and provides the query results to the client or application.
The leaf fragments scan tables in parallel and communicate with the storage layer or access data on local disk. The leaf fragments pass partial results to the intermediate fragments, which perform parallel operations on intermediate results.
Drill only plans queries that have concurrent running fragments. For example, if 20 available slices exist in the cluster, Drill plans a query that runs no more than 20 minor fragments in a particular major fragment. Drill is optimistic and assumes that it can complete all of the work in parallel. All minor fragments for a particular major fragment start at the same time based on their upstream data dependency.
The following image represents components within each Drillbit:
(Image: Drill query flow)
The following list describes the key components of a Drillbit:
  • RPC endpoint: Drill exposes a low overhead protobuf-based RPC protocol to communicate with the clients. 
  • SQL parser: Drill uses Calcite, the open source SQL parser framework, to parse incoming queries. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query.
  • Storage plugin interface: Drill serves as a query layer on top of several data sources. In the context of Hadoop, Drill provides storage plugins for distributed files and HBase. Drill also integrates with Hive using a storage plugin.

Drill Session:
You can use a jdbc connection string to connect to SQLLine when Drill is installed in embedded mode or distributed mode, as shown in the following examples:
  • Embedded mode: ./sqlline -u jdbc:drill:drillbit=local
  • Distributed mode: ./sqlline -u jdbc:drill:zk=centos23,centos24,centos26:2181

To create a Drill session, open an SSH client such as PuTTY, connect to a node where Drill is installed, and run "sqlline".

You can run simple SQL queries such as "show databases", which lists all of the available databases (schemas).
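For example, here are a few exploratory queries you might run from SQLLine; the schema names shown (dfs, hive) are just common defaults and depend on your storage plugin configuration:

SHOW DATABASES;
-- switch to a schema, then list the tables and views it contains
USE dfs;
SHOW TABLES;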

Workspaces:
You can create your own workspace in Drill. A workspace is simply a directory in which you can create your views and tables. You can define one or more workspaces in a storage plugin configuration.
In the dfs storage plugin configuration, each workspace is defined under the "workspaces" attribute, and its directory is set by the nested "location" attribute.
Example: "location": "/Users/johndoe/mydata"

VIEW:
The CREATE VIEW command creates a virtual structure for the result set of a stored query. A view can combine data from multiple underlying data sources and provide the illusion that all of the data is from one source. You can use views to protect sensitive data, for data aggregation, and to hide data complexity from users. You can create Drill views from files in your local and distributed file systems, from Hive and HBase tables, as well as from existing views or any other available storage plugin data sources.
The CREATE VIEW command supports the following syntax:
CREATE [OR REPLACE] VIEW [workspace.]view_name [ (column_name [, ...]) ] AS query;

Parameters
  • workspace: The location where you want the view to exist. By default, the view is created in the current workspace.
  • view_name: The name that you give the view. The view must have a unique name. It cannot have the same name as any other view or table in the workspace.
  • column_name: Optional list of column names in the view. If you do not supply column names, they are derived from the query.
  • query: A SELECT statement that defines the columns and rows in the view.


The following example shows a writable workspace named supply_view, defined within the dfs storage plugin and pointing at the /a/b/DrillView directory of the file system:

"workspaces": {
   "supply_view": {
     "location": "/a/b/DrillView",
     "writable": true,
     "defaultInputFormat": null
   }
 }
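
Once a workspace like this is saved in the storage plugin, you can switch to it from SQLLine and list its contents; a minimal check might look like this:

USE dfs.supply_view;
-- list the files and directories in the workspace location
SHOW FILES;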

Drill stores the view definition in JSON format with the name that you specify when you run the CREATE VIEW command, suffixed by .view.drill. For example, if you create a view named myview, Drill stores the view in the designated workspace as myview.view.drill.

For example, I have created a view named dummy in my workspace dfs.supply_view over a Hive table named employee.

  1. Select the workspace with the command use dfs.supply_view;
  2. Create the view with create view dummy as select * from hive.`default`.employee; (the complete statements are shown below).

Note 1: You have to escape default with backticks because it is a reserved word in Drill.
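
Put together, the whole session for this example looks roughly like this, followed by a quick query to verify the view (the table and view names are the ones used above):

USE dfs.supply_view;
CREATE VIEW dummy AS SELECT * FROM hive.`default`.employee;
-- quick check that the view returns data
SELECT * FROM dfs.supply_view.dummy LIMIT 10;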

Here you can check the view file in the Linux file system by going to the workspace directory that you provided in the storage plugin configuration.

Note 2: For HBase and other binary tables you have to use the CONVERT_FROM function to decode the binary values.

For example:

create view dfs.supply_view.mydrill_bang as
SELECT CONVERT_FROM(row_key, 'UTF8') AS name,
       CONVERT_FROM(customer.addr.city, 'UTF8') AS city,
       CONVERT_FROM(customer.addr.state, 'UTF8') AS state,
       CONVERT_FROM(customer.`order`.numb, 'UTF8') AS numb,
       CONVERT_FROM(customer.`order`.`date`, 'UTF8') AS `date`
FROM customer
WHERE CONVERT_FROM(customer.addr.city, 'UTF8') = 'bengaluru';
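
Once created, the view can be queried like any ordinary table; for instance, a quick check using the columns defined above:

SELECT name, city, state FROM dfs.supply_view.mydrill_bang;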

WEB DRILL
You can run queries in the Drill Web UI as well. The Drill Web UI is one of several client interfaces that you can use to access Drill.

Web UI Admin View
Accessing the Web UI
To access the Drill Web UI, enter the URL appropriate for your Drill configuration. The following list describes the URLs for various Drill configurations:
  • http://<IP address or host name>:8047 - use this URL when HTTPS support is disabled (the default).
  • https://<IP address or host name>:8047 - use this URL when HTTPS support is enabled.
  • http://localhost:8047 - use this URL when running Drill in embedded mode (./drill-embedded).
On accessing the Drill Web UI, it looks like this.

Now click on Query, and a window will pop up like this.



Now write the query and you will see the results like this.


You can also check the performance of your query by going to the Profiles page. A profile is a summary of metrics collected for each query that Drill executes. Query profiles provide information that you can use to monitor and analyze query performance. When Drill executes a query, Drill writes the profile of each query to disk, which is either the local file system or a distributed file system, such as HDFS.
You can view query profiles in the Profiles tab of the Drill Web UI. When you select the Profiles tab, you see a list of the last 100 queries that ran or are currently running in the cluster.

You must click on a query to see its profile.
The profile holds all the information for the query, such as the physical plan and the visualized plan. It also contains details like the elapsed time, the total number of fragments, and the total cost.

By reading all of this, you can analyze your query and re-write it in a more optimised way.
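
If you prefer the command line, you can also inspect the plan that Drill generates before running a query by using EXPLAIN PLAN FOR from SQLLine; for example, reusing the view created earlier:

EXPLAIN PLAN FOR SELECT * FROM dfs.supply_view.dummy;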

That's all.

Thanks for reading!

Bye!











Monday, April 1, 2019

Gradient Descent


Hello, and welcome to this blog on Gradient Descent.


Derivative:

As the name suggests, a derivative is a function that derives from another function.
Let's start with an example. Imagine you are driving on the highway; as time goes by, you mark your position along the highway, filling a table of values as a function of time. If your speed is 60 miles an hour, every minute your position will increase by one mile.




Let's define the function x(t) to indicate your position as a function of time. The derivative of this function is the rate of change in position with respect to time.



So the speed will be the derivative of position with respect to time, v(t) = dx(t)/dt, which in this example is a constant 60 miles per hour.

At each point along the curve the derivative is the value of the slope of the curve itself. We can calculate the approximate value of the slope by the method of finite differences. The value of the derivative is negative when the slope of the original curve is downhill and it is positive when the slope is uphill. 
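
As a rough sketch of the finite-difference idea applied to the position function x(t) (h here is just a small time step introduced for illustration):

v(t) ≈ ( x(t + h) - x(t) ) / h

The smaller the step h, the closer this ratio gets to the true derivative, and its sign tells you whether the curve is going uphill (positive) or downhill (negative).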

Gradient:

When our function has more than one input, we need to specify which variable we are using for derivation. For example, let's say we are measuring our elevation on a mountain as a function of our position. Our GPS position is defined by two variables, longitude and latitude.
elevation = f(longitude, latitude)
And therefore, the elevation depends on two variables. Let's change the variable names to shorter ones. Let's call the elevation y, and the two variables x1 and x2:
y = f(x1, x2)


We can calculate the rate of change in elevation with respect to x1, written ∂y/∂x1, and the rate of change with respect to x2, written ∂y/∂x2.
These are called partial derivatives because we only consider the change with respect to one variable at a time. Notice also that we use a different symbol, the curly ∂, to indicate that these are partial derivatives.
In the two-dimensional plane of x1 and x2, the direction of the most abrupt change will be a two-dimensional vector whose components are the partial derivatives with respect to each variable.


We call this vector the gradient, and we indicate it with an inverted triangle, ∇, which is also called del or nabla:
∇y = ( ∂y/∂x1 , ∂y/∂x2 )

The gradient is an operation that takes a function of multiple variables and returns a vector. The components of this vector are all the partial derivatives of the function. Since the partial derivatives are functions of all variables, the gradient, too, is a function of all variables.

Back-propagation Intuition:

Let's say we have a function of only one variable called w. For every value on the horizontal axis, the function associates a value on the vertical axis. Let's say we're sitting at a particular point like in the figure.

Let's also assume that we do not know the function f(w) at every possible point. We only know it near where we are. We want to move in the direction of decreasing f(w), but we can only use local information. How do we decide where to go? As we saw when we talked about descending from a hill, the derivative indicates the slope at each point, so we can calculate the derivative where we are, and then change our position by subtracting the value of the derivative from the value of our starting position w. In other words, we can take one step following the rule

w_new = w - df/dw

If we are sitting at a point w where the slope of the curve is negative, then the quantity -df/dw is positive, so the value of w will increase and we move towards the right on the horizontal axis. The corresponding value on the vertical axis will then decrease, so we have successfully moved towards a lower value of the function f(w).




This way of looking for the minimum of a function is called gradient descent, and it's the idea behind back-propagation. Given a function, we can always move towards its minimum by following the path indicated by its derivative, or, in the case of multiple variables, indicated by the gradient. As you know by now, for a neural network we define a cost function that depends on the values of the parameters, and, as you also know, we find the values of the parameters by minimizing the cost with gradient descent. All we are really doing is taking the cost function, calculating its partial derivatives with respect to each parameter, and then using the update rule we just described to decrease the cost by updating the parameters. We do this by subtracting the value of the gradient from each of the parameters (in other words, by adding the negative gradient).
This is what's called a parameter update.
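
Written out for a single parameter w_i of the network (here C denotes the cost function, and the step-size factor α, called the learning rate, is introduced in the next section):

w_i_new = w_i_old - α * ∂C/∂w_i

Each parameter gets its own partial derivative, and all of them are updated together in one parameter update.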

 Learning Rate:

As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long:



Same U-shaped curve. Lots of points are very close to each other and their trail is making extremely slow progress towards the bottom of the U.

Learning rate is too small.
Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:



Same U-shaped curve. This one contains very few points. The trail of points jumps clean across the bottom of the U and then jumps back over again.

Learning rate is too large.
There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.



Same U-shaped curve. The trail of points gets to the minimum point in about eight steps.


Learning rate is just right.

Reducing Loss: Stochastic Gradient Descent

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we've assumed that the batch has been the entire data set. But in practice, data sets often contain billions or even hundreds of billions of examples. Consequently, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time to compute.
A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme--it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
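
As a rough sketch of the mini-batch update rule (B is the randomly chosen mini-batch, |B| its size, L the per-example loss, w the parameters, and α the learning rate; the notation is mine, not from any particular library):

w_new = w_old - α * (1/|B|) * sum over examples x in B of ∂L(x; w_old)/∂w

With |B| = 1 this reduces to plain SGD, and with B equal to the whole data set it is ordinary full-batch gradient descent.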
