Monday, December 21, 2015

HIVE TUTORIAL

About Hive™

Although Pig can be quite a powerful and simple language to use, the downside is that it’s something new to learn and master. Some folks at Facebook developed a runtime Hadoop® support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.
  Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. Be aware that HQL is limited in the commands it understands, but it is still quite useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
  For anyone with a SQL or relational database background, this section will look very familiar to you. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
  Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. The second is that Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
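To make the SQL resemblance concrete, here is a minimal, hypothetical HQL example (the table and columns are illustrative only, not part of this tutorial). Any SQL developer can read it, but Hive executes it as batch MapReduce jobs rather than as an interactive lookup:

-- Hypothetical table for illustration; the syntax is standard-looking SQL,
-- but the SELECT below is compiled into MapReduce jobs by the Hive service.
create table page_views (user_id STRING, url STRING, view_time BIGINT);

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;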
If you're interested in SQL on Hadoop, then in addition to Hive, IBM offers Big SQL, which makes accessing Hive datasets faster and more secure. Check out our videos below for a quick overview of Hive and Big SQL.

To continue with the tutorial, download the data from the link below:

http://seanlahman.com/files/database/lahman591-csv.zip


Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login id: hue
Password: 1111

Starting Hive:

Click on the top left corner of your page, then click on the second icon, which says Beeswax (Hive UI).
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload the files into this directory are available in my previous blogs.
Once the files are uploaded, they should look like this:

[Screenshot: uploaded files in the Hue file browser]


Writing Queries

Click on the query editor to go to the query page.

[Screenshot: the Beeswax query editor]

The first step is to create a temp_batting table, using the query below.

create table temp_batting (col_value STRING);

Once you execute the query, you will see the results as shown in the screenshot below.


[Screenshot: temp_batting data display (https://abhigargesh.files.wordpress.com/2015/12/temp_batting-data-display.jpg)]
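For temp_batting to display data as in the screenshot, the uploaded file also has to be loaded into the table. A minimal sketch of that step, assuming Batting.csv was uploaded to /user/hue as described in the upload step:

-- Load the uploaded CSV into temp_batting; the path is assumed from the
-- upload step above, so adjust it if your file lives elsewhere.
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;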



Now we create a new table called batting, which will contain three columns: player_id, year and runs.

create table batting (player_id STRING, year INT, runs INT);

[Screenshot: creating the batting table (https://abhigargesh.files.wordpress.com/2015/12/creating-batting-table1.jpg)]
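If you want to confirm the new table's schema before filling it, a quick check with a standard Hive command:

-- Lists the column names and types of the batting table.
DESCRIBE batting;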

Next, extract the data from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query.

The first line overwrites the batting table with the query's result; the SELECT lines extract the player_id, year and runs fields from the temp_batting rows.

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) runs
from temp_batting;
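To see how the pattern behaves, here is a hedged illustration on a made-up sample row (recent Hive versions allow a SELECT without a FROM clause, so you can try this directly in the editor):

-- For a row like 'aardsda01,2004,1,...', the pattern '^(?:([^,]*)\,?){n}'
-- matches "a field followed by an optional comma" n times, and capture
-- group 1 keeps the n-th field: {1} gives the player id, {2} the year,
-- and {9} the runs column of the CSV.
SELECT regexp_extract('aardsda01,2004,1', '^(?:([^,]*)\,?){2}', 1);
-- returns '2004'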

Once the insert query is executed, you can see the job status by entering the address 127.0.0.1:8088 in your web browser.

[Screenshot: application status page at 127.0.0.1:8088]


Next, group the data by year and take the maximum runs for each year with the query below.


SELECT year, max(runs) FROM batting GROUP BY year;


[Screenshot: group-by-year query]

OUTPUT

[Screenshot: maximum runs per year]
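Beeswax shows results a page at a time, so if you want to check the most recent years first, a minimal variant of the same query sorts and limits the output:

-- Same aggregation, newest years first, trimmed to ten rows.
SELECT year, max(runs) AS max_runs
FROM batting
GROUP BY year
ORDER BY year DESC
LIMIT 10;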


The next step is to figure out which player scored the maximum runs in each year; the group-by above gives the maximum but not who scored it. We can do this with the command below, which joins batting back to the per-year maximums. (If two players tie for a year's maximum, the join returns a row for each of them.)

SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

[Screenshot: final join query]

FINAL OUTPUT

[Screenshot: highest runs per year with player ids]


Thanks

Sunday, December 6, 2015

Using APACHE PIG in HADOOP

Last time we saw how to write our first Hadoop program; now let's execute our first Pig program in Hadoop.

Before going into the program, let us learn what Pig is.

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn't require direct MapReduce programming.
The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Managers of the Apache Software Foundation's Pig project position the language as being partway between declarative SQL and the procedural Java approach used in MapReduce applications. Proponents say, for example, that data joins are easier to create with Pig Latin than with Java. However, through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.
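To make the join claim concrete, here is a minimal, hypothetical Pig Latin join; the file names and fields are illustrative only. The same operation in raw MapReduce would require a custom mapper and reducer pair:

-- Two small illustrative relations joined on a shared id field.
users = LOAD 'users.csv' USING PigStorage(',') AS (id:chararray, name:chararray);
visits = LOAD 'visits.csv' USING PigStorage(',') AS (id:chararray, url:chararray);
joined = JOIN users BY id, visits BY id;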
Apache Pig grew out of work at Yahoo Research and was first formally described in a paper published in 2008. Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data. That omnivorous view of data likely had a hand in the decision to name the environment for the common barnyard animal. It also extends to Pig's take on application frameworks; while the technology is primarily associated with Hadoop, it is said to be capable of being used with other frameworks as well.


Objective:
We are going to read in a baseball statistics file and compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and contains over 90,000 rows. Once we have the highest runs, we will extend the script to translate a player id field into the first and last names of the players (a sketch of that extension appears near the end of this post).

To follow along with the blog, the data can be downloaded from the following link.
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

As in our previous blog, run the Hortonworks sandbox from VirtualBox, and once it is running, open the following URL to work with Apache Pig: http://127.0.0.1:8000

Login details:
Login: hue
Password: 1111
You get to the Hue screen as shown below; go to the file browser.



Once you have opened the Hue screen, navigate to the file browser and upload the two CSV files.



Once the files are uploaded, click on the Pig icon at the top left corner of your screen to go to the Pig script page.

We need to write the following code and save it.
 
batting = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1 > 0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
The explanation of the above code is as follows:
  1. We load the data from Batting.csv using a comma delimiter.
  2. We filter out the header row (its year field is not a number greater than 0).
  3. We project the playerID, year and runs fields from each remaining row.
  4. We group the rows by year and compute the maximum runs for each year.
  5. We join the per-year maximums back to the runs data to recover the player id, and DUMP the result.


Once the script is ready, hit the Execute button to start the job; the page will show the running job's status.


Job Status
To access this page, either click the job id displayed at the bottom of the page once the job is running, or go to Query History at the top left, beside My Scripts.





Once the job succeeds, you will get the following screen.


The output will look like the one below.
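The objective above also promised translating the player id into the players' first and last names. Here is a hedged sketch of that extension, assuming Master.csv from the same Lahman archive has been uploaded alongside Batting.csv; the column ordinals used for playerID, nameFirst and nameLast are placeholders, so verify them against the file's header before running:

-- Master.csv is assumed to carry the player names; $1, $13 and $14 are
-- assumed positions for playerID, nameFirst and nameLast, so check the
-- CSV header and adjust these ordinals before running.
master = LOAD 'Master.csv' USING PigStorage(',');
names = FOREACH master GENERATE $1 AS playerID, $13 AS nameFirst, $14 AS nameLast;
join_names = JOIN join_data BY playerID, names BY playerID;
final_data = FOREACH join_names GENERATE $0 AS year, $4 AS nameFirst, $5 AS nameLast, $2 AS runs;
DUMP final_data;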


Conclusion & Learning:
With this, we have completed our task of executing the Pig script and obtaining, for each year from 1871 to 2011, the player with the highest runs.