Monday, December 21, 2015

HIVE TUTORIAL

About Hive™

Although Pig can be quite a powerful and simple language to use, the downside is that it’s something new to learn and master. Some folks at Facebook developed a runtime Hadoop® support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.
  Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. Be aware that HQL is limited in the commands it understands, but it is still quite useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
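  For example, a simple aggregate query in HQL reads just like its SQL equivalent. The sales table and its columns below are made up purely for illustration:

-- A hypothetical table, shown only to illustrate how familiar HQL looks
SELECT customer_id, sum(amount) AS total_amount
FROM sales
GROUP BY customer_id;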
  For anyone with a SQL or relational database background, this section will look very familiar to you. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
  Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, so you can expect Hive queries to have very high latency (many minutes). This means that Hive is not appropriate for applications that need very fast response times, as you would expect with a database such as DB2. The second is that Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
If you're interested in SQL on Hadoop, then in addition to Hive, IBM offers Big SQL, which makes accessing Hive datasets faster and more secure. Check out our videos below for a quick overview of Hive and Big SQL.

To continue with the tutorial, download the sample data from the link below:

http://seanlahman.com/files/database/lahman591-csv.zip


Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your web browser.
Login Id: Hue
Pwd: 1111

Starting Hive:

Click the menu in the top-left corner of the page, then click the second icon, labeled Beeswax (Hive UI).
Uploading Data
Data is uploaded into the user/hue directory of the HDFS file system. The steps to upload the files into this directory are covered in my previous blog posts.
Once the files are uploaded, they should look like this:

[Screenshot: the uploaded files in the Hue file browser]


Writing Queries

Click on the query editor to go to the query page.

[Screenshot: the Beeswax query editor]

The first step is to create a temp_batting table with the query below.


create table temp_batting (col_value STRING);


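The create statement only defines an empty table; the CSV you uploaded earlier still has to be loaded into it. A likely load-and-check step, assuming the batting statistics file landed in user/hue as Batting.csv (the exact path is an assumption; adjust it to match your upload):

-- Path is an assumption; point this at the file you actually uploaded
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;

-- Quick check: each row should now hold one raw CSV line
SELECT * FROM temp_batting LIMIT 10;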
Once you execute these queries, you will see the results as shown in the screenshot below.


https://abhigargesh.files.wordpress.com/2015/12/temp_batting-data-display.jpg



Now we create a new table called batting, which contains three columns: player_id, year, and runs (the number of runs).

create table batting (player_id STRING, year INT, runs INT);

https://abhigargesh.files.wordpress.com/2015/12/creating-batting-table1.jpg
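If you want to double-check the new table's schema before filling it, Hive's DESCRIBE command gives a quick, optional sanity check:

-- List the columns and types of the batting table
DESCRIBE batting;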

Next, extract the data from temp_batting and copy it into batting. We do this with regexp_extract patterns in a multi-line query.

The first line overwrites any existing data in the batting table. The next three lines extract the player_id, year, and runs fields from each row of temp_batting.

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) runs
from temp_batting;
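If the pattern looks opaque, you can test regexp_extract on a string literal: the {N} count repeats the group, so the capture keeps the Nth comma-separated field. A sketch with a made-up sample row (depending on your Hive version, you may need to add a FROM clause to run a bare SELECT):

-- Should return '2012', the second comma-separated field of the made-up row
SELECT regexp_extract('aardsda01,2012,1,SEA', '^(?:([^,]*)\,?){2}', 1);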

Once the query is executed, you can see the job status by entering the address 127.0.0.1:8088 in your web browser.

[Screenshot: job status page at 127.0.0.1:8088 (app_status.jpg)]


Next, group the data by year and take the maximum runs for each year with the query below.


SELECT year, max(runs) FROM batting GROUP BY year;


[Screenshot: the group-by-year query in the editor (group by year.jpg)]

OUTPUT

[Screenshot: maximum runs for each year (max runs_year1.jpg)]


The next step is to figure out which player scored the maximum runs in each year. We can do this by joining batting back to the per-year maximums with the query below.

SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

[Screenshot: the final join query in the editor (final join.jpg)]

FINAL OUTPUT
[Screenshot: final output showing year, player_id, and maximum runs (max runs_year1.jpg)]
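If you would like the final output listed chronologically, an optional variation of the same join simply adds a sort:

-- Same join as above, with the winners ordered by year
SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs)
ORDER BY a.year;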


Thanks
