Monday, December 21, 2015

HIVE TUTORIAL

About Hive™

Although Pig can be quite a powerful and simple language to use, the downside is that it’s something new to learn and master. Some folks at Facebook developed a runtime Hadoop® support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.
  Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; now you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
  For anyone with a SQL or relational database background, this section will look very familiar to you. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
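For example, here is a minimal sketch of connecting to Hive over JDBC; the driver class and URL pattern are the standard HiveServer2 ones, but the host, port, database and credentials below are placeholder assumptions you would adjust to your own cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver; host, port and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://127.0.0.1:10000/default", "hue", "");
        Statement stmt = conn.createStatement();
        // List the tables visible to this user, just to prove the connection works.
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        conn.close();
    }
}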
  Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
If you're interested in SQL on Hadoop, then in addition to Hive, IBM offers Big SQL, which makes accessing Hive datasets faster and more secure.

To continue with the tutorial, download the data from the link below:

http://seanlahman.com/files/database/lahman591-csv.zip


Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your web browser.
Login Id : hue
Pwd : 1111

Starting Hive:

Click the menu at the top left corner of the page and then click the second icon, Beeswax (Hive UI).
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps for uploading files into this directory are available in my previous blog posts.
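If you prefer the command line to the File Browser, the same upload can be done from the sandbox shell with hdfs dfs; the file name and the /user/hue path below simply mirror the steps above.

hdfs dfs -put Batting.csv /user/hue/
hdfs dfs -ls /user/hue/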
Once the files are uploaded they should look like this

beeswax


Writing Queries

You click on the query editor to go to the query page.

Queries

The first step is to create a temp_batting table and load the Batting.csv data into it, as sketched below.


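A minimal version of that query, following the standard Hortonworks Lahman baseball sample (the /user/hue path is an assumption based on the upload step above), looks like this:

CREATE TABLE temp_batting (col_value STRING);
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;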


Once  you execute the query you will see the results as shown in the screenshot below.


https://abhigargesh.files.wordpress.com/2015/12/temp_batting-data-display.jpg



Now we create a new table called batting, which contains three columns: player_id, year, and runs.

create table batting (player_id STRING, year INT, runs INT);

https://abhigargesh.files.wordpress.com/2015/12/creating-batting-table1.jpg

Next, extract the data from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query.

The first line overwrites the empty batting table; the following lines use regexp_extract to pull the player_id, year and runs fields out of each row of temp_batting.

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting
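Before moving on, an optional sanity check is to look at a few of the extracted rows:

SELECT * FROM batting LIMIT 10;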

Once the query is executed, you can see the job status by entering 127.0.0.1:8088 in your web browser.

app_status.jpg 


Next, group the data by year and find the maximum runs for each year with the query below.


SELECT year, max(runs) FROM batting GROUP BY year;


group by year.jpg

OUTPUT

max runs_year1.jpg


The next step is to find which player scored those maximum runs in each year. We can do this with the join below.

SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs) ;

final join.jpg

FINAL OUTPUT
max runs_year1.jpg


Thanks

Sunday, December 6, 2015

Using APACHE PIG in HADOOP

Last time we saw how to write our first Hadoop program; now let's execute our first Pig program in Hadoop.

Before going into the program, let us learn what Pig is.

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn't require direct MapReduce programming.
The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Managers of the Apache Software Foundation's Pig project position the language as being partway between declarative SQL and the procedural Java approach used in MapReduce applications. Proponents say, for example, that data joins are easier to create with Pig Latin than with Java. However, through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.
Apache Pig grew out of work at Yahoo Research and was first formally described in a paper published in 2008. Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data. That omnivorous view of data likely had a hand in the decision to name the environment for the common barnyard animal. It also extends to Pig's take on application frameworks; while the technology is primarily associated with Hadoop, it is said to be capable of being used with other frameworks as well.


Objective : 
We are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.

To follow along with the blog, the data can be downloaded from the following link.
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

As in our previous blog, start the Hortonworks sandbox from VirtualBox and, once it is running, open the following URL to work with Apache Pig: http://127.0.0.1:8000

Login Details : 
Login : hue
password : 1111
You get to the Hue screen as shown below; go to the File Browser.



Once you have opened the Hue screen, navigate to the File Browser and upload the two CSV files.



Once the files are uploaded click on the PIG icon on the top left corner of your screen to go to the PIG script page.

We need to write the following code and save it.
 
batting = load 'Batting.csv' using PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;
The explanation of the above code is as follows:
  1. We load the data using a comma delimiter.
  2. We filter out the header row (rows whose year field is not greater than 0).
  3. For each remaining row we keep only the playerID, year and runs fields.
  4. We group the runs by the year field.
  5. We compute the maximum runs per year and join it back to the runs data to obtain the player ID of the highest-scoring player.
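The objective above also mentions translating the player id into first and last names. A rough sketch of that extension is below; it assumes Master.csv has been uploaded next to Batting.csv, and the column positions used ($0 for playerID, $13 and $14 for the first and last name) are assumptions, so check the header of your copy of Master.csv and adjust them.

names = LOAD 'Master.csv' USING PigStorage(',');
-- keep only the columns we need from Master.csv (positions are assumptions)
player_names = FOREACH names GENERATE $0 AS playerID, $13 AS nameFirst, $14 AS nameLast;
-- attach the names to the (year, playerID, runs) result computed above
named_runs = JOIN join_data BY playerID, player_names BY playerID;
final_data = FOREACH named_runs GENERATE $0 AS year, $4 AS nameFirst, $5 AS nameLast, $2 AS runs;
DUMP final_data;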


Once the script is ready, hit the Execute button to start the job; the page will show the running job's status.


Job Status
To access this page you can either click the job id displayed at the bottom of the page when it says the job is running, or go to Query History at the top left, beside My Scripts.





Once it succeeds you will get the following screen.


The output will look like the following:


Conclusion & Learning:
With this we have completed our task of executing the Pig script and obtaining, for each year from 1871 to 2011, the player with the highest runs.



Saturday, November 21, 2015

Executing First Program in HADOOP

Before executing our first program, let's learn what Hadoop is.


WHAT IS HADOOP?????


Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.  

What are the benefits of HADOOP?


  • Computing power. Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.
  • Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.
  • Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
  • Scalability. You can easily grow your system simply by adding more nodes. Little administration is required


What components make up Hadoop?

Currently, four core modules are included in the basic framework from the Apache Foundation:
  • Hadoop Common – the libraries and utilities used by other Hadoop modules.
  • Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
  • MapReduce – a software programming model for processing large sets of data in parallel.
  • YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator.)
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:

  • Pig – a platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
  • Hive – a data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)
  • HBase – a nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
  • HCatalog – a table and storage management layer that helps users share and access data.
  • Ambari – a web interface for managing, configuring and testing Hadoop services and components.
  • Cassandra – A distributed database system.
  • Chukwa – a data collection system for monitoring large distributed systems.
  • Flume – software that collects, aggregates and moves large amounts of streaming data into HDFS.
  • Oozie – a Hadoop job scheduler.
  • Sqoop – a connection and transfer mechanism that moves data between Hadoop and relational databases.
  • Spark – an open-source cluster computing framework with in-memory analytics.
  • Solr – a scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
  • Zookeeper – an application that coordinates distributed processes.
In addition, there are commercial distributions of Hadoop, including Cloudera, Hortonworks and MapR. With distributions from software vendors, you pay for their version of the framework and receive additional software components, tools, training, documentation and other services.



NOW LET'S EXECUTE OUR FIRST PROGRAM IN HADOOP

The basic step in learning analytics is to get your hands dirty.

Objective: We have to execute a Java-coded MapReduce task over three large text files and count the frequency of the words appearing in those files, using Hadoop under the Hortonworks Data Platform installed on Oracle VirtualBox.


Framework:

First install Oracle VirtualBox and then install the Hadoop sandbox in it; the installation will take some time. In the meantime, have a COFFEE BREAK!!!!
After the installation, whenever you want to do something using Hortonworks Hadoop you need to click the Start button in VirtualBox; booting takes some time and ends with the screen shown below.
 
 
After the installation process and obtaining the screen above, your system will become very slow. Don't panic: Hadoop requires a lot of RAM, so I recommend at least 6-8 GB of RAM to run Hadoop. If you cannot afford buying such a system but are hungry to learn BIG DATA, check out the AWS service, which comes with a certain trial period.

Copy and paste that URL into your web browser, which opens a window; go to the advanced settings and start Hortonworks.
It will ask for a username and password
username: root
password: hadoop (while typing the password you will not see the cursor moving; don't worry, just type hadoop and press Enter)
As we have the Java code ready, we need to create these Java files using the Linux vi command. After editing a file we use the following commands to save and exit the editor: :w to write the file and :q to quit the editor window and come back to the shell.
Please look at the editor window opened in my shell using 127.0.0.1:4200


The screen below is where I edited my SumReducer.java, WordMapper.java and WordCount.java files.




Once your Java files are ready for execution, we need to create one new folder to hold the class files we are going to compile from the Java sources.
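For example, the folder can be created from the shell; the name WCclasses simply matches what the javac and jar commands below expect.

mkdir WCclasses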
After creating the folder for class files, we have to execute the following commands from the shell.

javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
#-----
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
#-----
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar:WCclasses -d WCclasses WordCount.java
#----


By using the code above we will be able to create class files of SumReducer, WordMapper & WordCount
What these programs essentially do is this: we have three large text files with a lot of words, and we split the counting of those words across the cluster using a mapper and a reducer program.
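As a rough idea of what these files contain, here is a minimal sketch of the standard Hadoop word-count mapper and reducer; the actual WordMapper.java and SumReducer.java used in this post may differ in details, and each class lives in its own .java file with the imports shown.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordMapper.java -- emits (word, 1) for every token in each input line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// SumReducer.java -- sums all the counts emitted for each word.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}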

As we now have the class files of SumReducer, WordMapper and WordCount, we create a jar file using the following command.
jar -cvf WordCount.jar -C WCclasses/ .

The next step is to create a folder in the HDFS file system using the following command.

hdfs dfs -mkdir -p /user/ru1/wc-input

After creating this folder we have to upload the text files using the Hue File Browser at 127.0.0.1:8000 in our web browser.
After uploading the files through the File Browser, it looks as follows.


Now it's time to execute the Hadoop jar file. Let's use the following command to do so.
hadoop jar WordCount.jar WordCount /user/ru1/wc-input /user/ru1/wc-out


After it is executed without any errors, we need to track the status of the application on the All Applications page at 127.0.0.1:8088.
The screen looks as follows




In this step we should see SUCCEEDED against the respective application. After confirming the success status we open the Hue File Browser, where we will see a new folder created called wc-out (the output path we gave on the shell command line).

In this folder there will be two files, called _SUCCESS and part-r-00000. The part-r-00000 file is where we can check the output of the program: which words occur and the frequency of each.
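You can also view the output directly from the shell instead of the File Browser; the path below matches the output directory passed to the hadoop jar command above.

hdfs dfs -cat /user/ru1/wc-out/part-r-00000 | head -20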


 

Sunday, September 20, 2015

SAS-Made Easy

In this blog we will go through explanations of some sample SAS programs. I will provide each question and its solution with an explanation.

Ques : Using the SAS data set Hosp, use PROC PRINT to list observations for Subject
values of 5, 100, 150, and 200. Do this twice, once using OR operators and once using the IN operator.
Note: Subject is a numeric variable.

Sol: 
First, let's understand the question clearly. It has two parts: first, we use the Hosp data set and keep only the observations whose Subject value is 5, 100, 150 or 200;
and second, we repeat the same program using the IN operator.

Let's write the program.


Selecting the observations without using IN
libname a15020 "D:\PRAXIS\sas/a15020";
libname learn "D:/PRAXIS/sas/sas 9.2/data sets";
data a15020.hosp1;
set learn.hosp;
if subject eq 5 or subject eq 100 or subject eq 150 or subject eq 200;
run;
proc print data=a15020.hosp1;
run;

This is the program for the question asked; let us understand it step by step.
Dividing the program into lines, there are 8 lines in the program.
 In the 1st line  libname a15020 "D:\PRAXIS\sas/a15020";

Here we create a library reference using the LIBNAME statement. The name we give it is a15020, followed by the location where you want to save the data set permanently; the path goes inside double quotes (" ") and the statement ends with a semicolon (;).

In the 2nd line  libname learn "D:/PRAXIS/sas/sas 9.2/data sets";

I have already stored the Hosp data in the learn library, so to point to it we give the location of learn.

3rd line data a15020.hosp1;
This is the data step. Since we are storing the data permanently in the a15020 library, we give the data set name along with the library name, a15020.hosp1; here hosp1 is the data set name.

4th line set learn.hosp;
We use the SET statement to read the hosp data set that we already stored in the learn library.
SET is used to load existing data into the new data set we are creating.

5th line if subject eq 5 or subject eq 100 or subject eq 150 or subject eq 200;


We use a subsetting IF statement here.
We are telling SAS to keep only the observations where subject is 5, 100, 150 or 200.


Here eq is the equality operator (=).


6th line run;
This tells SAS that our data step has come to an end.
(Remember to put the ; at the end of every statement.)


7th line proc print data=a15020.hosp1;

This prints the observations stored in the data set a15020.hosp1.


8th line run;
As the proc step has come to an end, we close it with a run statement, in the same way
we ended the data step.

output :


But in the above program we wrote many OR conditions just to select 5, 100, 150 and 200.
To avoid this, SAS has the IN operator.
The program below demonstrates the IN operator.

data a15020.usingin;
set learn.hosp;
if subject in (5 100 150 200);
run;
proc print data=a15020.usingin;
run;



Without using IN we wrote a long condition:
if subject eq 5 or subject eq 100 or subject eq 150 or subject eq 200;
Using IN we simply write:
if subject in (5 100 150 200); -- this statement tells SAS to keep only the observations whose subject value is 5, 100, 150 or 200.

Ques: Using the Sales data set, create a new, temporary SAS data set containing Region and TotalSales plus a new variable called Weight with values of 1.5 for the North Region, 1.7 for the South Region, and 2.0 for the West and East Regions. Use a SELECT statement to do this.

Sol: In the Sales data set we have Region and TotalSales. We have to add a new variable called Weight, give it the appropriate value for each region, and do this using a SELECT statement.

data a15020.newsales;
set learn.sales(keep=Region TotalSales);
select ;
when (region eq 'North') weight =1.5;
when (region eq 'South') weight =1.7;
when (region eq 'East') weight =2.0;
when (region eq 'West') weight =2.0;
otherwise;
end;
run;
proc print data=a15020.newsales;
run;



Let's understand this simple program.
We create a permanent data set newsales in the library a15020, set the Sales data set, and keep only the Region and TotalSales variables; next we use the SELECT statement.
The SELECT statement is an alternative to a chain of IF / ELSE IF statements. Inside it we use WHEN clauses: when the region is North the weight is 1.5, and so on for the other regions, so every region gets its weight. The OTHERWISE clause can be left empty, or you can say what to do for values other than the ones listed. The SELECT block ends with an END statement; we then end the data step and ask SAS to print our output.
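For comparison, here is a minimal sketch of the same logic written with IF / ELSE IF instead of SELECT (newsales_if is just a hypothetical data set name for this illustration):

data a15020.newsales_if;
set learn.sales(keep=Region TotalSales);
* Same weights as above, assigned with IF / ELSE IF instead of SELECT;
if region eq 'North' then weight=1.5;
else if region eq 'South' then weight=1.7;
else if region eq 'East' or region eq 'West' then weight=2.0;
run;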

Output:


Ques:  Count the number of missing values for the variables A, B, and C in the Missing data
set. Add the cumulative number of missing values to each observation (use variable
names MissA, MissB, and MissC). Use the MISSING function to test for the missing
values.

Sol: The question asks us to count the missing values of the variables A, B and C in the Missing data set and to store the cumulative counts in new variables named MissA, MissB and MissC.

Before writing this program, let's understand the sum statement in SAS; once that is clear, the program is simple.

variable + increment;
the above statement will do the following things
Variable is retained
Variable is initialized at 0
Missing values (of increment) are ignored

data a15020.missing;
infile "D:/PRAXIS/sas/SAS BY EXAMPLE/missing.txt" missover;
input A B C;
if missing(A) then MissA+1;
if missing(B) then MissB+1;
if missing(C) then MissC+1;
run;
proc print data=a15020.missing;
run;


Everything up to the INPUT statement is the same as discussed above. Then we say: if the value of A is missing, add one to a new variable called MissA; this is repeated for every observation, so MissA ends up holding the cumulative count of missing A values.

Here missing(A) tells SAS whether the value of A is missing.
MissA+1 is a sum statement: it creates the variable MissA, initializes it to 0, retains it across observations, and adds one each time a missing value is found.
The same is repeated for MissB and MissC.

Output: 
 
Ques: You are testing three speed-reading methods (A, B, and C) by randomly assigning
10 subjects to each of the three methods. You are given the results as three lines of
reading speeds, each line representing the results from each of the three methods,respectively. Here are the results:
250 255 256 300 244 268 301 322 256 333
267 275 256 320 250 340 345 290 280 300
350 350 340 290 377 401 380 310 299 399
Create a temporary SAS data set from these three lines of data. Each observation should contain Method (A, B, or C), and Score. There should be 30 observations in this data set. Use a DO loop to create the Method variable and remember to use a single trailing @ in your INPUT statement. Provide a listing of this data set using PROC PRINT.

Sol: Here we have to assign the 30 observations to the three methods A, B and C, reading them with nested DO loops.

data a15020.speed;
do method= 'A','B','C';
do n=1 to 10;
input score @;
output;
end;
end;
datalines;
250 255 256 300 244 268 301 322 256 333
267 275 256 320 250 340 345 290 280 300
350 350 340 290 377 401 380 310 299 399
;
proc print data=a15020.speed;
run;


We use DO loops, telling SAS to create the Method values A, B and C and, for every method, to repeat the inner loop 10 times; that is why we wrote
do method= 'A','B','C';
do n=1 to 10;
We ask SAS to write an observation at each iteration, so we give the output statement. Every DO loop must end with an end statement; since we used two DO loops, we used two END statements.

For reading the scores we use a single trailing @:
input score @;

The trailing @ tells SAS that the values we are providing are on a single line, so it should hold that line rather than move to a new one after each INPUT statement.
If we did not provide the @, SAS would move to a new line for every value, so method A would only pick up the first value of each line (250, 267 and 350).
To read all 10 values sitting on a single line, we therefore use the trailing @ sign.

We use the DATALINES statement to provide the data directly inside the SAS program,
and then print the data.

Output: 
 

Ques : Using the Hosp data set, compute the subject’s ages two ways: as of January 1, 2006
(call it AgeJan1), and as of today’s date (call it AgeToday). The variable DOB
represents the date of birth. Take the integer portion of both ages. List the first 10
observations.

Sol: Using the Hosp data set, we create two ages from the DOB variable: one as of January 1, 2006 (AgeJan1) and one as of today's date (AgeToday), taking only the integer portion of each.

libname learn "D:/PRAXIS/sas/sas 9.2/data sets";
data a15020.hosp3;
set learn.hosp;
agejan1=int(yrdif(dob,'01JAN2006'd,'actual'));
agetoday=int(yrdif(dob,today(),'actual'));
proc print data=a15020.hosp3 (obs=10);
var agejan1 dob agetoday;
run;


We create a data set hosp3 and set the Hosp data set using the SET statement as discussed.
Next we create a new variable,
agejan1, in which we calculate the age using the YRDIF function.
YRDIF takes three arguments: the start date, the end date, and the day-count basis (here 'actual').
The 'actual' basis uses the exact number of days in each month, so September counts as 30 days and October as 31 days;
if we do not use 'actual', a 30/360-style basis treats every month as 30 days in the calculation.

If we want to calculate YRDIF up to a particular date, for example January 1, 2009, SAS has a special date literal format for it: '01Jan2009'd.

So in the agejan1 variable, the INT function takes the integer portion of the calculated age.

If we want to calculate the age as of today, we use the TODAY() function as the end date.

In the PROC PRINT we use obs=10, which prints only the first 10 observations, and a VAR statement, which tells SAS to print only the variables listed in it.

Output:
 
Ques:You want to see each patient in the Medical data set on the same day of the week 5 weeks after they visited the clinic (the variable name is VisitDate). Provide a listing of the patient number (Patno), the visit date, and the date for the return visit.

Sol: A hospital stores its patient data in the Medical data set, where each patient's visit date has been recorded.
The doctor wants to see each patient again 5 weeks later, on the same day of the week, so we have to calculate the return visit date 5 weeks after the original visit.

data a15020.returnvisit;
set learn.medical;
return_visit=intnx('week',visitdate,5,'sameday');
proc print data=a15020.returnvisit;
format return_visit mmddyy10.;
run;




Here we use the INTNX function, which advances a date by a given number of intervals. The first argument is the interval, 'week', because we want the date 5 weeks later (if we wanted months instead of weeks we would use 'month'). The second argument is the date variable on which the operation is performed, the third argument is how many intervals to advance (5 in this problem), and the fourth argument, 'sameday', asks SAS to return the same day of the week rather than the start of the interval.
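For example, here is a one-line sketch of the same idea with a monthly interval (return_visit_month is a hypothetical variable name for illustration; it would sit inside a data step like the one above):

* Same day of the month, one month after the visit (illustration only);
return_visit_month = intnx('month', visitdate, 1, 'sameday');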

Output: 










Ques: Using the SAS data set Bicycles, create two temporary SAS data sets as follows: Mountain_USA consists of all observations from Bicycles where Country is USA and Model is Mountain. Road_France consists of all observations from Bicycles where Country is France and Model is Road Bike. Print these two data sets.


Sol: So far we have created only one data set per DATA statement, but here we create two data sets in a single data step and print both. The problem asks us to split the Bicycles data set by Country and Model and write the observations to Mountain_USA and Road_France respectively.
 
libname a15020 "D:/PRAXIS/sas/a15020";
data a15020.mountain_usa a15020.road_france;
set learn.bicycles;

if country eq 'USA' and model eq 'Mountain Bike' then output a15020.mountain_usa;
else if country eq 'France' and model eq 'Road Bike' then output a15020.road_france;
run;
proc print data=a15020.mountain_usa;
run;

proc print data=a15020.road_france;
run;
Here in the data step we create two permanent data sets, mountain_usa and road_france, in the a15020 library, and then use the SET statement to read the data from the Bicycles data set.

We then separate the data using IF conditions: if the country is USA and the model is Mountain Bike, the observation is written to mountain_usa; otherwise, if the country is France and the model is Road Bike, it is written to road_france.
To print the output, specify in each PROC PRINT the data set whose output you want printed.

Output:  for Mountain_usa




for Road_france:


Ques:Using the data set Sales, create the report shown here:
Sales Figures from the SALES Data Set

                          Total
Region    Quantity        Sales
East           100        699.0
               100        899.0
               500      19995.0
                 2      20000.0
------    --------     --------
East           702      41593.0

West            50        299.5
              1000       1990.0
------    --------     --------
West          1050       2289.5
          ========     ========
              2022      77113.0

(The report continues in the specified format for all the specified regions; the final line shows the grand totals across all regions.)


Sol: The problem asks us to break the Sales data set down by region and give the sum of Quantity and TotalSales for each region, plus a grand total.

Here's the program:


data a15020.region;
set learn.sales;
run;
proc sort data=a15020.region;
by region;
run;
proc print data=a15020.region;
by region;
var region quantity totalsales;
sum quantity totalsales;
run;


Here we create a permanent data set region in the library a15020 and set the values present in the Sales data set.
To break the data down by region, we first have to sort it by region.
To sort, use a PROC SORT step followed by a BY statement, which sorts the data on the variable named in the BY statement; since we want the data segregated by region, we give Region as the BY variable.
Then we print the data BY region. We want only the Region, Quantity and TotalSales columns,
so we list them in a VAR statement;
we use a SUM statement to total Quantity and TotalSales column-wise (giving subtotals per region and a grand total), and then end the step with a run statement.

Output:



 


 

Ques: Using the data set Blood, produce a report like the one here. The numbers in the table
are the average WBC and RBC counts for each combination of blood type and
gender.
Average Blood Counts by Gender

Blood     Female     Male   Female   Male
Type         WBC      WBC      RBC    RBC
A          7,218    7,051     5.47   5.46
AB         7,421    6,893     5.43   5.69
B          6,716    6,991     5.52   5.42
O          7,050    6,930     5.53   5.48

Sol:
The problem asks us to compute the average WBC and RBC counts for each combination of blood type and gender and print them in the desired format.

For this type of layout we use PROC REPORT.


data a15020.blood;
set learn.blood;
run;
proc report data=a15020.blood;
column blood_grp gender,wbc gender,rbc;
define blood_grp / group;
define gender/across width=8;
define wbc/analysis mean ;
define rbc /analysis mean;
run;


In PROC REPORT, we specify which variables to print with the COLUMN statement, which is similar to the VAR statement in PROC PRINT.
According to our question we have to split by gender, and within gender we want WBC and RBC separately,
so in COLUMN we specify gender,wbc and gender,rbc; the comma nests WBC and RBC under Gender, which is needed in the later steps.
Next we DEFINE blood_grp with /group, which tells PROC REPORT to group the rows by blood group.
Next, DEFINE gender /across; ACROSS spreads the gender values across the columns.
Next, DEFINE wbc /analysis mean; ANALYSIS is used because it is a numeric variable, and since the default statistic is SUM we explicitly request MEAN.
The same is done for RBC, and the PROC REPORT step ends with a run statement.

Output:
 

You can find some more programs by downloading them from the link below:
https://drive.google.com/file/d/0B1v-Qh27ObFVc3lxMHVoR2VJSlU/view?usp=sharing