Monday, June 25, 2018

Apache Spark - Work



Starting from where we left off last time, and assuming both the Hive and Spark configurations are ready to go, let us show all the tables in the Hive data warehouse.
Step 1: Create a Spark SQLContext to read the Hive metastore
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
The result will be something like the image below.
Step 2: Create a DataFrame from a Hive external table
The Hive tables are in front of us, so let us create a Spark DataFrame by querying the country table. I want to query only the countries whose population exceeds 200 million.
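A minimal sketch of that query in the spark-shell; the column names Name and Population follow the MySQL world sample database from which the country table was imported:

// Query the Hive table `country` through the sqlContext created in Step 1
val bigCountries = sqlContext.sql(
  "SELECT Name, Population FROM country WHERE Population > 200000000")
bigCountries.show()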

Step 3: Create your own tables
We will create three data frames and use them for multiple purposes; a sketch of the whole step follows the list below.
Data frame fruit1

Data frame fruit2


Join fruit1 and fruit2


Union the three tables
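Since the original screenshots are gone, here is a rough sketch of the whole step in the spark-shell. The fruit rows and the join key are invented for illustration; only the column names id, name and diet come from the save step in Step 4:

// Three small data frames built from in-memory sequences (sample rows are made up)
val fruit1 = Seq((1, "apple", "low sugar"), (2, "banana", "high sugar")).toDF("id", "name", "diet")
val fruit2 = Seq((2, "banana", "high sugar"), (3, "mango", "high sugar")).toDF("id", "name", "diet")
val fruit3 = Seq((4, "avocado", "low sugar")).toDF("id", "name", "diet")

// Join fruit1 and fruit2 on the id column (the inner join keeps only the matching row)
val joined = fruit1.join(fruit2, Seq("id"))
joined.show()

// Union the three data frames into one table called fruits, dropping duplicate rows
val fruits = fruit1.union(fruit2).union(fruit3).distinct()
fruits.show()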

Step 4: Save the last table

i- Create a temporary table

fruits.createOrReplaceTempView("fruitsTable")

ii- Use a Hive statement to create a table and dump the data from your temp table.

sqlContext.sql("create table fruitsDeit as select * from fruitsTable");

This saves directly into the Hive metastore; to confirm, let us check back in Hive.
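A quick way to confirm from the Hive shell (a sketch):

-- Run in the hive shell: the table created from Spark should appear
show tables;
select * from fruitsdiet limit 5;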

The table fruitsdiet is there.

iii- Save to HDFS

fruits.select("id", "name", "diet").write.save("/user/hduser/fruittable.parquet")

Take a look at the web interface.
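Besides the NameNode web interface, the same output can be checked from the command line (a sketch):

# List the saved parquet output in HDFS
hdfs dfs -ls /user/hduser/fruittable.parquet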

Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own as a standalone cluster, but here we will be running it on Hadoop YARN. First, let us get done with the configuration part.

1. Download and Install Spark Binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:~$ sudo chown -R kui:hd spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration, which means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be present alongside the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
4. Configure the spark-env.sh file

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS:

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs

6. Start the Spark shell
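With the variables above in place, the shell can be launched against YARN. A minimal sketch of the command (client mode is the default for the interactive shell):

# Launch the interactive Spark shell with YARN as the resource manager
hduser@ubuntu:~$ spark-shell --master yarn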

It seems that our Spark is on, but let us take a closer look at whether we can access the Apache Hive tables.
This is going to need a little bit of work; specifically, we need two things:
i- A Spark HiveContext
ii- An SQLContext to run SQL operations
To do that, put in the following code:

import org.apache.spark.SparkConf                                               
import org.apache.spark.sql.hive.HiveContext                                      
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")                                                            
val sqlContext = new HiveContext(sc)                                             
import sqlContext.implicits._   
Let's do just that.
Right now Spark is blind to the Hive tables.
There is a reason it is not showing what is in Hive: it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
7. To show Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


sudo cp /usr/local/hive/conf/hive-site.xml  /usr/local/spark/conf
Check back in Apache Spark.
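A minimal check from the same spark-shell session, assuming the sqlContext created earlier is still available:

// List the Hive tables now visible to Spark through the copied hive-site.xml
sqlContext.sql("show tables").show()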
It worked fine; the Hive tables now show up.
Good luck with the Spark configuration.

Sunday, June 17, 2018

How to Work with Apache Hive

1. Creating an external table in Hive

As a first step, let us check if there is any table at all in the Hive warehouse,
because after configuration it is really on the user to either create tables or put data into the Hive warehouse to be used and analyzed, or both.
Obviously this newly configured Hive does not contain a single table, so let us create one.
For demonstration I will put together a table and call it human.
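The original screenshot is gone, so here is an illustrative sketch of such a statement in the Hive shell; the columns and types are invented purely for demonstration:

-- Demonstration table; column names and types are made up for illustration
CREATE TABLE human (
  id INT,
  name STRING,
  age INT,
  country STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';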

Wow, we have our table; moving on to the next level.

1.2 Loading a data table into Hive from HDFS without removing the source file
In an earlier post we ingested data into Hadoop. There were multiple tables; however, I will go for the table called productsubcategory because it has the fewest variables in it. There are two ways to do this:

i- Load from HDFS into the Hive data warehouse without removing the source table from HDFS
First we have to survey the table structure in MySQL and then create the table.

hd@ubuntu:~$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 5.7.22-0ubuntu0.16.04.1 (Ubuntu)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| adventureworks     |
| menagerie          |
| metastore          |
| mysql              |
| performance_schema |
| sys                |
| world              |
| world_x            |
| worldx             |
+--------------------+
10 rows in set (0.00 sec)

mysql> use adventureworks;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------------------------+
| Tables_in_adventureworks                      |
+-----------------------------------------------+
| adventureworksdwbuildversion                  |
| databaselog                                   |
| dimaccount                                    |
| dimcurrency                                   |
| dimcustomer                                   |
| dimdate                                       |
| dimdepartmentgroup                            |
| dimemployee                                   |
| dimgeography                                  |
| dimorganization                               |
| dimproduct                                    |
| dimproductcategory                            |
| dimproductsubcategory                         |
| dimpromotion                                  |
| dimreseller                                   |
| dimsalesreason                                |
| dimsalesterritory                             |
| dimscenario                                   |
| factadditionalinternationalproductdescription |
| factcallcenter                                |
| factcurrencyrate                              |
| factfinance                                   |
| factinternetsales                             |
| factinternetsalesreason                       |
| factproductinventory                          |
| factresellersales                             |
| factsalesquota                                |
| factsurveyresponse                            |
| newfactcurrencyrate                           |
| prospectivebuyer                              |
| sysdiagrams                                   |
+-----------------------------------------------+
31 rows in set (0.00 sec)

mysql> describe dimproductsubcategory;
+--------------------------------+-------------+------+-----+---------+----------------+
| Field                          | Type        | Null | Key | Default | Extra          |
+--------------------------------+-------------+------+-----+---------+----------------+
| ProductSubcategoryKey          | int(11)     | NO   | PRI | NULL    | auto_increment |
| ProductSubcategoryAlternateKey | int(11)     | YES  | UNI | NULL    |                |
| EnglishProductSubcategoryName  | varchar(50) | NO   |     | NULL    |                |
| SpanishProductSubcategoryName  | varchar(50) | NO   |     | NULL    |                |
| FrenchProductSubcategoryName   | varchar(50) | NO   |     | NULL    |                |
| ProductCategoryKey             | int(11)     | YES  | MUL | NULL    |                |
+--------------------------------+-------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

mysql>

After careful examination and learning of the variables, we can then create the table productsubcategory.
Here is how:
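Based on the DESCRIBE output above, the Hive DDL could look roughly like the sketch below. Declaring the table EXTERNAL with a LOCATION pointing at the directory Sqoop wrote to is what keeps the source files in HDFS; the path here is an assumption based on the warehouse directory used in the earlier Sqoop post, and the comma delimiter matches Sqoop's default text output:

-- External table over the files already sitting in HDFS; dropping it leaves the data in place
CREATE EXTERNAL TABLE productsubcategory (
  productsubcategorykey INT,
  productsubcategoryalternatekey INT,
  englishproductsubcategoryname STRING,
  spanishproductsubcategoryname STRING,
  frenchproductsubcategoryname STRING,
  productcategorykey INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hduser/mydata/dimproductsubcategory';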

ii- Load a table from MySQL into the Hive data warehouse. In case your table is in MySQL, the tool is the famous Sqoop.
Load the table country from the database world into the Hive data warehouse:

sqoop import --connect jdbc:mysql://localhost/world --username root --password (mysql password) --table country --hive-import --hive-table country -m 1

Observe the table in Hive.

How many country names are in the table? Let us take a glimpse of Hive's inner workings.
I gave it the count-the-names instruction and it came up with the number below.
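The count itself is a one-liner in the Hive shell (a sketch):

-- Count the country names in the imported table
SELECT COUNT(name) FROM country;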

The next stop will be either data analysis in Hive or a move to Apache Spark.


Saturday, June 9, 2018

Using Sqoop to Ingest Data Into Hadoop

To do this, the Hadoop ecosystem should be configured and ready with the following:
i- Hadoop with YARN
ii- Apache Sqoop
iii- Apache Hive
iv- MySQL with a database
For the last one, just use your preferred method to put some data into MySQL: either use MySQL Workbench to load data into MySQL, or create a database and a table or two in it.
If not, please take a look at the earlier configuration posts to complete them.
When ready to make the next move, let us Sqoop some data. For this example I used the AdventureWorks sample database because it has multiple tables and is available for download.

1. Create a noticeable directory in HDFS (Hadoop Distributed File System) and let us give it the name mydata.
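A sketch of that step from the terminal:

# Create a landing directory in HDFS for the ingested tables
hdfs dfs -mkdir -p /user/hduser/mydata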



2. The folder mydata now rests in HDFS, and we want to ingest data into that folder using our famous Sqoop, either one table at a time or all database tables at once. We will try both to see how it works.
i- One table at a time

sqoop import --connect "jdbc:mysql://localhost/(databasename)" --username root --password (mysqlpassword) --table (tablename) --warehouse-dir /user/hduser/mydata -m 1
Let's take a look at the databases in MySQL first, shall we?

We will be using two databases: world (its country table for the single-table import) and adventureworks for the all-tables command.

After executing the above command, let us take a look at the contents of mydata.
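One way to peek at the folder from the terminal (a sketch):

# List what Sqoop wrote under the warehouse directory
hdfs dfs -ls /user/hduser/mydata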


Yes, the table country was ingested into HDFS.

ii- All tables at once

sqoop import-all-tables --connect jdbc:mysql://localhost/adventureworks --username root --password (mysqlpassword) --as-textfile -m 1 --warehouse-dir /user/hduser/mydata
With the database name and password sorted out in the above command, let's take a look inside our mydata folder.

This is what we got:
it ingested the 31 tables of adventureworks into our poor folder, along with the table country which we imported earlier. If you got results like the above, roll up your sleeves; the work has just started.

Sunday, June 3, 2018

Apache Sqoop

 SQOOP CONFIGURATION

To install and configure Apache Sqoop, the first step is to find the most stable version that is compatible with the Hadoop already configured on your system,
in my case Hadoop 2.7.3.
1. Download Sqoop

wget https://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

sudo tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop
sudo chown -R kui:hd sqoop
sudo chmod 777 -R /usr/local/sqoop
2. Set Environment Variables
Go to $HOME/.bashrc and set the variables using sudo vi:

export SQOOP_HOME=/usr/local/sqoop
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASSPATH=$SQOOP_CONF_DIR
export PATH=$SQOOP_HOME/bin:$PATH
export PATH=$PATH:/usr/local/sqoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HBASE_HOME=/usr/local/hbase
3. Configuring Sqoop
The file sqoop-env.sh has to be copied from sqoop-env-template.sh because initially it isn't there,
so use the commands below to create one:

cd $SQOOP_HOME/conf
cp sqoop-env-template.sh sqoop-env.sh
sudo vi sqoop-env.sh
 export HADOOP_COMMON_HOME=/usr/local/hadoop
 export HADOOP_MAPRED_HOME=/usr/local/hadoop
4. Check the Sqoop Version
$ sqoop version

5. Create a folder in /usr/lib
sudo mkdir /usr/lib/sqoop

6. Put the DB Connection Library Into $SQOOP_HOME/lib
Place the MySQL connector library into the Sqoop home so Sqoop can communicate with MySQL:
wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz
sudo tar -zxvf mysql-connector-java-5.1.45.tar.gz
sudo cp mysql-connector-java-5.1.45/mysql-connector-java-5.1.45-bin.jar /usr/local/sqoop/lib
Finally, test that Sqoop can connect to MySQL and list the existing databases in MySQL.
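A quick connectivity test could look like this (a sketch; substitute your own MySQL password):

# Verify Sqoop can reach MySQL by listing its databases
sqoop list-databases --connect jdbc:mysql://localhost --username root --password (mysqlpassword)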


Now, how to work with Sqoop --- next.
Apache Sqoop is the mastermind behind data movement between relational databases and Hadoop.
This is true both for HDFS and for the Apache Hive data warehouse; you can do:
i- Direct ingestion into HDFS
ii- Direct ingestion into the Hive data warehouse
iii- Loading from HDFS into Hive
In all cases you should receive the intended output.
