Monday, June 25, 2018

Apache Spark - Work



Starting from where we left off last time, and assuming both the Hive and Spark configurations are ready to go, let us show all the tables in the Hive data warehouse.
Step 1: Create a Spark SQLContext to read the Hive metastore
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
The result will be something like the image below.
Step 2: Create a DataFrame from a Hive external table
The Hive tables are in front of us, so let us create a Spark DataFrame by querying the country table. I want to query only the countries whose population exceeds 200 million.
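A minimal sketch of that query in the spark-shell; the column names Name and Population follow the MySQL world sample database from which the country table was imported:

// Query the Hive table `country` through the sqlContext created in Step 1
val bigCountries = sqlContext.sql(
  "SELECT Name, Population FROM country WHERE Population > 200000000")
bigCountries.show()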

Step 3: Create your own tables
We will create three data frames and use them for multiple purposes; a sketch of the whole step follows the list below.
Data frame fruit1

Data frame fruit2


Join fruit1 and fruit2


Union the three tables
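Since the original screenshots are gone, here is a rough sketch of the whole step in the spark-shell. The fruit rows and the join key are invented for illustration; only the column names id, name and diet come from the save step in Step 4:

// Three small data frames built from in-memory sequences (sample rows are made up)
val fruit1 = Seq((1, "apple", "low sugar"), (2, "banana", "high sugar")).toDF("id", "name", "diet")
val fruit2 = Seq((2, "banana", "high sugar"), (3, "mango", "high sugar")).toDF("id", "name", "diet")
val fruit3 = Seq((4, "avocado", "low sugar")).toDF("id", "name", "diet")

// Join fruit1 and fruit2 on the id column (the inner join keeps only the matching row)
val joined = fruit1.join(fruit2, Seq("id"))
joined.show()

// Union the three data frames into one table called fruits, dropping duplicate rows
val fruits = fruit1.union(fruit2).union(fruit3).distinct()
fruits.show()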

Step 4: Save the last table

i- Create a temporary table

fruits.createOrReplaceTempView("fruitsTable")

ii- Use a Hive statement to create a table and dump the data from your temp table.

sqlContext.sql("create table fruitsDeit as select * from fruitsTable");

This saves directly into the Hive metastore; to confirm, let us check back in Hive.
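A quick way to confirm from the Hive shell (a sketch):

-- Run in the hive shell: the table created from Spark should appear
show tables;
select * from fruitsdiet limit 5;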

The table fruitsdiet is there.

iii- Save to HDFS

fruits.select("id", "name", "diet").write.save("/user/hduser/fruittable.parquet")

Take a look at the web interface.
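Besides the NameNode web interface, the same output can be checked from the command line (a sketch):

# List the saved parquet output in HDFS
hdfs dfs -ls /user/hduser/fruittable.parquet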

Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own as a standalone cluster, but here we will be running it on Hadoop YARN. First, let us get done with the configuration part.

1. Download and Install Spark Binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:~$ sudo chown -R kui:hd spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration, which means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be present alongside the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
4. Configure the spark-env.sh file

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS:

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs

6. Start the Spark shell
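With the variables above in place, the shell can be launched against YARN. A minimal sketch of the command (client mode is the default for the interactive shell):

# Launch the interactive Spark shell with YARN as the resource manager
hduser@ubuntu:~$ spark-shell --master yarn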

It seems that our Spark is on, but let us take a closer look at whether we can access the Apache Hive tables.
This is going to need a little bit of work; specifically, we need two things:
i- A Spark HiveContext
ii- An SQLContext to run SQL operations
To do that, put in the following code:

import org.apache.spark.SparkConf                                               
import org.apache.spark.sql.hive.HiveContext                                      
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")                                                            
val sqlContext = new HiveContext(sc)                                             
import sqlContext.implicits._   
Let's do just that.
Right now Spark is blind to the Hive tables.
There is a reason it is not showing what is in Hive: it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
7. To show Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


sudo cp /usr/local/hive/conf/hive-site.xml  /usr/local/spark/conf
Check back in Apache Spark.
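A minimal check from the same spark-shell session, assuming the sqlContext created earlier is still available:

// List the Hive tables now visible to Spark through the copied hive-site.xml
sqlContext.sql("show tables").show()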
It worked fine; the Hive tables now show up.
Good luck with the Spark configuration.

Sunday, June 17, 2018

How to Work with Apache Hive

1. Creating an external table in Hive

As a first step, let us check if there is any table at all in the Hive warehouse,
because after configuration it is really on the user to either create tables or put data into the Hive warehouse to be used and analyzed, or both.
Obviously this newly configured Hive does not contain a single table, so let us create one.
For demonstration I will put together a table and call it human.
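The original screenshot is gone, so here is an illustrative sketch of such a statement in the Hive shell; the columns and types are invented purely for demonstration:

-- Demonstration table; column names and types are made up for illustration
CREATE TABLE human (
  id INT,
  name STRING,
  age INT,
  country STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';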

Wow, we have our table; moving on to the next level.

1.2 Loading a data table into Hive from HDFS without removing the source file
In an earlier post we ingested data into Hadoop. There were multiple tables; however, I will go for the table called productsubcategory because it has the fewest variables in it. There are two ways to do this:

i- Load from HDFS into the Hive data warehouse without removing the source table from HDFS
First we have to survey the table structure in MySQL and then create the table.

hd@ubuntu:~$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 5.7.22-0ubuntu0.16.04.1 (Ubuntu)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| adventureworks     |
| menagerie          |
| metastore          |
| mysql              |
| performance_schema |
| sys                |
| world              |
| world_x            |
| worldx             |
+--------------------+
10 rows in set (0.00 sec)

mysql> use adventureworks;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------------------------+
| Tables_in_adventureworks                      |
+-----------------------------------------------+
| adventureworksdwbuildversion                  |
| databaselog                                   |
| dimaccount                                    |
| dimcurrency                                   |
| dimcustomer                                   |
| dimdate                                       |
| dimdepartmentgroup                            |
| dimemployee                                   |
| dimgeography                                  |
| dimorganization                               |
| dimproduct                                    |
| dimproductcategory                            |
| dimproductsubcategory                         |
| dimpromotion                                  |
| dimreseller                                   |
| dimsalesreason                                |
| dimsalesterritory                             |
| dimscenario                                   |
| factadditionalinternationalproductdescription |
| factcallcenter                                |
| factcurrencyrate                              |
| factfinance                                   |
| factinternetsales                             |
| factinternetsalesreason                       |
| factproductinventory                          |
| factresellersales                             |
| factsalesquota                                |
| factsurveyresponse                            |
| newfactcurrencyrate                           |
| prospectivebuyer                              |
| sysdiagrams                                   |
+-----------------------------------------------+
31 rows in set (0.00 sec)

mysql> describe dimproductsubcategory;
+--------------------------------+-------------+------+-----+---------+----------------+
| Field                          | Type        | Null | Key | Default | Extra          |
+--------------------------------+-------------+------+-----+---------+----------------+
| ProductSubcategoryKey          | int(11)     | NO   | PRI | NULL    | auto_increment |
| ProductSubcategoryAlternateKey | int(11)     | YES  | UNI | NULL    |                |
| EnglishProductSubcategoryName  | varchar(50) | NO   |     | NULL    |                |
| SpanishProductSubcategoryName  | varchar(50) | NO   |     | NULL    |                |
| FrenchProductSubcategoryName   | varchar(50) | NO   |     | NULL    |                |
| ProductCategoryKey             | int(11)     | YES  | MUL | NULL    |                |
+--------------------------------+-------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

mysql>

After careful examination and learning of the variables, we can then create the table productsubcategory.
Here is how:
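Based on the DESCRIBE output above, the Hive DDL could look roughly like the sketch below. Declaring the table EXTERNAL with a LOCATION pointing at the directory Sqoop wrote to is what keeps the source files in HDFS; the path here is an assumption based on the warehouse directory used in the earlier Sqoop post, and the comma delimiter matches Sqoop's default text output:

-- External table over the files already sitting in HDFS; dropping it leaves the data in place
CREATE EXTERNAL TABLE productsubcategory (
  productsubcategorykey INT,
  productsubcategoryalternatekey INT,
  englishproductsubcategoryname STRING,
  spanishproductsubcategoryname STRING,
  frenchproductsubcategoryname STRING,
  productcategorykey INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hduser/mydata/dimproductsubcategory';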

ii- Load a table from MySQL into the Hive data warehouse. In case your table is in MySQL, the tool is the famous Sqoop.
Load the table country from the database world into the Hive data warehouse:

sqoop import --connect jdbc:mysql://localhost/world --username root --password (mysql password) --table country --hive-import --hive-table country -m 1

Observe the table in Hive.

How many country names are in the table? Let us take a glimpse of Hive's inner workings.
I gave it the count-the-names instruction and it came up with the number below.
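The count itself is a one-liner in the Hive shell (a sketch):

-- Count the country names in the imported table
SELECT COUNT(name) FROM country;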

The next stop will be either data analysis in Hive or a move to Apache Spark.


Saturday, June 9, 2018

Using Sqoop to Ingest Data Into Hadoop

To do this, the Hadoop ecosystem should be configured and ready with the following:
i- Hadoop with YARN
ii- Apache Sqoop
iii- Apache Hive
iv- MySQL with a database
For the last one, just use your preferred method to put some data into MySQL: either use MySQL Workbench to load data into MySQL, or create a database and a table or two in it.
If not, please take a look at the earlier configuration posts to complete them.
When ready to make the next move, let us Sqoop some data. For this example I used the AdventureWorks sample database because it has multiple tables and is available for download.

1. Create a noticeable directory in HDFS (Hadoop Distributed File System) and let us give it the name mydata.
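A sketch of that step from the terminal:

# Create a landing directory in HDFS for the ingested tables
hdfs dfs -mkdir -p /user/hduser/mydata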



2. The folder mydata now rests in HDFS, and we want to ingest data into that folder using our famous Sqoop, either one table at a time or all database tables at once. We will try both to see how it works.
i- One table at a time

sqoop import --connect "jdbc:mysql://localhost/(databasename)" --username root --password (mysqlpassword) --table (tablename) --warehouse-dir /user/hduser/mydata -m 1
Let's take a look at the databases in MySQL first, shall we?

We will be using two databases: world (its country table for the single-table import) and adventureworks for the all-tables command.

After executing the above command, let us take a look at the contents of mydata.
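One way to peek at the folder from the terminal (a sketch):

# List what Sqoop wrote under the warehouse directory
hdfs dfs -ls /user/hduser/mydata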


Yes, the table country was ingested into HDFS.

ii- All tables at once

sqoop import-all-tables --connect jdbc:mysql://localhost/adventureworks --username root --password (mysqlpassword) --as-textfile -m 1 --warehouse-dir /user/hduser/mydata
With the database name and password sorted out in the above command, let's take a look inside our mydata folder.

This is what we got:
it ingested the 31 tables of adventureworks into our poor folder, along with the table country which we imported earlier. If you got results like the above, roll up your sleeves; the work has just started.

Sunday, June 3, 2018

Apache Sqoop

 SQOOP CONFIGURATION

To install and configure Apache Sqoop, the first step is to find the most stable version that is compatible with the Hadoop already configured on your system,
in my case Hadoop 2.7.3.
1. Download Sqoop

wget https://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

sudo tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop
sudo chown -R kui:hd sqoop
sudo chmod 777 -R /usr/local/sqoop
2. Set Environment Variables
Go to $HOME/.bashrc and set the variables using sudo vi:

export SQOOP_HOME=/usr/local/sqoop
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASSPATH=$SQOOP_CONF_DIR
export PATH=$SQOOP_HOME/bin:$PATH
export PATH=$PATH:/usr/local/sqoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HBASE_HOME=/usr/local/hbase
3. Configuring Sqoop
The file sqoop-env.sh has to be copied from sqoop-env-template.sh because initially it isn't there,
so use the commands below to create one:

cd $SQOOP_HOME/conf
cp sqoop-env-template.sh sqoop-env.sh
sudo vi sqoop-env.sh
 export HADOOP_COMMON_HOME=/usr/local/hadoop
 export HADOOP_MAPRED_HOME=/usr/local/hadoop
4. Check the Sqoop Version
$ sqoop version

5. Create a folder in /usr/lib
sudo mkdir /usr/lib/sqoop

6. Put the DB Connection Library Into $SQOOP_HOME/lib
Place the MySQL connector library into the Sqoop home so Sqoop can communicate with MySQL:
wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz
sudo tar -zxvf mysql-connector-java-5.1.45.tar.gz
sudo cp mysql-connector-java-5.1.45/mysql-connector-java-5.1.45-bin.jar /usr/local/sqoop/lib
Finally, test that Sqoop can connect to MySQL and list the existing databases in MySQL.
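A quick connectivity test could look like this (a sketch; substitute your own MySQL password):

# Verify Sqoop can reach MySQL by listing its databases
sqoop list-databases --connect jdbc:mysql://localhost --username root --password (mysqlpassword)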


Now, how to work with Sqoop --- next.
Apache Sqoop is the mastermind behind data movement between relational databases and Hadoop.
This is true both for HDFS and for the Apache Hive data warehouse; you can do:
i- Direct ingestion into HDFS
ii- Direct ingestion into the Hive data warehouse
iii- Loading from HDFS into Hive
In all cases you should receive the intended output.
