Wednesday, July 11, 2018

How to connect R with Apache Spark

R interface 

Step 1. Install R-Base
We begin with the installation of the R base language by simply dropping a few lines into the terminal, as seen below:

sudo apt-get update
sudo apt-get install r-base r-base-dev

Step 2. Installing RStudio

sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.423-amd64.deb
sudo gdebi -n rstudio-xenial-1.1.423-amd64.deb
When done with R base and RStudio, in order to be able to work with SparkR one must do the following:
i. Make sure that SPARK_HOME is set in the environment (check it with Sys.getenv, set it with Sys.setenv)
ii. Load the SparkR library
iii. Initiate a sparkR.session

The three aforementioned steps are shown with code in RStudio below.

# setting up Spark from R

# install the relevant packages
install.packages("sparklyr")
library(sparklyr)
install.packages("dplyr")
library(dplyr)

# install a local Spark version for sparklyr to use
spark_install(version = "2.1.1")

# make sure SPARK_HOME is set in the environment
Sys.setenv(SPARK_HOME = '/usr/local/spark')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# initiate the Spark connection
sc <- spark_connect(master = "local")

# load the SparkR library
library(SparkR)


When these steps are done correctly, Spark will be connected to RStudio and the Spark tables will be shown in the Connections pane in the top right-hand corner, as in the figure below.



Now that the Spark tables are present, it is only a matter of time before we start working on them.
Let us say we are going to work with the table country; but first, cache the table country, that is, force the Spark table to be loaded into memory.



# Cache spark table country
tbl_cache(sc, 'country')
countries_list <- tbl(sc, 'country')
At this stage one can run analysis on the table contents, as in the sketch below.
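For example, a simple aggregation with dplyr verbs runs directly against the cached Spark table. This is only a minimal sketch: the continent and population column names are assumptions about the country table, not something shown in the original post.

# group the country table by an assumed continent column and sum an assumed population column
countries_list %>%
  group_by(continent) %>%
  summarise(total_population = sum(population)) %>%
  arrange(desc(total_population)) %>%
  collect()   # bring the small summary back into R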



Monday, June 25, 2018

Apache Spark - Work


 Spark work

Starting from where we left off last time, and assuming both the Hive and Spark configurations are ready to go, let us show all the tables in the Hive data warehouse.
Step 1: Create a Spark SQL context to read the Hive metastore
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
The result will be something like the image below.
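Since the screenshot is not reproduced here, roughly the same check can be run from the shell itself; a small sketch using the sqlContext just created:

// list the tables registered in the Hive metastore
sqlContext.sql("show tables").show()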
Step 2: Create a DataFrame from a Hive external table
The Hive tables are in front of us, so let us create a Spark DataFrame by querying the country table. I want to query only the countries whose population exceeds 200 million.
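The query itself appeared only as a screenshot in the original post, so the following is a sketch; the name and population column names are assumptions about the country table:

// countries whose population exceeds 200 million (assumed column names)
val bigCountries = sqlContext.sql("select name, population from country where population > 200000000")
bigCountries.show()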

Step 3: Create your own table
We will create three data frames and use them for multiple purposes, as sketched in the code further below.
data frame fruit1

data frame fruit2


join fruit1 and fruit2


union the three tables
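The data frames themselves were shown only as screenshots, so the code below is a minimal sketch under an assumed (id, name, diet) schema with made-up rows; it relies on the implicits import from Step 1:

// three small data frames built from local collections
val fruit1 = Seq((1, "apple", "low sugar"), (2, "banana", "high potassium")).toDF("id", "name", "diet")
val fruit2 = Seq((2, "banana", "high potassium"), (3, "mango", "high sugar")).toDF("id", "name", "diet")
val fruit3 = Seq((4, "orange", "vitamin C")).toDF("id", "name", "diet")

// join fruit1 with the ids present in fruit2 (a simple inner join on id)
fruit1.join(fruit2.select("id"), "id").show()

// union the three tables into one fruits data frame (union keeps duplicate rows)
val fruits = fruit1.union(fruit2).union(fruit3)
fruits.show()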

Step 4: Save the last table

i- Create a temporary table

fruits.createOrReplaceTempView("fruitsTable")

ii- Use a Hive statement to create the table and dump the data from your temp table.

sqlContext.sql("create table fruitsDeit as select * from fruitsTable");

This saves directly into the Hive metastore; to confirm, let us check back in Hive.

The table fruitsdiet is there.

iii- Save to HDFS

fruits.select("id", "name", "diet").write.save("/user/hduser/fruittable.parquet")

Take a look at the HDFS web interface.
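As an extra check (not in the original post), the saved Parquet file can also be read back from the Spark shell:

// read the Parquet output back from HDFS and display a few rows
val saved = sqlContext.read.parquet("/user/hduser/fruittable.parquet")
saved.show(5)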

Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own as a standalone cluster, but here we will be running it on Hadoop YARN. First, let us get the configuration part done.

1. Download and install the Spark binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:~$ sudo chown -R kui:hd spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration; that means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be present alongside the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
4. Configure the spark-env.sh file

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS (a directory Spark's event logs can be written to once event logging is enabled):

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs

6. Start the Spark shell (on YARN this is typically just spark-shell --master yarn)

It seems that our Spark is up, but let us take a closer look to see whether we can access the Apache Hive tables. Getting to the Hive tables is going to need a little bit of work; specifically, we need two things:
i- A Spark HiveContext
ii- A SQLContext to run SQL operations
To do that, enter the following code:

import org.apache.spark.SparkConf                                               
import org.apache.spark.sql.hive.HiveContext                                      
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")                                                            
val sqlContext = new HiveContext(sc)                                             
import sqlContext.implicits._   
Let's do just that.
Right now Spark is blind to the Hive tables: there is a reason it is not showing what is in Hive, and that is because it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
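Before applying the fix, a quick check from the shell makes the blindness visible; this is a sketch that reuses the sqlContext created above:

// list the table names Spark can currently see; the Hive warehouse tables are missing at this point
sqlContext.tableNames().foreach(println)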
7. To show the Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


sudo cp /usr/local/hive/conf/hive-site.xml  /usr/local/spark/conf
Check back in Apache Spark: this time the Hive tables show up.
It worked fine.
Good luck with the Spark configuration.
