Wednesday, July 11, 2018

How to connect R with Apache Spark

R interface 

Step 1. Install R-Base
We begin with the installation of the R base language by simply dropping a few lines into the terminal, as seen below:

sudo apt-get update
sudo apt-get install r-base r-base-dev

Step 2. Installing RStudio

sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.423-amd64.deb
sudo gdebi -n rstudio-xenial-1.1.423-amd64.deb
When done with R base and RStudio, in order to be able to work with SparkR one must do the following:
i. Make sure that SPARK_HOME is set in the environment (check it with Sys.getenv, set it with Sys.setenv)
ii. Load the SparkR library
iii. Initiate a sparkR.session

The three aforementioned steps are shown with code in RStudio below.

# setting up Spark from R

# install the relevant packages
install.packages("sparklyr")
library(sparklyr)
install.packages("dplyr")
library(dplyr)

# install a local Spark version for sparklyr to use
spark_install(version = "2.1.1")

# make sure SPARK_HOME is set in the environment
Sys.setenv(SPARK_HOME = '/usr/local/spark')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# initiate the Spark connection
sc <- spark_connect(master = "local")

# load the SparkR library
library(SparkR)


When these steps are done correctly, Spark will be connected to RStudio and the Spark tables will be shown in the Connections pane in the top right-hand corner, as in the figure below.



Now that the Spark tables are present, it is only a matter of time before we start working on them.
Let us say we are going to work with the table country; but first, cache the table country, that is, force the Spark table to be loaded into memory.



# Cache spark table country
tbl_cache(sc, 'country')
countries_list <- tbl(sc, 'country')
At this stage one can run analysis on the table contents, as in the sketch below.
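For example, a simple aggregation with dplyr verbs runs directly against the cached Spark table. This is only a minimal sketch: the continent and population column names are assumptions about the country table, not something shown in the original post.

# group the country table by an assumed continent column and sum an assumed population column
countries_list %>%
  group_by(continent) %>%
  summarise(total_population = sum(population)) %>%
  arrange(desc(total_population)) %>%
  collect()   # bring the small summary back into R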



Monday, June 25, 2018

Apache Spark - Work


 Spark work

Starting from where we left off last time, and assuming both the Hive and Spark configurations are ready to go, let us show all the tables in the Hive data warehouse.
Step 1: Create a Spark SQL context to read the Hive metastore
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
The result will be something like the image below.
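Since the screenshot is not reproduced here, roughly the same check can be run from the shell itself; a small sketch using the sqlContext just created:

// list the tables registered in the Hive metastore
sqlContext.sql("show tables").show()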
Step 2: Create a DataFrame from a Hive external table
The Hive tables are in front of us, so let us create a Spark DataFrame by querying the country table. I want to query only the countries whose population exceeds 200 million.
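The query itself appeared only as a screenshot in the original post, so the following is a sketch; the name and population column names are assumptions about the country table:

// countries whose population exceeds 200 million (assumed column names)
val bigCountries = sqlContext.sql("select name, population from country where population > 200000000")
bigCountries.show()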

Step 3: Create your own table
We will create three data frames and use them for multiple purposes, as sketched in the code further below.
data frame fruit1

data frame fruit2


join fruit1 and fruit2


union the three tables
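The data frames themselves were shown only as screenshots, so the code below is a minimal sketch under an assumed (id, name, diet) schema with made-up rows; it relies on the implicits import from Step 1:

// three small data frames built from local collections
val fruit1 = Seq((1, "apple", "low sugar"), (2, "banana", "high potassium")).toDF("id", "name", "diet")
val fruit2 = Seq((2, "banana", "high potassium"), (3, "mango", "high sugar")).toDF("id", "name", "diet")
val fruit3 = Seq((4, "orange", "vitamin C")).toDF("id", "name", "diet")

// join fruit1 with the ids present in fruit2 (a simple inner join on id)
fruit1.join(fruit2.select("id"), "id").show()

// union the three tables into one fruits data frame (union keeps duplicate rows)
val fruits = fruit1.union(fruit2).union(fruit3)
fruits.show()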

Step 4: Save the last table

i- Create a temporary table

fruits.createOrReplaceTempView("fruitsTable")

ii- Use a Hive statement to create the table and dump the data from your temp table.

sqlContext.sql("create table fruitsDeit as select * from fruitsTable");

This saves directly into the Hive metastore; to confirm, let us check back in Hive.

The table fruitsdiet is there.

iii- Save to HDFS

fruits.select("id", "name", "diet").write.save("/user/hduser/fruittable.parquet")

Take a look at the HDFS web interface.
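As an extra check (not in the original post), the saved Parquet file can also be read back from the Spark shell:

// read the Parquet output back from HDFS and display a few rows
val saved = sqlContext.read.parquet("/user/hduser/fruittable.parquet")
saved.show(5)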

Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own as a standalone cluster, but here we will be running it on Hadoop YARN. First, let us get the configuration part done.

1. Download and install the Spark binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:~$ sudo chown -R kui:hd spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration; that means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be present alongside the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
4. Configure the spark-env.sh file

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS (a directory Spark's event logs can be written to once event logging is enabled):

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs

6. Start the Spark shell (on YARN this is typically just spark-shell --master yarn)

It seems that our Spark is up, but let us take a closer look to see whether we can access the Apache Hive tables. Getting to the Hive tables is going to need a little bit of work; specifically, we need two things:
i- A Spark HiveContext
ii- A SQLContext to run SQL operations
To do that, enter the following code:

import org.apache.spark.SparkConf                                               
import org.apache.spark.sql.hive.HiveContext                                      
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")                                                            
val sqlContext = new HiveContext(sc)                                             
import sqlContext.implicits._   
Let's do just that.
Right now Spark is blind to the Hive tables: there is a reason it is not showing what is in Hive, and that is because it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
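Before applying the fix, a quick check from the shell makes the blindness visible; this is a sketch that reuses the sqlContext created above:

// list the table names Spark can currently see; the Hive warehouse tables are missing at this point
sqlContext.tableNames().foreach(println)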
7. To show the Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


sudo cp /usr/local/hive/conf/hive-site.xml  /usr/local/spark/conf
Check back in Apache Spark: this time the Hive tables show up.
It worked fine.
Good luck with the Spark configuration.
