Wednesday, July 11, 2018

How to connect R with Apache Spark

R interface 

Step 1. Install R-Base
We begin by installing the R base programming language, which takes just a couple of lines in the terminal, as seen below:

sudo apt-get update
sudo apt-get install r-base r-base-dev

Step 2. Install RStudio

sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.423-amd64.deb
sudo gdebi -n rstudio-xenial-1.1.423-amd64.deb
When done with R base and RStudio, in order to work with SparkR one must do the following:
i. Make sure that SPARK_HOME is set in the environment (check it with Sys.getenv)
ii. Load the SparkR library
iii. Initiate a sparkR.session

The three aforementioned steps are shown as code in RStudio below.

# setting up Spark

# install the relevant packages
install.packages("sparklyr")
library(sparklyr)
install.packages("dplyr")
library(dplyr)

# install a local Spark version into the Hadoop ecosystem
spark_install(version = "2.1.1")

# make sure SPARK_HOME is set in the environment
Sys.setenv(SPARK_HOME = '/usr/local/spark')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# initiate a Spark connection (via sparklyr) and load the SparkR library
sc <- spark_connect(master = "local")
library(SparkR)
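
Note that the code above opens the connection through sparklyr's spark_connect rather than a SparkR session. For completeness, a minimal sketch of step iii done SparkR-natively, assuming SPARK_HOME is set as above:

# SparkR-native equivalent of step iii: start a session against the local master
library(SparkR)
sparkR.session(master = "local")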


When these steps are done correctly, Spark will be connected to RStudio, and the Spark tables will be shown in the Connections pane in the top right-hand corner, as in the figure below.
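
Besides the pane, you can also double-check the connection from the console; a quick sketch, where the output will vary with your installation:

# Spark version of the active connection
spark_version(sc)
# tables currently registered with Spark
src_tbls(sc)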



Now that the Spark tables are present, it is only a matter of time before we start working on them.
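
This walkthrough assumes a Spark table named country already exists. If it does not, one way to create it, as a sketch with made-up columns, is to copy a local data frame into Spark:

# hypothetical example data; in practice the table may come from elsewhere
country_df <- data.frame(
  name       = c("France", "Brazil", "Japan"),
  continent  = c("Europe", "South America", "Asia"),
  population = c(67e6, 211e6, 126e6)
)
# copy the local data frame into Spark as the table 'country'
copy_to(sc, country_df, name = "country", overwrite = TRUE)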
Let's say we are going to work with the table country, but first we cache it, that is, we force the Spark table to be loaded into memory:



# Cache the Spark table country (force it into memory)
tbl_cache(sc, 'country')
# create a dplyr reference to the cached table
countries_list <- tbl(sc, 'country')
At this stage one can run analyses on the table contents, for example:
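
As a minimal sketch of such an analysis, using the hypothetical continent and population columns from the example data above:

# group the countries by continent and pull the summary back into R
countries_list %>%
  group_by(continent) %>%
  summarise(countries = n(),
            total_population = sum(population, na.rm = TRUE)) %>%
  collect()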


