R interface
Step 1. Install R base
We begin by installing the base R language; simply run the following lines in a terminal:
sudo apt-get update
sudo apt-get install r-base r-base-dev
Step 2. Install RStudio
sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.423-amd64.deb
sudo gdebi -n rstudio-xenial-1.1.423-amd64.deb
i. Make sure that SPARK_HOME is set in the environment (check it with Sys.getenv)
ii. Load the sparklyr library
iii. Initiate a Spark session
The three steps above are carried out in RStudio with the code below.
# set up Spark
# install the relevant packages
install.packages("sparklyr")
library(sparklyr)
install.packages("dplyr")
library(dplyr)
# install a local Spark version for the Hadoop ecosystem
spark_install(version = "2.1.1")
# make sure SPARK_HOME is set in the environment
Sys.setenv(SPARK_HOME = '/usr/local/spark')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# initiate a Spark session
sc <- spark_connect(master = "local")
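As a quick sanity check that the session actually came up, sparklyr provides a couple of helpers (a minimal sketch):
# confirm the connection is live and see which Spark version it runs
spark_version(sc)        # reports the Spark version of the session
connection_is_open(sc)   # TRUE while the connection to Spark is open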
When these steps are done correctly, Spark is connected to RStudio and the Spark tables are shown under Connections in the top right-hand corner, as in the figure below.
Now that the Spark tables are present, we can start working with them.
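The example below assumes a Spark table named country already exists in the session. If it does not, one way to register it (purely illustrative; countries_df stands for whatever local data frame holds the country data) is to copy it into Spark:
# hypothetical: copy a local data frame into Spark as the table 'country'
country_tbl <- copy_to(sc, countries_df, "country", overwrite = TRUE)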
Let us say we are going to work with the table country. Before doing so, we cache it, that is, we force the Spark table to be loaded into memory:
# cache the Spark table country
tbl_cache(sc, 'country')
countries_list <- tbl(sc, 'country')
At this stage one can run analyses on the table contents.
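From here the cached table can be queried with ordinary dplyr verbs, which sparklyr translates into Spark SQL behind the scenes. As a sketch, assuming the country table has a continent column and a numeric population column (hypothetical column names for illustration):
# count countries and sum population per continent, then pull the
# result back into R as a local data frame with collect()
countries_list %>%
  group_by(continent) %>%
  summarise(n_countries = n(),
            total_population = sum(population, na.rm = TRUE)) %>%
  arrange(desc(n_countries)) %>%
  collect()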