Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own in standalone mode, but here we will be running it on Hadoop YARN. First, let's get the configuration part done.

1. Download and Install Spark Binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:/usr/local$ sudo chown -R hduser:hadoop spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration, which means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be set along with the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
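
To make these take effect in the current terminal without logging out, simply re-source the file:

hduser@ubuntu:~$ source $HOME/.bashrc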
4. Configure the spark-env.sh file
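
Note that the Spark tarball does not ship a ready-made spark-env.sh; the conf directory only contains spark-env.sh.template, so one way to start is to copy the template first:

hduser@ubuntu:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh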

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS:

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs
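
The directory by itself does nothing until Spark is told to write its event logs there. A minimal spark-defaults.conf sketch for that (assuming the default filesystem is HDFS, so hdfs:///user/spark-logs resolves to the directory we just created) would be:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark-logs
spark.history.fs.logDirectory    hdfs:///user/spark-logs

Like spark-env.sh, spark-defaults.conf lives in /usr/local/spark/conf and can be created from its shipped template.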

6. Start the Spark shell
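
Assuming the PATH and HADOOP_CONF_DIR set above are in effect, the shell can be launched against YARN with:

hduser@ubuntu:~$ spark-shell --master yarn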

It seems that our Spark is up, but let us take a closer look to see whether we can access Apache Hive tables.
Getting to the Hive tables is going to need a little bit of work; specifically we need two things:
i- a Spark HiveContext
ii- a SQLContext so we can run SQL operations
To do that, put in the following code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext

// point the application at YARN in client mode
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")

// HiveContext wraps the SparkContext (sc) that the shell already provides
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
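
With the HiveContext created, a quick sanity check is simply to ask it what tables it can see:

sqlContext.sql("show tables").show()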
Let's do just that.
Right now Spark is blind to the Hive tables. There is a reason it is not showing what is in Hive: it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
7. To show Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


hduser@ubuntu:~$ sudo cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf
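
Restart the Spark shell and re-run the snippet from step 6; a query like the one below (assuming your Hive metastore has at least the default database) should now list what Hive actually holds:

sqlContext.sql("show databases").show()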
Check back in Apache Spark: this time it works fine.
Good luck with your Spark configuration.
