Thursday, June 21, 2018

Apache Spark

Apache Spark can run on its own in standalone mode, but here we will be running it on Hadoop YARN. First, let's get the configuration part done.

1. Download and Install Spark Binaries


hduser@ubuntu:~$  wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$ tar -xvf spark-2.1.1-bin-hadoop2.7.tgz

hduser@ubuntu:~$  sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
hduser@ubuntu:~$ cd /usr/local
hduser@ubuntu:/usr/local$ sudo chown -R hduser:hadoop spark

2. Add the Spark binaries directory to your PATH

PATH=/usr/local/spark/bin:$PATH
3. For Apache Spark to communicate with the resource manager (YARN), it needs to know the details of your Hadoop configuration, which means the Hadoop configuration directory variable (HADOOP_CONF_DIR) has to be set along with the Spark variables.

hduser@ubuntu:~$ sudo vi $HOME/.bashrc

#SPARK_VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH
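
To make these take effect in the current terminal without logging out, simply re-source the file:

hduser@ubuntu:~$ source $HOME/.bashrc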
4. Configure the spark-env.sh file
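
Note that the Spark tarball does not ship a ready-made spark-env.sh; the conf directory only contains spark-env.sh.template, so one way to start is to copy the template first:

hduser@ubuntu:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh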

hduser@ubuntu:/usr/local/spark/conf$ sudo vi spark-env.sh

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_HOME=$HADOOP_INSTALL
5. Create the log directory in HDFS:

 hduser@ubuntu:~$ hdfs dfs -mkdir -p  /user/spark-logs
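
The directory by itself does nothing until Spark is told to write its event logs there. A minimal spark-defaults.conf sketch for that (assuming the default filesystem is HDFS, so hdfs:///user/spark-logs resolves to the directory we just created) would be:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark-logs
spark.history.fs.logDirectory    hdfs:///user/spark-logs

Like spark-env.sh, spark-defaults.conf lives in /usr/local/spark/conf and can be created from its shipped template.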

6. Start the Spark shell
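
Assuming the PATH and HADOOP_CONF_DIR set above are in effect, the shell can be launched against YARN with:

hduser@ubuntu:~$ spark-shell --master yarn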

It seems that our Spark is up, but let us take a closer look to see whether we can access Apache Hive tables.
Getting to the Hive tables is going to need a little bit of work; specifically we need two things:
i- a Spark HiveContext
ii- a SQLContext so we can run SQL operations
To do that, put in the following code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext

// point the application at YARN in client mode
val conf = new SparkConf().setAppName("Test").setMaster("yarn-client")

// HiveContext wraps the SparkContext (sc) that the shell already provides
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
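
With the HiveContext created, a quick sanity check is simply to ask it what tables it can see:

sqlContext.sql("show tables").show()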
Let's do just that.
Right now Spark is blind to the Hive tables. There is a reason it is not showing what is in Hive: it is missing one last thing.
Let us fix this, or give it a pair of glasses, as people would say.
7. To show Hive tables in Spark properly, copy hive-site.xml to the Spark conf directory:


hduser@ubuntu:~$ sudo cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf
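
Restart the Spark shell and re-run the snippet from step 6; a query like the one below (assuming your Hive metastore has at least the default database) should now list what Hive actually holds:

sqlContext.sql("show databases").show()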
Check back in Apache Spark: this time it works fine.
Good luck with your Spark configuration.
