1. Apache Spark
py4j
$ conda install pip
$ which pip
$ pip install py4j
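As an optional sanity check, you can confirm that py4j is importable from the Anaconda environment you just installed it into (the printed path should point inside your Anaconda directory):
$ python -c "import py4j; print(py4j.__file__)"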
Spark
$ cd $HOME
$ wget http://apache.mirror.cdnetworks.com/spark/spark-2.3.4/spark-2.3.4-bin-hadoop2.7.tgz
$ tar xvzf spark-2.3.4-bin-hadoop2.7.tgz
$ ln -s spark-2.3.4-bin-hadoop2.7 spark
$ mv spark-2.3.4-bin-hadoop2.7.tgz ./downloads/
$ cd $HOME/spark/conf
$ cp spark-env.sh.template spark-env.sh
$ nano spark-env.sh
Append the following lines at the end of the file:
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_WEBUI_PORT=9091
export HADOOP_CONF_DIR=/home/kjhov195/hadoop/etc/hadoop
$ nano ~/.bash_profile
Edit it so that it looks like the following:
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# Anaconda
export ANACONDA_HOME="/home/kjhov195/anaconda3"
# Java
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
# Hadoop
export HADOOP_HOME="/home/kjhov195/hadoop"
# Spark
export SPARK_HOME="/home/kjhov195/spark"
# PySpark
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# Path
export PATH=${ANACONDA_HOME}/bin:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:$PATH
Apply the updated environment variables:
$ source ~/.bash_profile
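To check that the Spark binaries on your PATH now resolve to this installation, an optional quick test is:
$ spark-submit --version
Note that because PYSPARK_DRIVER_PYTHON is set to jupyter (with the notebook option), running pyspark will launch a Jupyter Notebook as the driver instead of the plain interactive shell.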
2. Apache Hive
$ cd $HOME
$ wget https://archive.apache.org/dist/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
$ tar xvzf apache-hive-2.3.4-bin.tar.gz
$ ln -s apache-hive-2.3.4-bin hive
$ mv apache-hive-2.3.4-bin.tar.gz ./downloads/
$ cd $HOME
$ cd hive/conf
$ cp hive-env.sh.template hive-env.sh
$ nano hive-env.sh
The following line is commented out; find it and change it as shown below:
HADOOP_HOME=/home/kjhov195/hadoop
Next, create a new configuration file:
$ nano hive-site.xml
and add the following contents:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse/</value>
  </property>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>
Before running Hive, start Hadoop first.
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
$ hdfs dfs -mkdir -p /tmp/hive
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod 777 /tmp/hive
$ hdfs dfs -chmod g+w /user/hive
$ hdfs dfs -chmod g+w /user/hive/warehouse
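If you want to verify that the warehouse directories were created with the expected permissions, you can list them (optional):
$ hdfs dfs -ls /tmp
$ hdfs dfs -ls /user/hive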
$ cd $HOME/hive
$ ./bin/schematool -initSchema -dbType derby
$ cp $HOME/hive/conf/hive-site.xml $HOME/spark/conf/
$ ./bin/hive --service metastore &
If running jps shows output like the following, everything is up and running correctly.
21123 NameNode
21717 ResourceManager
21993 NodeManager
22745 RunJar
21305 DataNode
22830 Jps
21535 SecondaryNameNode
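As a final optional smoke test, you can run a trivial query through the Hive CLI to confirm that the metastore is reachable on thrift://localhost:9083; the exact output will depend on your environment, but a fresh install should at least list the default database:
$ hive -e "SHOW DATABASES;"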