Deploy Hadoop on Kubernetes on Jetstream

kubernetes
jetstream
Published: October 12, 2021

We are deploying the good old Hadoop on top of Kubernetes on Jetstream. Don’t ask why.

As usual, we start with a full-fledged Kubernetes deployment on Jetstream (1), deployed via Kubespray.

Deploy Hadoop via Helm

Fortunately, there is a Helm chart that deploys all the Hadoop components. It has been deprecated since November 2020, but it still works fine on Kubernetes 1.19.7.

Clone the usual repository with gh:

gh repo clone zonca/jupyterhub-deploy-kubernetes-jetstream
cd jupyterhub-deploy-kubernetes-jetstream/hadoop/

Verify the configuration in stable_hadoop_values.yaml. I’m currently keeping it simple, so there is no persistence.

Install Hadoop via Helm:

bash install_hadoop.sh
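The contents of install_hadoop.sh are not reproduced in this post. As a rough sketch of what such a script might do, the function below assumes Helm 3, the deprecated stable/hadoop chart from the archived Helm stable repository, and a release named hadoop. The release name is a guess based on the hadoop-hadoop-* pod names, so check the actual script before relying on any of this.

```shell
# Hypothetical equivalent of install_hadoop.sh; the real script may differ.
# Assumptions: Helm 3, chart "stable/hadoop", release name "hadoop".
install_hadoop_sketch() {
    # The archived stable charts are served from charts.helm.sh
    helm repo add stable https://charts.helm.sh/stable
    helm repo update
    # Apply the settings from the values file in the current directory
    helm install hadoop stable/hadoop --values stable_hadoop_values.yaml
}
```

Calling install_hadoop_sketch from the hadoop/ folder would then install the chart with the values file from that folder.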

Once the pods are running, you should see:

> kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
hadoop-hadoop-hdfs-dn-0   1/1     Running   0          144m
hadoop-hadoop-hdfs-nn-0   1/1     Running   0          144m
hadoop-hadoop-yarn-nm-0   1/1     Running   0          144m
hadoop-hadoop-yarn-rm-0   1/1     Running   0          144m

Launch a test job

Get a terminal on the YARN node manager:

bash login_yarn.sh
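I have not copied login_yarn.sh here, but a script like this most likely wraps kubectl exec. A minimal sketch, assuming the node manager pod is named as in the listing above and that the image ships /bin/bash:

```shell
# Hypothetical equivalent of login_yarn.sh; the real script may differ.
# The pod name is taken from the `kubectl get pods` listing above.
login_yarn_sketch() {
    # -it attaches an interactive terminal to the container
    kubectl exec -it hadoop-hadoop-yarn-nm-0 -- /bin/bash
}
```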

You now have access to the Hadoop 2.9.0 cluster. Launch a test MapReduce job to compute pi:

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar pi 16 1000
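The two arguments of the pi example are the number of map tasks and the number of samples per map, so the command above uses 16 maps with 1000 samples each. As a sketch, the same job could be launched from the local machine without opening a terminal first. This assumes the container's working directory is the Hadoop install directory (which the relative paths above suggest) and uses a hypothetical helper named run_pi:

```shell
# Sketch: run the pi example without an interactive shell.
# Assumes the container's working directory is the Hadoop install directory,
# so the relative paths used above resolve the same way.
run_pi() {
    n_maps="${1:-16}"       # number of map tasks
    n_samples="${2:-1000}"  # samples computed by each map task
    echo "Total samples: $((n_maps * n_samples))"
    kubectl exec hadoop-hadoop-yarn-nm-0 -- \
        bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar \
        pi "$n_maps" "$n_samples"
}
```

For example, run_pi 32 10000 would request 32 maps with 10000 samples each; more samples should give a better estimate at the cost of a longer run.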

Access the YARN Dashboard

You can also expose the YARN dashboard from the cluster to your local machine:

bash expose_yarn.sh

Connect locally to port 8088 to check the status of the jobs.
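I am not reproducing expose_yarn.sh here; one common way to get this behavior is kubectl port-forward, which listens only on 127.0.0.1 unless told otherwise. The sketch below assumes that approach and the resource manager pod name from the earlier listing. The second function queries the ResourceManager REST API through the forwarded port as an alternative to the web page.

```shell
# Hypothetical equivalent of expose_yarn.sh; the real script may differ.
# kubectl port-forward listens on 127.0.0.1 unless --address is given.
forward_yarn_dashboard() {
    kubectl port-forward hadoop-hadoop-yarn-rm-0 8088:8088
}

# With the forward running in another terminal, query the ResourceManager
# REST API for the list of applications (JSON output).
list_yarn_apps() {
    curl -s http://localhost:8088/ws/v1/cluster/apps
}
```

Run forward_yarn_dashboard in one terminal and list_yarn_apps in another; stopping the first with Ctrl-C closes the tunnel.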

Make sure this port is never exposed publicly. I learned the hard way that botnets scan the internet and compromise exposed YARN services for crypto-mining; see this article for details.
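One quick sanity check is to look for Kubernetes Services of type NodePort or LoadBalancer, since those are reachable from outside the cluster. The helper below is a sketch with a hypothetical name. It does not cover other exposure paths such as Ingress resources, hostPort settings, or firewall rules on the instances, so treat an empty result as a good sign rather than a guarantee.

```shell
# List Services whose type makes them reachable from outside the cluster.
# With --all-namespaces the columns are: NAMESPACE NAME TYPE CLUSTER-IP ...
list_exposed_services() {
    kubectl get services --all-namespaces --no-headers |
        awk '$3 == "NodePort" || $3 == "LoadBalancer" { print $1 "/" $2, $3 }'
}
```

If list_exposed_services prints anything that maps to port 8088, the dashboard may be reachable from outside the cluster.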