Launch a shared dask cluster in Kubernetes alongside JupyterHub on Jetstream

Download source Contribute

Let’s assume we have already a Kubernetes deployment and have installed JupyterHub, see for example my previous tutorial on Jetstream. Now that users can login and access a Jupyter Notebook, we would also like to provide them more computing power for their interactive data exploration. The easiest way is through dask, we can launch a scheduler and any number of workers as containers inside Kubernetes so that users can leverage the computing power of many Jetstream instances at once.

There are 2 main strategies, we can give each user their own dask cluster with exclusive access and this would be more performant but cause quick spike of usage of the Kubernetes cluster, or just launch a shared cluster and give all users access to that.

In this tutorial we cover the second scenario, we’ll cover the first scenario in a following tutorial.

We will deploy first Jupyterhub through the Zero-to-JupyterHub guide, then launch via Helm a fixed size dask clusters and show how users can connect, submit distributed Python jobs and monitor their execution on the dashboard.

The configuration files mentioned in the tutorial are available in the Github repository zonca/jupyterhub-deploy-kubernetes-jetstream.

Deploy JupyterHub

First we start from Jupyterhub on Jetstream with Kubernetes at https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html

Optionally, for testing purposes, we can simplify the deployment by skipping permanent storage, if this is an option, see the relevant section below.

We want to install Jupyterhub in the pangeo namespace with the name jupyter, replace the helm install line in the tutorial with:

sudo helm install --name jupyter jupyterhub/jupyterhub -f config_jupyterhub_pangeo_helm.yaml --namespace pangeo

The pangeo configuration file is using a different single user image which has the right version of dask for this tutorial.

(Optional) Simplify deployment using ephemeral storage

Instead of installing and configuring rook, we can temporarily disable permanent storage to make the setup quicker and easier to maintain.

In the JupyterHub configuration yaml set:

hub:
   db:
     type: sqlite-memory

singleuser:
   storage:
      type: none

Now every time a user container is killed and restarted, all data are gone, this is good enough for testing purposes.

Configure Github authentication

Follow the instructions on the Zero-to-Jupyterhub documentation, at the end you should have in the YAML:

auth:
  type: github
  admin:
    access: true
    users: [zonca, otherusername]
  github:
    clientId: "xxxxxxxxxxxxxxxxxxxx"
    clientSecret: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    callbackUrl: "https://js-xxx-xxx.jetstream-cloud.org/hub/oauth_callback"

Test Jupyterhub

Connect to the master node with your browser at: https://js-xxx-xxx.jetstream-cloud.org Login with your Github credentials, you should get a Jupyter Notebook.

You can also check that your pod is running:

sudo kubectl get pods -n pangeo
NAME                                  READY     STATUS    RESTARTS   AGE
jupyter-zonca                         1/1       Running   0          2m
......other pods

Install Dask

We want to deploy a single dask cluster that all the users can submit jobs to.

Customize the dask_shared/dask_config.yaml file available in the repository, for testing purposes I set just 1 GB RAM and 1 CPU limits on each of 3 workers. We can change replicas of the workers to add more.

sudo helm install stable/dask --name=dask --namespace=pangeo -f dask_config.yaml

Then check that the dask instances are running:

$ sudo kubectl get pods --namespace pangeo
NAME                              READY     STATUS    RESTARTS   AGE
dask-jupyter-647bdc8c6d-mqhr4     1/1       Running   0          22m
dask-scheduler-5d98cbf54c-4rtdr   1/1       Running   0          22m
dask-worker-6457975f74-dqhsh      1/1       Running   0          22m
dask-worker-6457975f74-lpvk4      1/1       Running   0          22m
dask-worker-6457975f74-xzcmc      1/1       Running   0          22m
hub-7f75b59fc5-8c2pg              1/1       Running   0          6d
jupyter-zonca                     1/1       Running   0          10m
proxy-6bbf67f6bd-swt7f            2/2       Running   0          6d

Access the scheduler and launch a distributed job

kube-dns gives a name to each service and automatically propagates it to each pod, so we can connect by name

from dask.distributed import Client
client = Client("dask-scheduler:8786")
client

Now we can access the 3 workers that we launched before:

Client
Scheduler: tcp://dask-scheduler:8786
Dashboard: http://dask-scheduler:8787/status
Cluster
Workers: 3
Cores: 6
Memory: 12.43 GB

We can run an example computation with dask array:

import dask.array as da
x = da.random.random((20000, 20000), chunks=(2000, 2000)).persist()
x.sum().compute()

Access the Dask dashboard for monitoring job execution

We need to setup ingress so that a path points to the Dask dashboard instead of Jupyterhub,

Checkout the file dask_shared/dask_webui_ingress.yaml in the repository, it routes the path /dask to the dask-scheduler service.

Create the ingress resource with:

sudo kubectl create ingress -n pangeo -f dask_webui_ingress.yaml

All users can now access the dashboard at:

https://js-xxx-xxx.jetstream-cloud.org/dask/status

Make sure to use /dask/status/ and not only /dask. Currently this is not authenticated, so this address is publicly available. A simple way to hide it is to choose a custom name instead of /dask and edit the ingress accordingly with:

sudo kubectl edit ingress dask -n pangeo