Deploy a ChatGPT-like LLM on Jetstream with llama.cpp, tested on g3.medium

jetstream
llm
Author

Andrea Zonca

Published

March 11, 2026

This is a tested follow-up and updated standalone version of Deploy a ChatGPT-like LLM on Jetstream with llama.cpp. I ran the deployment end to end on a fresh Jetstream Ubuntu 24 g3.medium instance and folded the corrections, runtime notes, and performance measurements into the walkthrough below so you do not need to jump back and forth between posts.

If you want the original September 2025 version for reference, see: Deploy a ChatGPT-like LLM on Jetstream with llama.cpp

In this tutorial we deploy a Large Language Model (LLM) on Jetstream, run inference locally on the smallest currently available GPU node (g3.medium, 10 GB VRAM), then install a web chat interface (Open WebUI) and serve it with HTTPS using Caddy.

Before spinning up your own GPU, consider the managed Jetstream LLM inference service. It may be more cost‑ and time‑effective if you just need API access to standard models.

We will deploy a single quantized model: Meta Llama 3.1 8B Instruct Q3_K_M (GGUF). This quantized 8B model fits comfortably in the tested g3.medium instance.

Model choice & sizing

Jetstream GPU flavors (current key options):

Instance Type   Approx. GPU Memory (GB)
g3.medium       10
g3.large        20
g3.xl           40 (full A100)

We pick the quantized Llama 3.1 8B Instruct Q3_K_M variant (GGUF format). Its VRAM residency during inference is about 8 GB with default context settings, leaving some margin on g3.medium. Always keep a couple of GB free to avoid OOM errors when increasing context length or concurrency.

Ensure the model is an Instruct fine‑tuned variant so it responds well to chat prompts.
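As a rough sanity check on the sizing above, you can estimate the weight footprint from the parameter count and the quantization bit-width. A back-of-envelope sketch (the ~8.03 B parameter count is Llama 3.1 8B's published size; the ~4 bits per weight average for Q3_K_M is an approximation, since K-quants mix block sizes):

```python
# Rough GGUF weight-size estimate: parameters * bits-per-weight / 8.
n_params = 8.03e9        # Llama 3.1 8B, approximate
bits_per_weight = 4.0    # approximate average for Q3_K_M

weight_bytes = n_params * bits_per_weight / 8
print(f"~{weight_bytes / 1e9:.1f} GB of weights")  # ~4.0 GB
```

Runtime VRAM sits well above the raw weight size because of the CUDA context, compute buffers, and the KV cache, which is why the figures observed on the instance are closer to 6–9 GB.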

Create a Jetstream instance

Log in to Exosphere, request an Ubuntu 24 g3.medium instance (name it chat) and SSH into it using either your SSH key or the passphrase generated by Exosphere.

You will need the public hostname later for HTTPS, for example chat.xxx000000.projects.jetstream-cloud.org. Copy it from Exosphere on the instance details page under Credentials > Hostname. Do not rely on hostname -f on the VM, which may return only the internal hostname.

Load Miniforge

A centrally provided Miniforge module is available on Jetstream images. First initialize Lmod, then load Miniforge (repeat this in each new shell). The sections below each create their own Conda environment (one for the model server, one for the web UI).

source /etc/profile.d/lmod.sh
module load miniforge
conda init

After running conda init, reload your shell so conda is available: run exec bash -l (avoids logging out and back in).

In non-interactive shells (for example over SSH in a script, inside nohup, or in systemd), prefer conda run -n ENV ... instead of conda activate ENV. The latter depends on shell initialization and is less reliable outside an interactive login shell.

Serve the model with llama.cpp (OpenAI-compatible server)

We use llama.cpp via the llama-cpp-python package, which provides an OpenAI-style HTTP API (default port 8000) that Open WebUI can connect to.

Create an environment and install. Remember to initialize Lmod and then module load miniforge first in any new shell.

The last pip install step may take several minutes to compile llama.cpp from source, so please be patient.

source /etc/profile.d/lmod.sh
module load miniforge
conda create -y -n llama python=3.11
conda activate llama
conda install -y cmake ninja scikit-build-core huggingface_hub
module load nvhpc/24.7/nvhpc
# Enable CUDA acceleration with explicit compilers, arch, release build
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_BUILD_TYPE=Release" \
    conda run -n llama python -m pip install --no-cache-dir --no-build-isolation --force-reinstall "llama-cpp-python[server]==0.3.16"

As of March 11, 2026, 0.3.16 was the newest published llama-cpp-python release.

Download the quantized GGUF file (Q3_K_M variant) from the QuantFactory model page: https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF

Set these variables once:

export HF_REPO="QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF"
export MODEL="Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf"
mkdir -p ~/models
hf download "$HF_REPO" \
    "$MODEL" \
    --local-dir ~/models

If you are not authenticated to Hugging Face, downloads still work for this model but may be subject to lower rate limits.

Test run (Ctrl-C to stop):

source /etc/profile.d/lmod.sh
module load miniforge nvhpc/24.7/nvhpc
conda activate llama
python -m llama_cpp.server \
    --model "$HOME/models/$MODEL" \
    --chat_format llama-3 \
    --n_ctx 8192 \
    --n_gpu_layers -1 \
    --port 8000

The nvhpc/24.7/nvhpc module must still be loaded at runtime, not only during the build. Without it, llama_cpp may fail to import with an error like libcudart.so.12: cannot open shared object file.

--n_gpu_layers -1 tells llama.cpp to offload all model layers to the GPU (full GPU inference). Without this flag the default is CPU layers (n_gpu_layers=0), which results in only ~1 GB of VRAM being used and much slower generation. Full offload of this 8B Q3_K_M model plus context buffers should occupy roughly 8–9 GB VRAM at --n_ctx 8192 on first real requests. If it fails to start with an out‑of‑memory (OOM) error, you have a few mitigation options (apply one, then retry):

  • Lower context length: e.g. --n_ctx 4096 (largest single lever; roughly linear VRAM impact for KV cache).
  • Partially offload: replace --n_gpu_layers -1 with a number (e.g. --n_gpu_layers 20). Remaining layers will run on CPU (slower, but reduces VRAM need).
  • Use a lower-bit quantization (e.g. Q2_K) or a smaller model.

You can inspect VRAM usage with:

watch -n 2 nvidia-smi
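While the server is running, you can also send a quick chat-completion request from a second SSH session to confirm the API responds. A minimal standard-library sketch (assumes the server above is listening on port 8000; the send helper and the "local" model name are illustrative, since a single-model llama.cpp server serves whatever model it loaded):

```python
import json
from urllib import request

# Build an OpenAI-style chat-completion request for the local server.
payload = {
    "model": "local",  # illustrative; the single-model server ignores the name
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 32,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

def send(req: request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# print(send(req))  # uncomment while the server is running
```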

Quick note on the “KV cache”: During generation the model reuses previously computed attention Key and Value tensors (instead of recalculating them for each new token). These tensors are stored per layer and per processed token; as your prompt and conversation grow, the cache grows linearly with the number of tokens kept in context. That’s why idle VRAM (weights only) sits around ~6 GB and rises toward the ~8–9 GB figure only after longer prompts or chats. Reducing --n_ctx caps the maximum KV cache size; clearing history or restarting frees it.
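To make the “grows linearly” point concrete, here is a sketch of the KV-cache arithmetic for this model (assumes Llama 3.1 8B's published architecture: 32 layers, 8 grouped-query KV heads, head dimension 128, and a 16-bit cache, llama.cpp's default):

```python
# Per-token KV cache: 2 tensors (K and V) per layer, each n_kv_heads * head_dim.
n_layers = 32
n_kv_heads = 8       # grouped-query attention
head_dim = 128
bytes_per_value = 2  # fp16 cache

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
kv_cache_gib = bytes_per_token * 8192 / 2**30  # at --n_ctx 8192

print(f"{bytes_per_token // 1024} KiB per token, {kv_cache_gib:.1f} GiB at n_ctx=8192")
# 128 KiB per token, 1.0 GiB at n_ctx=8192
```

Under these assumptions, halving --n_ctx to 4096 caps the maximum cache at about half a GiB, which is why context length is the first knob to turn when short on VRAM.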

On the tested g3.medium setup (Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf, full GPU offload, --n_ctx 8192), short local chat-completion requests generated about 85–90 tokens/second after warm-up. In practice that feels very fast for interactive chat, closer to the “instant response” experience users expect from a non-reasoning assistant than to a slow step-by-step model. Treat it as an approximate reference point, not a guarantee: longer prompts, larger context, concurrent users, or different quantizations will reduce throughput.
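If you want to reproduce the throughput measurement yourself, chat-completion responses include an OpenAI-style usage block whose completion_tokens count you can divide by wall time. A minimal sketch (the numbers below are illustrative values in the observed range, not the actual measurements):

```python
def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Generation throughput from a chat-completion response's `usage` block."""
    return usage["completion_tokens"] / elapsed_s

# Illustrative numbers in the range observed on g3.medium:
usage = {"prompt_tokens": 24, "completion_tokens": 512, "total_tokens": 536}
print(round(tokens_per_second(usage, elapsed_s=6.0), 1))  # 85.3
```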

If the test run works, create a systemd service so it restarts automatically.

Using sudo to run your preferred text editor, create /etc/systemd/system/llama.service with the following contents:

[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=MODEL=Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
ExecStart=/bin/bash -lc "source /etc/profile.d/lmod.sh; module load nvhpc/24.7/nvhpc miniforge && conda run -n llama python -m llama_cpp.server --model $HOME/models/$MODEL --chat_format llama-3 --n_ctx 8192 --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llama
sudo systemctl start llama

Troubleshooting:

  • Logs: sudo journalctl -u llama -f
  • Status: sudo systemctl status llama
  • GPU usage: nvidia-smi

Configure the chat interface

The chat interface is provided by Open WebUI.

Create the environment. In a new shell remember to initialize Lmod and then module load miniforge first.

The open-webui install pulls in a very large dependency set and can take a while even on a fast connection, so expect this step to be noticeably slower than the earlier llama-cpp-python install.

source /etc/profile.d/lmod.sh
module load miniforge
conda create -y -n open-webui python=3.11
conda run -n open-webui python -m pip install open-webui
conda run -n open-webui open-webui serve --port 8080

If this starts with no error, kill it with Ctrl-C and create a service for it.

Using sudo to run your preferred text editor, create /etc/systemd/system/webui.service with the following contents:

[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=exouser
Group=exouser
WorkingDirectory=/home/exouser
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/lmod.sh; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target

Then enable and start:

sudo systemctl daemon-reload
sudo systemctl enable webui
sudo systemctl start webui

Optional one-liner to create both services

If you already created the Conda environments (llama and open-webui) and downloaded the model, you can create, enable, and start both systemd services in a single copy-paste. Adjust MODEL, N_CTX, USER, and NVHPC_MOD if needed before running:

: "${MODEL:?export MODEL (model filename) first}" ; N_CTX=8192 USER=exouser NVHPC_MOD=nvhpc/24.7/nvhpc ; sudo tee /etc/systemd/system/llama.service >/dev/null <<EOF && sudo tee /etc/systemd/system/webui.service >/dev/null <<EOF2 && sudo systemctl daemon-reload && sudo systemctl enable --now llama webui
[Unit]
Description=Llama.cpp OpenAI-compatible server
After=network.target

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=MODEL=${MODEL}
ExecStart=/bin/bash -lc "source /etc/profile.d/lmod.sh; module load $NVHPC_MOD miniforge && conda run -n llama python -m llama_cpp.server --model /home/$USER/models/$MODEL --chat_format llama-3 --n_ctx $N_CTX --n_gpu_layers -1 --port 8000"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
[Unit]
Description=Open Web UI serving
Wants=network-online.target
After=network-online.target llama.service
Requires=llama.service
PartOf=llama.service

[Service]
User=$USER
Group=$USER
WorkingDirectory=/home/$USER
Environment=OPENAI_API_BASE_URL=http://localhost:8000/v1
Environment=OPENAI_API_KEY=local-no-key
ExecStartPre=/bin/bash -lc 'for i in {1..600}; do /usr/bin/curl -sf http://localhost:8000/v1/models >/dev/null && exit 0; sleep 1; done; echo "llama not ready" >&2; exit 1'
ExecStart=/bin/bash -lc 'source /etc/profile.d/lmod.sh; module load miniforge; conda run -n open-webui open-webui serve --port 8080'
Restart=on-failure
RestartSec=5
TimeoutStartSec=600
Type=simple

[Install]
WantedBy=multi-user.target
EOF2

To later change context length, edit /etc/systemd/system/llama.service, modify --n_ctx, then run:

sudo systemctl daemon-reload
sudo systemctl restart llama

Configure web server for HTTPS

Finally we can use Caddy to serve the web interface with HTTPS.

Install Caddy. The version in the Ubuntu APT repositories is often outdated, so follow the official Ubuntu installation instructions. You can copy-paste all the lines at once.

Modify the Caddyfile:

sudo sensible-editor /etc/caddy/Caddyfile

to:

chat.xxx000000.projects.jetstream-cloud.org {
        reverse_proxy localhost:8080
}

Where chat is the instance name and xxx000000 is the allocation code. You can find the full hostname in Exosphere: open the instance details page, scroll to Credentials, and copy Hostname.

Then reload Caddy:

sudo systemctl reload caddy

Connect the model and test the chat interface

Point your browser to https://chat.xxx000000.projects.jetstream-cloud.org and you should see the chat interface.

Create an account, click the profile icon in the top right, then open Admin panel > Settings > Connections.

Once you create the first account, that user becomes the admin. Anyone else who signs up is a regular user and must be approved by the admin. This approval step is the only protection in this setup; an attacker could still leverage vulnerabilities in Open WebUI to gain access. For stronger security, use firewall rules to allow connections only from trusted IPs.

Under OpenAI API enter the URL http://localhost:8000/v1 and leave the API key empty.

Click Verify connection, then click Save.

Finally you can start chatting with the model.

What I verified on the tested deployment

On the tested g3.medium VM I was able to:

  • build llama-cpp-python==0.3.16 with CUDA support
  • download Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf
  • serve the model locally with llama_cpp.server
  • install and launch Open WebUI
  • expose the chat interface publicly with Caddy and HTTPS

The tested public URL was:

  • https://chat.cis230085.projects.jetstream-cloud.org