If you’re a researcher looking to run AI workloads but don’t know where to start, the San Diego Supercomputer Center (SDSC) at UC San Diego offers a remarkable set of cyberinfrastructure resources purpose-built for data-intensive, AI-enabled science. This guide walks you through each platform, what it’s best for, and exactly how to get access — whether you need GPUs for deep learning, a Kubernetes cluster for flexible computing, or a data platform to discover and collaborate on datasets.
This post is based on a talk by Mai H. Nguyen at the SMASH Lunch Seminar on May 26, 2026, with supplemental information and links for beginners.
What Is Cyberinfrastructure?
Cyberinfrastructure (CI) is the integrated set of computing resources, data storage, networking, software tools, and human expertise that supports research and innovation. Think of it as the technological backbone that enables complex, data-intensive work — from training large language models to analyzing genomic data.
SDSC, part of UC San Diego’s Halicioglu School of Data Science and Computing (HSDSC), develops and operates several CI platforms. Here are the five key resources covered in this guide:
| Platform | Best For | Accelerator Type | Access Method |
|---|---|---|---|
| Expanse | General HPC, heterogeneous computing | NVIDIA V100/A100/H100 | ACCESS allocation |
| Voyager | Deep learning at scale | Intel Gaudi2 | ACCESS or NAIRR Pilot |
| Cosmos | GPU-accelerated science, easier GPU porting | AMD MI300A APU | Request via consult@sdsc.edu |
| NRP (Nautilus) | Flexible K8s computing, Jupyter notebooks | Various (A100, H100, V100, H200) | CILogon signup |
| National Data Platform | Data discovery, collaboration, AI-ready datasets | N/A | CILogon signup |
1. Expanse — The Workhorse HPC Cluster
Expanse is SDSC’s flagship HPC system for heterogeneous computing. It supports a wide range of science and engineering workloads, from molecular dynamics to AI/ML training, and is the most versatile entry point for researchers who need traditional HPC with GPU capabilities.
Key Specs
- CPU nodes: 728 AMD EPYC 7742 (Rome) nodes, 128 cores each, 256 GB DRAM
- GPU nodes: 52 NVIDIA V100 GPU nodes (4 GPUs per node, NVLINK) — available via NAIRR
- PATh expansion: 112 AMD Milan CPU nodes + 8 A100 GPU nodes
- Expanse AI Resource (new!): 34 nodes with Intel Sapphire Rapids CPUs, 1 TB memory, 4 NVIDIA H100 GPUs per node, 6.4 TB NVMe
- Storage: 12 PB Lustre + 7 PB Ceph
- Interconnect: HDR InfiniBand, 100 Gb/s
Getting Started
- Get an ACCESS account: Visit https://identity.access-ci.org/new-user and create an account using your institutional credentials.
- Request an allocation: Go to https://allocations.access-ci.org/ and choose a project type:
  - Explore (400K credits, 1-page proposal, approved in ~1 business day): great for getting started, benchmarking, small classes
  - Discover (1.5M credits, 3-page proposal, any time): for modest research needs
  - Accelerate (3M credits, 10-page proposal, any time): mid-scale research
  - Maximize (largest, peer-reviewed, twice yearly): large-scale campaigns
- Log in: ssh your_username@expanse.sdsc.edu (requires 2FA with authenticator app)
- Try the Expanse User Portal: https://portal.expanse.sdsc.edu is a web interface for file management, job submission, and launching Jupyter notebooks, RStudio, or MATLAB.
- Trial accounts: If you just want to test things out, email consult@sdsc.edu for a trial account with 1,000 core-hours (approved within 1 business day).
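To make the tier choice concrete, here is a minimal Python sketch that picks the smallest ACCESS project type covering a given credit estimate. The credit limits are the figures quoted in this guide; the function name and the treatment of Maximize as "no fixed ceiling" are assumptions made for illustration, so check the ACCESS allocations site for current limits.

```python
# ACCESS project tiers as quoted in this guide (credits are upper limits).
# Maximize has no ceiling stated in the guide, so it is modeled as None.
ACCESS_TIERS = [
    ("Explore", 400_000),
    ("Discover", 1_500_000),
    ("Accelerate", 3_000_000),
    ("Maximize", None),
]


def smallest_tier(credits_needed: int) -> str:
    """Return the first tier whose credit limit covers the request."""
    if credits_needed < 0:
        raise ValueError("credits_needed must be non-negative")
    for name, limit in ACCESS_TIERS:
        if limit is None or credits_needed <= limit:
            return name
    # Not reachable: the last tier has no limit.
    raise AssertionError("no tier matched")


if __name__ == "__main__":
    for request in (50_000, 400_000, 2_000_000, 10_000_000):
        print(f"{request:>12,} credits -> {smallest_tier(request)}")
```

Starting with the smallest tier that fits keeps the proposal short; you can move up a tier later if benchmarking shows you need more.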
Software and Containers
Expanse uses a Spack-based software stack with environment modules (module load ...). Popular AI/ML packages (PyTorch, TensorFlow) are available. SDSC also provides Singularity/Apptainer containers for tools like AlphaFold2, PyTorch, TensorFlow, and more.
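Batch jobs on Expanse go through the Slurm scheduler, so a typical workflow combines module load lines with a container command inside a batch script. The sketch below only renders the script text in Python; it does not submit anything. The account, partition, module names, and container file shown in the usage section are placeholders or assumptions, and the directive list is deliberately minimal, so confirm the required directives and current names in the Expanse user guide.

```python
def render_batch_script(job_name, account, partition, modules, command,
                        gpus=1, hours=1):
    """Build the text of a Slurm batch script; nothing is submitted."""
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --time={hours:02d}:00:00",
        "",
        "module purge",
    ]
    # One "module load" line per requested environment module.
    lines += [f"module load {name}" for name in modules]
    lines += ["", command, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    script = render_batch_script(
        job_name="pytorch-test",
        account="abc123",                    # placeholder project account
        partition="gpu-shared",              # assumed partition name
        modules=["gpu", "singularitypro"],   # assumed module names
        # Placeholder container and training script.
        command="singularity exec --nv your_container.sif python3 train.py",
        hours=2,
    )
    print(script)
```

Save the printed text to a file and pass it to sbatch once the placeholders match your allocation.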
Useful Links
- Expanse overview: https://www.sdsc.edu/systems/expanse/index.html
- Expanse user guide: https://www.sdsc.edu/systems/expanse/user_guide.html
- Expanse 101 tutorial: https://hpc-training.sdsc.edu/expanse-101/
- Expanse training repo: https://github.com/sdsc-hpc-training-org/expanse-101
- ACCESS allocations: https://allocations.access-ci.org/
2. Voyager — Purpose-Built for Deep Learning
Voyager is an NSF-funded HPC system specifically designed for AI applications. Its unique feature is the Intel Gaudi2 accelerator — a chip purpose-built for deep learning training and inference, different from traditional NVIDIA GPUs but with easy PyTorch migration.
Key Specs
- 42 training nodes, each with 8 Intel Gaudi2 accelerators (336 total)
- 36 Intel x86 compute nodes for data processing
- 400 GbE RoCE interconnect with all-to-all networking within nodes
- Storage: 3 PB Ceph + 324 TB home
- Memory: 512 GB per training node, 96 GB HBM2E per Gaudi2 accelerator
Why Intel Gaudi?
If you’ve been using NVIDIA GPUs, migrating to Intel Gaudi is straightforward:
- PyTorch is natively integrated with the SynapseAI software stack — minimal code changes needed
- Easy GPU migration: Import habana_frameworks in your PyTorch code and you’re largely set
- Hugging Face support: Use the Optimum Habana library to run Transformers and Diffusers models with minimal modification
- Advanced users can write custom kernels if needed
- Jupyter notebooks are also available on Voyager
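As a rough illustration of what "minimal code changes" can look like, the sketch below selects the hpu device when the Intel Gaudi PyTorch bridge (the habana_frameworks package) is installed and falls back to the CPU otherwise. The helper names pick_device and train_step are hypothetical. The mark_step calls follow the pattern Intel documents for lazy-mode execution; whether you need them depends on the software version and execution mode on Voyager, so treat this as a starting point.

```python
import importlib.util


def pick_device() -> str:
    """Return "hpu" if the Intel Gaudi PyTorch bridge is installed, else "cpu"."""
    if importlib.util.find_spec("habana_frameworks") is not None:
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core  # noqa: F401
        return "hpu"
    return "cpu"


def train_step(model, batch, targets, loss_fn, optimizer, device: str) -> float:
    """Run one optimization step on the given device and return the loss."""
    mark_step = None
    if device == "hpu":
        import habana_frameworks.torch.core as htcore
        mark_step = htcore.mark_step  # flushes the accumulated lazy-mode graph

    optimizer.zero_grad()
    loss = loss_fn(model(batch.to(device)), targets.to(device))
    loss.backward()
    if mark_step:
        mark_step()
    optimizer.step()
    if mark_step:
        mark_step()
    return loss.item()


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The rest of a training script stays ordinary PyTorch: move the model with model.to(device) and call train_step inside your usual loop.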
Getting Started
- Request access via:
  - NAIRR Pilot: https://nairrpilot.org/opportunities/allocations (submit a 3-page proposal, reviewed monthly). Open to US-based researchers at academic institutions, nonprofits, and startups with federal grants.
  - ACCESS: https://allocations.access-ci.org/ (Voyager is also available through ACCESS allocations).
- Log in: ssh login.voyager.sdsc.edu (SSH key required; send your public key to consult@sdsc.edu)
- Run jobs: Voyager uses Kubernetes, not Slurm. You submit workloads as Kubernetes pods using YAML files. Example (the pod and container names are placeholders, added because Kubernetes requires them):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gaudi-smi-test        # placeholder pod name
    spec:
      containers:
        - name: gaudi-container   # placeholder container name
          image: vault.habana.ai/gaudi-docker/1.10.0/ubuntu22:1.8.0-2
          command: ["hl-smi"]     # print accelerator status
          resources:
            limits:
              habana.ai/gaudi: 1  # number of Gaudi accelerators
              hugepages-2Mi: 3800Mi
              memory: 32G
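If you launch many similar pods, it can help to generate the manifest rather than edit YAML by hand. The sketch below builds the same pod as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The function name, the pod name, and the restartPolicy setting are additions for illustration and are not part of the original example; the image and resource values are copied from it.

```python
import json


def gaudi_pod_manifest(name: str, image: str, gaudi_count: int = 1) -> dict:
    """Return a Pod manifest that runs hl-smi on the requested accelerators."""
    if gaudi_count < 1:
        raise ValueError("gaudi_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Added for illustration: do not restart after hl-smi exits.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["hl-smi"],
                    "resources": {
                        "limits": {
                            "habana.ai/gaudi": gaudi_count,
                            "hugepages-2Mi": "3800Mi",
                            "memory": "32G",
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gaudi_pod_manifest(
        "gaudi-smi-test",  # hypothetical pod name
        "vault.habana.ai/gaudi-docker/1.10.0/ubuntu22:1.8.0-2",
    )
    print(json.dumps(manifest, indent=2))
```

Redirect the output to a file such as pod.json and submit it with kubectl apply -f pod.json.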
Sample Applications
Researchers have already run diverse AI workloads on Voyager: diffusion models for cosmology super-resolution, BioBERT for biomedical text, U-Net for cardiac imaging, graph neural networks for high-energy physics, LLM fine-tuning for epilepsy, and more.
Useful Links
- Voyager overview: https://www.sdsc.edu/systems/voyager/index.html
- Voyager user guide: https://www.sdsc.edu/systems/voyager/user_guide.html
- Reference models and tutorials: https://github.com/javierhndev/Voyager-Reference-Models
- Intel Gaudi documentation: https://docs.habana.ai/
- PyTorch on Intel Gaudi: https://docs.habana.ai/en/latest/PyTorch/index.html
- Optimum Habana (Hugging Face on Gaudi): https://github.com/huggingface/optimum-habana
- Voyager 101 webinar (Apr 2025): https://github.com/sdsc-hpc-training-org/advanced-computing-webinars/tree/main/Apr-08-2025-Porting-PyTorch-to-Voyager
- NAIRR Pilot allocations: https://nairrpilot.org/opportunities/allocations
3. Cosmos — Democratizing GPU Acceleration
Cosmos is SDSC’s newest testbed system, built around the AMD Instinct MI300A APU — a chip that integrates both CPU and GPU on a single package with unified memory. This design eliminates the traditional hurdle of copying data between CPU and GPU memory, making GPU acceleration dramatically easier to adopt.
Key Specs
- 42 nodes, each with 4 AMD MI300A APUs (168 APUs total)
- Unified memory: 128 GB HBM3 per APU, shared coherently between CPU and GPU (5.3 TB/s peak memory bandwidth)
- 24 EPYC Zen 4 CPU cores + 228 GPU compute units per APU
- HPE Slingshot interconnect with AMD Infinity xGMI between sockets
- VAST flash storage: ~500 TB, high IOPS
- Ceph capacity storage: 4.9 PB
- 100% liquid cooled HPE Cray EX system
Why the APU Matters for Beginners
Traditional GPU programming requires managing two separate memory spaces (CPU host memory and GPU device memory) and explicitly copying data between them. The MI300A APU’s unified memory means:
- No host/device data copies — CPU and GPU share the same memory
- Incremental porting approach — start with CPU code, gradually offload compute-intensive parts to GPU
- “No code left behind” — many applications that were never ported to GPUs can now be accelerated with minimal effort
Getting Started
- Cosmos is currently in testbed phase. Access is by project approval, not open allocation yet.
- Email consult@sdsc.edu with a summary of your project. The Cosmos team evaluates suitability and sets up a follow-up call.
- Log in: ssh your_username@login01.cosmos.sdsc.edu (SSH key required)
- Jobs use Slurm, the familiar batch scheduling system: sbatch, srun, etc.
Cosmos uses ROCm (AMD’s GPU computing platform) and provides containerized environments with AI/ML frameworks pre-installed.
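A quick sanity check inside a container is to ask PyTorch which backend it was built for. ROCm builds of PyTorch expose AMD GPUs through the usual torch.cuda interface and set torch.version.hip, so the same check works on NVIDIA and AMD systems. The helper below is a hypothetical convenience function, and it makes no assumption about which containers Cosmos actually ships.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Classify the local PyTorch install as rocm, cuda, cpu-only, or missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    if not torch.cuda.is_available():
        return "cpu-only"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


if __name__ == "__main__":
    print("PyTorch backend:", describe_torch_backend())
```

On a ROCm build with a visible GPU, existing code that targets the "cuda" device string generally runs unchanged.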
Useful Links
- Cosmos overview: https://www.sdsc.edu/systems/cosmos/index.html
- Cosmos user guide: https://www.sdsc.edu/systems/cosmos/user_guide.html
- AMD CDNA 3 architecture white paper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
- NSF award announcement: https://www.sdsc.edu/news/2024/PR20240712_Cosmos_award.html
4. National Research Platform (NRP) — Federated, Flexible, and Free
The National Research Platform is a partnership of more than 80 institutions across the US and globally, led by UC San Diego, University of Nebraska-Lincoln, and the Massachusetts Green High Performance Computing Center. Its primary computing resource is the Nautilus cluster — a massive Kubernetes-based distributed system spanning 4 continents.
What Makes NRP Different
- No allocation proposals needed — sign up with your institutional credentials and start using it
- Kubernetes-native — flexible, container-based computing
- Diverse hardware — A100, H100, H200, V100 GPUs and more, contributed by partner institutions
- Persistent storage — CephFS, CvmFS, S3, or bring your own
- Built-in services — JupyterHub, GitLab with CI/CD runners, Nextcloud, Overleaf, and more
Three Ways to Use NRP
JupyterHub — The easiest way. Log in through your browser, choose your hardware, and start running notebooks. No Kubernetes knowledge required. Visit the NRP JupyterHub and authenticate with your institutional credentials.
Coder — A JupyterHub-like experience with a full VS Code environment in the browser. Also requires no Kubernetes knowledge.
kubectl — For maximum control. Define pods, jobs, and deployments with YAML files, specifying CPU, GPU, memory, and storage requirements. Requires basic Kubernetes knowledge.
LLM Inference Service
NRP provides free, authenticated access to large language models — a fantastic resource for researchers who need LLM capabilities without paying API costs:
- Web chat interface: https://librechat.nrp-nautilus.io or https://nrp-llm-chat.nrp-nautilus.io
- API access (OpenAI-compatible): Get a token at https://nrp.ai/llmtoken/, then use the OpenAI Python client:

    from openai import OpenAI

    # Point the standard OpenAI client at the NRP endpoint.
    client = OpenAI(
        api_key="your-nrp-token",
        base_url="https://ellm.nrp-nautilus.io/v1",
    )

    # Send a single-turn chat request and print the reply.
    completion = client.chat.completions.create(
        model="gpt-oss",
        messages=[
            {"role": "user", "content": "Explain quantum computing in one paragraph."}
        ],
    )
    print(completion.choices[0].message.content)

Getting Started
- Sign up: Go to https://nrp.ai/get-access and log in with your institutional credentials via CILogon
- Get added to a namespace: contact your research supervisor, or request to be promoted to admin to create your own namespace
- Install kubectl and download your kubeconfig from the portal
- Or just use JupyterHub for the simplest experience
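For the kubectl route, a batch Job is often a better fit than a bare pod because it runs to completion and then stops. The sketch below assembles a Job manifest with matching CPU, memory, and GPU requests and limits, and prints it as JSON for kubectl. The function name, job name, and image are hypothetical placeholders; nvidia.com/gpu is the standard Kubernetes resource name for NVIDIA GPUs. Check the NRP documentation for namespace policies and resource limits before submitting.

```python
import json


def gpu_job_manifest(name, image, command, gpus=1, cpus=2, memory="8Gi"):
    """Return a Kubernetes Job manifest with equal requests and limits."""
    resources = {
        "cpu": str(cpus),
        "memory": memory,
        "nvidia.com/gpu": str(gpus),
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": list(command),
                            "resources": {
                                "requests": dict(resources),
                                "limits": dict(resources),
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    job = gpu_job_manifest(
        "gpu-check",                      # hypothetical job name
        "your-registry/your-image:tag",   # placeholder image
        ["nvidia-smi"],
    )
    print(json.dumps(job, indent=2))
```

Write the output to a file and run kubectl apply -f on it in your namespace, then follow progress with kubectl get jobs.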
Useful Links
- NRP main site: https://nrp.ai/
- Getting started: https://nrp.ai/documentation/userdocs/start/getting-started
- Get access: https://nrp.ai/get-access
- Full documentation: https://docs.nationalresearchplatform.org/
- Using Nautilus (3 methods): https://nrp.ai/documentation/userdocs/start/using-nautilus/
- LLM managed service: https://nrp.ai/documentation/userdocs/ai/llm-managed/
- LLM API access: https://nrp.ai/documentation/userdocs/ai/llm-managed/api-access/
- Dashboard: https://dash.nrp-nautilus.io/
5. National Data Platform (NDP) — Data Discovery and Collaboration
The National Data Platform is a federated ecosystem for discovering, sharing, and computing on data. It’s not a compute cluster — it’s a platform that connects data, services, AI tools, and computing resources in one place.
Key Features
- Data Catalog — Search across registered datasets from multiple institutions using substring search, conceptual search (powered by Open Knowledge Networks), spatial-temporal search, and filtered search by organization
- Collaborative Workspace — Web-based interface where multiple users can share resources, tools, and data in real time
- AI-Ready Data — Structured, curated datasets optimized for AI projects
- Compute Integration — Connect to HPC or cloud resources for data processing and ML training
- Education Hub — Hands-on modules and classroom resources for courses that need computational tools and AI-ready data
The NDP Workflow
- Discover — Find data and digital assets in the catalog
- Collect — Add additional assets and workflow inputs
- Collaborate — Create a shared workspace with collaborators
- Compute — Execute workflows on HPC or cloud
- Store/Export — Save final data products
- Extend — Add endpoint services as needed
Getting Started
- Go to https://nationaldataplatform.org/
- Click “Log in/Register” → Select “Sign in with CI Logon” → Choose “University of California, San Diego” as your Identity Provider → Log in with your UCSD Active Directory credentials
- Browse the catalog or set up a workspace
Useful Links
- NDP main site: https://nationaldataplatform.org/
- NDP documentation: https://nationaldataplatform.org/documentation/
- Data catalog: https://nationaldataplatform.org/documentation/ndp-catalog/
- Set up workspace tutorial: https://nationaldataplatform.org/documentation/quick-start/set-up-workspace/
Which Platform Should You Choose?
Here’s a simple decision guide:
- Need traditional HPC with GPU support? → Start with Expanse (ACCESS Explore allocation, approved in ~1 day)
- Training large deep learning models? → Voyager (Intel Gaudi2, purpose-built for AI)
- Have CPU-only code you want to accelerate with GPUs? → Cosmos (unified memory APUs make porting easy)
- Want free, flexible computing without writing proposals? → NRP Nautilus (sign up and go)
- Need to find and collaborate on datasets? → National Data Platform
- Just want to try things out? → NRP JupyterHub (fastest path from zero to running code) or Expanse trial account (email consult@sdsc.edu)
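The decision guide above can be written down as a small lookup table, which is handy if you want to embed it in onboarding scripts or lab documentation. The keys are hypothetical short labels invented for this sketch; the recommendations are the ones listed in the guide.

```python
# Short labels (hypothetical) mapped to the platforms recommended above.
DECISION_GUIDE = {
    "hpc-with-gpus": "Expanse",
    "deep-learning": "Voyager",
    "port-cpu-code": "Cosmos",
    "no-proposal": "NRP Nautilus",
    "datasets": "National Data Platform",
    "try-it-out": "NRP JupyterHub or an Expanse trial account",
}


def recommend(need: str) -> str:
    """Return the platform suggested for a given need label."""
    try:
        return DECISION_GUIDE[need]
    except KeyError:
        options = ", ".join(sorted(DECISION_GUIDE))
        raise ValueError(f"unknown need {need!r}; choose one of: {options}") from None


if __name__ == "__main__":
    for need in DECISION_GUIDE:
        print(f"{need:>15} -> {recommend(need)}")
```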
SDSC Training Resources
SDSC offers extensive training programs to help you get up to speed:
- SDSC HPC/CI Training: https://hpc-training.sdsc.edu/ (tutorials for Expanse, Voyager, and more)
- HPC Basic Skills: https://hpc-training.sdsc.edu/basic_skills/ (Linux, bash, batch computing fundamentals)
- COMPLECS: Non-programming skills for using supercomputers (parallel computing concepts, security, data management)
- CIML Summer Institute: Best practices for machine learning on HPC systems
- HPC & Data Science Summer Institute: Week-long introductory workshop
- Advanced HPC/CI Webinars: https://www.sdsc.edu/education/training-programs/index.html (ongoing webinar series on AI/ML, performance analysis, and visualization)
- Expanse 101 Tutorial: https://hpc-training.sdsc.edu/expanse-101/
Getting Help
For any SDSC resource, email consult@sdsc.edu — the HPC consulting team can help you choose the right platform, get access, and optimize your workflows.
This post is based on the SMASH Lunch Talk “Cyberinfrastructure for AI-Enabled Research at SDSC” by Mai H. Nguyen (SDSC), presented May 26, 2026. The original slides are available on the Trello card for this project. Thanks to Mai for permission to create this blog post.