What is Pegasus?
Pegasus is a workflow management system that helps scientists and engineers execute complex computational workflows. It maps a user’s abstract workflow onto available distributed resources, manages data, and handles execution failures, making it easier to run scientific applications on high-throughput computing (HTC) systems like HTCondor.
Accessing Pegasus on ACCESS
You can access a hosted version of Pegasus through ACCESS. You will need an existing ACCESS account.
- Go to https://support.access-ci.org/tools/pegasus.
- Click on “Local shell access” to get a terminal.
Setting up HTCondor Annex on Expanse
We will follow the documentation for HTCondor Annex, specifically the steps outlined in https://access-ci.atlassian.net/wiki/spaces/ACCESSdocumentation/pages/564887666/HTCondor+Annex.
1. Generate SSH Key
First, generate an SSH key specifically for the annex:
ssh-keygen -f ~/.ssh/annex
2. Configure SSH
Add the following configuration to your ~/.ssh/config file. This tells SSH to use the newly generated key for Expanse.
# Replace MYUSERNAME with your Expanse username
Host expanse.sdsc.edu *.expanse.sdsc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex
Permissions: Ensure your ~/.ssh/config file is readable and writable only by your user, otherwise SSH may refuse to use it:
chmod 600 ~/.ssh/config
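To confirm that the permissions were applied, you can print the file's octal mode. This is a minimal sketch: `file_mode` is a hypothetical helper name, and it assumes GNU `stat` as found on Linux (the flags differ on macOS/BSD).

```shell
# Print the octal permission bits of a file, e.g. 600.
# Assumes GNU coreutils stat; file_mode is an illustrative helper name.
file_mode() {
    stat -c '%a' "$1"
}
```

Running `file_mode ~/.ssh/config` after the `chmod` above should print `600`.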
3. Copy SSH Key to Expanse
Copy your public SSH key to Expanse. You will be prompted for your password and MFA code.
ssh-copy-id -i ~/.ssh/annex.pub MYUSERNAME@expanse.sdsc.edu
4. Create a Sample HTCondor Job
Before creating an annex, HTCondor requires a job to execute. Create a file named many_hostname.sub with the following content:
executable = /bin/hostname
output = out.$(Cluster).$(Process)
error = err.$(Cluster).$(Process)
log = log.$(Cluster)
# 1 core per task so the partitionable slot can split into many tasks
request_cpus = 1
request_memory = 512MB
request_disk = 100MB
# Keep these jobs on the annex
+MayUseAWS = False
# Change "zonca" to your annex name; it must match the name passed to htcondor annex create in step 6
requirements = (AnnexName == "zonca")
queue 128
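Since the AnnexName in the requirements line has to agree with the annex name used when the annex is created, one option is to generate the submit file from a shell variable instead of editing it by hand. The following is a minimal sketch under that assumption; `write_submit_file` is a hypothetical helper name, not part of HTCondor.

```shell
# Write the sample submit file with the annex name filled in.
# Usage: write_submit_file ANNEX_NAME [OUTPUT_FILE]
# write_submit_file is an illustrative helper, not an HTCondor command.
write_submit_file() {
    annex_name="$1"
    out_file="${2:-many_hostname.sub}"
    # \$ keeps the HTCondor macros $(Cluster) and $(Process) literal,
    # so the shell does not try to expand them.
    cat > "$out_file" <<EOF
executable = /bin/hostname
output = out.\$(Cluster).\$(Process)
error = err.\$(Cluster).\$(Process)
log = log.\$(Cluster)
request_cpus = 1
request_memory = 512MB
request_disk = 100MB
+MayUseAWS = False
requirements = (AnnexName == "$annex_name")
queue 128
EOF
}
```

For example, `write_submit_file "$USER"` writes many_hostname.sub targeting an annex named after your local username, which is the name used in the annex creation command later in this post.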
5. Submit the Sample Job
Submit the job using condor_submit:
condor_submit many_hostname.sub
6. Create the HTCondor Annex
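Now you can create the HTCondor annex. Set PROJECT_ID to the ID of your allocation on Expanse. The command below uses $USER as the annex name, so the AnnexName in the requirements line of your submit file must be the same value.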
export PROJECT_ID=YOUR_ALLOCATION_ID # Set the ID of your allocation on Expanse
htcondor annex create --nodes 1 --lifetime 3600 --project $PROJECT_ID $USER compute@expanse
Extending or Adding Resources to the HTCondor Annex
To extend the lifetime or add more nodes to an existing HTCondor annex, use the htcondor annex add command:
htcondor annex add --project $PROJECT_ID --nodes 1 --lifetime 3600 $USER compute@expanse
Monitoring and Output
To check the status of your HTCondor annex, use:
htcondor annex status $USER
During execution, you can also log in to Expanse and monitor the job using squeue:
squeue -u $USER
Once the job is completed, it will create many out.* files in your submission directory. Each of these files will contain the hostname of the machine where that specific job ran. Since we requested only 1 node for the annex in this example, all out.* files will likely contain the same hostname.
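A quick way to check this is to count how many jobs reported each hostname. The sketch below assumes the out.* naming from the submit file above; `summarize_hosts` is a hypothetical helper name.

```shell
# Count how many jobs ran on each host by reading the out.* files.
# Usage: summarize_hosts [DIRECTORY]   (defaults to the current directory)
# summarize_hosts is an illustrative helper name.
summarize_hosts() {
    # Each out.<Cluster>.<Process> file holds a single hostname line;
    # sort | uniq -c groups identical hostnames, and the final sort
    # lists the busiest host first.
    cat "${1:-.}"/out.* 2>/dev/null | sort | uniq -c | sort -rn
}
```

Calling `summarize_hosts` in the submission directory prints one line per distinct hostname, preceded by the number of jobs that ran there. With a single-node annex you would expect one line whose count equals the number of completed jobs.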
Running Jobs with Pegasus (Unsuccessful Attempt)
I attempted to configure Pegasus to run jobs through the HTCondor annex, but I was unable to get it to work properly. I found the process of configuring Pegasus surprisingly difficult. Here’s a gist of what I managed to achieve, though it did not result in a successful Pegasus workflow execution through the annex:
https://gist.github.com/zonca/94b99a5590c43eba2f47d3514b166c88