Please watch the Introduction to Slurm Tutorials before running jobs.

The system scheduler is currently configured as one job per node.

Running Your Jobs

(Quick start) Starting an interactive session using srun

To get started quickly, from a login node:

[username@l001 ~]$ srun --pty bash -i
[username@c001 ~]$

You will notice that the host has changed from l001 to c001, indicating that you are now on a compute node and ready to do HPC work. Unless you specify a longer run time, your job will expire after an hour. Keep in mind that interactive time is charged against your account, and that if INCLINE is running many jobs, you may have to wait a while before your interactive session becomes available.
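As a sketch, you can request a longer session with explicit options (the partition name, node count, and time limit here are illustrative; run sinfo to see what is actually available on INCLINE):

```shell
# Request one node from the "compute" partition for two hours,
# then start an interactive shell on it.
# (Partition name and limits are examples -- check `sinfo` first.)
srun -p compute -N 1 -t 2:00:00 --pty bash -i
```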

For more options and examples on how to use srun to run an interactive job, see man srun.

Starting an independent interactive session using salloc

This command allocates a node, or collection of nodes, for your use. Basic usage:

[username@l001 ~]$ salloc
salloc: Granted job allocation 888
[username@l001 ~]$

Notice that you are still on the login node, even though the job is now running. Use squeue to determine what node your job is running on:

[username@l001 ~]$ squeue
               888   compute interact username  R       1:20      1 c022

Under NODELIST, it appears that your job is running on compute node c022. You should now have permission to ssh into this node directly:

[username@l001 ~]$ ssh c022

You should receive the usual INCLINE welcome message, and then the prompt

[username@c022 ~]$ 

indicating you are now on c022. You can now run jobs as you normally would. So why use salloc? An ssh connection allows you to perform port forwarding, which is useful if you want to use Jupyter notebooks.
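As a minimal sketch of that workflow (the node name c022 and port 8888 are examples, not fixed values), you could forward a Jupyter port from the allocated node back through the login node:

```shell
# On the login node: forward local port 8888 to port 8888 on the
# allocated compute node (node name and port are examples).
ssh -L 8888:localhost:8888 c022

# Then, on the compute node, start Jupyter without opening a browser:
jupyter notebook --no-browser --port=8888
```

From your own workstation you would typically chain a second ssh -L tunnel through the login node so the notebook is reachable in your local browser.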

(Recommended) Starting a batch job using sbatch

Interactive jobs are good for testing and development, but production-type jobs should be submitted using the sbatch command with a run script. This places your job in the scheduler's queue, and it will start automatically as soon as resources are available.

Suppose you have a code stored in /home/username/my-big-code that you need to run on four nodes with 512 processors, and you expect it to take about 8 hours to complete. Begin by creating the following script (saved, for example, as myjob.sh):


#!/bin/bash

# The following are bash comments, but will be interpreted by Slurm as parameters.

#SBATCH -J MyBigJob                                   # job name
#SBATCH -o /mmfs1/home/username/output_%j_stdout      # print the job output to this file, where %j will be the job ID
#SBATCH -N 4                                          # run on 4 nodes 
#SBATCH -n 512                                        # run with 512 MPI tasks
#SBATCH -t 8:00:00                                    # run for 8 hours

# Make sure to load/unload any modules that are needed during runtime. For instance, if you need mvapich instead of openmpi:
module swap openmpi4 mvapich2/2.3.4

# Now perform the actual run. We recommend using mpiexec with no -np
# specification; it will automatically use all of the available processors.
mpiexec /home/username/my-big-code /home/username/inputfile --output-file=/mmfs1/home/username/output_${SLURM_JOB_ID}

Note that the specific commands run by mpiexec are illustrative - you will need to run this however you normally would execute your code.
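If you are unsure which modules your job needs to swap, you can inspect the module environment before writing the script (the module names shown in the script above, such as mvapich2/2.3.4, are examples; the versions on INCLINE may differ):

```shell
# List the modules currently loaded in your environment.
module list

# Search the available modules for MPI implementations.
module avail mvapich
```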

Once your script is ready, you can submit your job to slurm:

[username@l001 ~]$ sbatch myjob.sh
Submitted batch job 881

This submits your job and assigns it an ID; in this case, it is job number 881. You can monitor the status of the job:

[username@l001 ~]$ squeue
               881   compute MyBigJob username  R       1:20      4 c[004-007]

You can also check on your job by looking at the output file you specified in your script. To follow the output in real time as your code runs:

[username@l001 ~]$ tail -f /mmfs1/home/username/output_881_stdout
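If you need to stop a job, or review it after it has finished, scancel and sacct are the standard Slurm tools (job ID 881 is carried over from the example above):

```shell
# Cancel a queued or running job by its job ID.
scancel 881

# After the job finishes, review its accounting record
# (elapsed time, final state, exit code).
sacct -j 881 --format=JobID,JobName,Elapsed,State,ExitCode
```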

INCLINE Partitions

Higher priority job factor = higher priority.

Status     Partition Name      Access       Resources               Max # nodes   Max time    Priority Job Factor
Enabled    compute             All          Compute nodes           26            24h         2
Enabled    gpu                 All          GPU nodes               2             24h         2
Enabled    bigmem              All          High memory nodes       2             24h         2
Disabled   compute-quick       All          Compute nodes           2             1h          3
Disabled   gpu-quick           All          GPU nodes               1             1h          3
Disabled   bigmem-quick        All          High memory nodes       1             1h          3
Disabled   compute-long        All          Compute nodes           13            720h        1
Disabled   gpu-long            All          GPU nodes               1             720h        1
Disabled   bigmem-long         All          High memory nodes      1             720h        1
Disabled   compute-unlimited   Privileged   Compute nodes only      26            Unlimited   100
Disabled   gpu-unlimited       Privileged   GPU nodes only          2             Unlimited   100
Disabled   bigmem-unlimited    Privileged   High memory nodes only  2             Unlimited   100
Disabled   compute-USER        USER         Compute nodes           N             Unlimited   100
Disabled   gpu-USER            USER         GPU nodes               N             Unlimited   100
Disabled   bigmem-USER         USER         High memory nodes       N             Unlimited   100

compute / gpu / bigmem: The standard workhorse partitions for most HPC codes. Jobs submitted to these queues are reasonably high priority, but have a 24-hour time limit.

compute-quick / gpu-quick / bigmem-quick: These partitions are for testing or debugging. Submitting to these queues gets your code running quickly.

compute-long / gpu-long / bigmem-long: Use these partitions for long jobs that are expected to take multiple days or even weeks. These partitions are low priority but have a long runtime.

compute-unlimited / gpu-unlimited / bigmem-unlimited: High-priority, unlimited queues accessible to privileged users only, available by special request. The unlimited queues are used for ultra-large-scale production runs, benchmarking tests, etc.

compute-USER / gpu-USER / bigmem-USER: These partitions are for users who own individual nodes on INCLINE. For instance, if bobsmith is a PI who has paid to purchase a compute node, then compute-bobsmith is a special high-priority queue accessible to him and his designated users only.
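To target a specific partition, pass its name with -p in your batch script or on the command line (partition names as listed in the table above; note that the quick partitions may be disabled):

```shell
# In a batch script: request the standard compute partition.
#SBATCH -p compute

# Or on the command line, e.g. for a short debugging run
# (only works while the quick partitions are enabled):
sbatch -p compute-quick myjob.sh
srun -p compute-quick --pty bash -i
```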

For detailed discussion of SLURM prioritization and fairshare algorithm, see this presentation.
