SLURM Partitions, Resources & Scheduling

Partitions

There are four regular partitions to choose from when running on Falcon. The default partition is the 'reg' partition. All of the general compute nodes are in all of these partitions.

Name     PriorityJobFactor   MaxTime     MaxSubmitPU
tiny     72                  6 hrs       unlimited
short    36                  24 hrs      1000
reg      18                  7 days      500
long     9                   unlimited   50
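
If you want to check the current state of the nodes in a partition, or confirm its limits, the standard SLURM query tools can be used; a quick sketch (partition names as in the table above):

sinfo -p tiny,short,reg,long
scontrol show partition reg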

It is best to pick the partition that most closely matches your job's run time; shorter partitions have a higher PriorityJobFactor, so jobs in them are more likely to start promptly. Use the '-p' command line flag to pick the partition, e.g.:

sbatch -p tiny myjob.slurm
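
The partition can also be set inside the job script with an #SBATCH directive instead of on the command line; a minimal sketch (the time request and program name are placeholders):

#!/bin/bash
#SBATCH -p tiny
#SBATCH --time=02:00:00    # stay within the 6 hr limit of the tiny partition

./my_program               # placeholder for your actual command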

There is now also a partition with GPU nodes available:

Name           PriorityJobFactor   MaxTime     MaxSubmitPU
gpu-volatile   18                  128 hrs     4

Each GPU node in this partition has 2 NVIDIA L40 GPUs, 512 GB RAM, and 64 cores. You can request 1 or 2 GPUs for your job like this:

sbatch -p gpu-volatile --gres=gpu:1 my_script.slurm
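
A complete batch script for a GPU job might look like the sketch below; the CPU, memory, and program values are illustrative, not recommendations:

#!/bin/bash
#SBATCH -p gpu-volatile
#SBATCH --gres=gpu:2          # request both L40 GPUs on the node
#SBATCH --cpus-per-task=16    # example value
#SBATCH --mem=64G             # example value

nvidia-smi                    # show which GPUs were allocated
./my_gpu_program              # placeholder for your actual command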

These GPU nodes were purchased through the EPSCoR I-CREWS project, and jobs in this partition are subject to preemption by users associated with that project (hence the 'volatile' name).
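
If your workload can tolerate being restarted, you may want SLURM to requeue a preempted job rather than simply lose it. Whether this happens depends on how preemption is configured on Falcon (not described here), so treat this as a sketch only:

#SBATCH --requeue             # ask that the job be requeued if it is preempted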

Resources

So that Slurm does not crash nodes by oversubscribing their available RAM, each job is allocated 3GB of RAM by default. If your job exceeds this amount, Slurm will forcibly stop it. Use the '--mem-per-cpu=' or '--mem=' arguments to request more. (Note: the --mem-per-cpu argument is in MB, whereas the --mem argument will accept the suffixes M | G | T.)

Most Falcon nodes have 128GB RAM (with 124GB available to Slurm jobs), but several have been upgraded to 256GB (250GB available). If your job requires more than 124GB of RAM, request more and it will be scheduled on nodes that have more RAM. There is also one node with 3TB RAM; if your job requires more than 250GB of RAM, request that amount with the '--mem' argument and it will run on this himem node. The himem node is not available in the 'long' partition, so if your job will take more than a week to run (and needs more than 256GB RAM), reach out to the Falcon system administrators.

sbatch --mem=150G my_job_script.slurm
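
To pick a sensible memory request, it can help to check how much memory a previous, similar job actually used; one way to do that with SLURM's accounting tools (the job ID is a placeholder):

sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,State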

If you want your job to use more than one CPU thread (core), you should explicitly request more cores so that nodes do not end up overloaded:

sbatch --cpus-per-task=16 --mem-per-cpu=5000 my_job_script.slurm
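
Inside a batch script, the same request can be made with #SBATCH directives, and the allocated core count can be passed on to the program so it does not start more threads than it was given; a minimal sketch (the OMP_NUM_THREADS line only applies if your program uses OpenMP):

#!/bin/bash
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=5000                     # in MB, so 80GB total for 16 cores

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # match thread count to the allocation
./my_threaded_program                          # placeholder for your actual command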

Fairshare

Jobs in the shorter partitions are given a higher priority scaling factor than jobs in longer partitions, and the higher the priority your job is assigned, the more likely it is to run sooner. We have also implemented the Slurm Fairshare feature. In short, the more you use Falcon, the lower the priority your jobs have compared to those of a user who has not been using as many compute resources. The algorithm also tracks usage by Account (University), so if one University's users have been making heavier use of the compute resources, users from that University will have a lower priority.
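
To see how the partition factor and fairshare combine into the priority of one of your pending jobs, the standard sprio tool shows the per-factor breakdown (the job ID is a placeholder):

sprio -j 123456 -l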

You can view the current FairShare weighting:

ondemand ~ * sshare
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          0.000000   817254013      1.000000            
 root                      root          1    0.250000         577      0.000001   1.000000 
 bsu                                     1    0.250000    74222408      0.090821            
 isu                                     1    0.250000       25186      0.000015            
 ui                                      1    0.250000   743005840      0.909164            

and to see all users, add the -a flag (user names redacted here):

ondemand ~ * sshare -a
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          0.000000   817254013      1.000000            
 root                      root          1    0.250000         577      0.000001   1.000000 
 bsu                                     1    0.250000    74222408      0.090821            
  bsu                    user_a          1    0.125000     5350669      0.072090   0.500000 
  bsu                    user_b          1    0.125000      146304      0.001971   0.545455 
  bsu                    user_c          1    0.125000    68725319      0.925938   0.454545 
  bsu                    user_d          1    0.125000           0      0.000000   0.772727 
  bsu                    user_e          1    0.125000         114      0.000002   0.590909 
  bsu                    user_f          1    0.125000           0      0.000000   0.636364 
  bsu                    user_g          1    0.125000           0      0.000000   0.772727 
  bsu                    user_h          1    0.125000           0      0.000000   0.772727 
 isu                                     1    0.250000       25186      0.000015            
  isu                    user_i          1    0.250000           0      0.000000   0.954545 
  isu                    user_j          1    0.250000           0      0.000000   0.954545 
  isu                    user_k          1    0.250000           0      0.000000   0.954545 
  isu                    user_l          1    0.250000       25186      1.000000   0.818182 
 ui                                      1    0.250000   743005840      0.909164            
  ui                     user_m          1    0.111111   724227447      0.974726   0.045455 
  ui                     user_n          1    0.111111      237005      0.000319   0.181818 
  ui                     user_o          1    0.111111           0      0.000000   0.409091 
  ui                     user_p          1    0.111111      597585      0.000804   0.136364 
  ui                     user_q          1    0.111111           0      0.000000   0.272727 
  ui                     user_r          1    0.111111    17943801      0.024150   0.090909 
  ui                     user_s          1    0.111111           1      0.000000   0.227273 
  ui                     user_t          1    0.111111           0      0.000000   0.409091 
  ui                     user_u          1    0.111111           0      0.000000   0.409091