HPC environments and SLURM
Every HPC system is configured differently, and it takes some time to get to know its file structure and an optimized workflow. Here I give basic information about exploring HPC systems as well as some details of the SLURM job scheduler.
HPC Environments and modules
HPC systems use modules to manage software environments. That means, unlike on a local machine, not all paths are accessible by default: we have to load specific modules before we can access specific executables.
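For example, this is roughly what that looks like in practice; the module name matches the example used later in this section, but the install path printed by which is a hypothetical placeholder that differs between clusters:

$ which pw.x                            # nothing found: pw.x is not on the PATH yet
$ module load quantumespresso/7.2
$ which pw.x
/sw/apps/quantumespresso/7.2/bin/pw.x   # hypothetical install path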
module avail
- Description: Lists all available modules; there can be thousands of them. It shows each module's name and the section it resides in. module spider does the same thing.
module spider <keyword>
- Description: Lists all modules whose names contain the string keyword. If there is a module named exactly keyword, it shows detailed information about that module.
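For instance, to look for a Wannier90 installation (the module names below are hypothetical; on Lmod-based systems, querying an exact name also lists which other modules must be loaded first):

$ module spider wannier90           # list every module whose name contains "wannier90"
$ module spider wannier90/3.1.0     # exact name: detailed info, including prerequisite modules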
module list
- Description: Lists all modules that are currently loaded.
module purge
- Description: Unloads all currently loaded modules. By default, the HPC may load some modules when we log in; purging will unload those too. Use with caution.
module load <module_name>
- Description: Loads a specific module (e.g., module load quantumespresso/7.2).
module unload <module_name>
- Description: Unloads a specific module.
In rare cases, we have to source a script to set some environment variables after loading the modules. For example, on the ARF cluster we have to source intel/setvars.sh or something similar.
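Putting the module commands together, a typical session looks something like this minimal sketch; the module names and the path of the vendor script are illustrative, so check your cluster's documentation for the real ones:

$ module purge                                   # start from a clean environment
$ module load intel/2024 quantumespresso/7.2     # illustrative module names
$ source /path/to/intel/setvars.sh               # only if your cluster requires it; hypothetical path
$ module list                                    # verify what is actually loaded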
Job Submission on HPC
On a local machine, we can simply execute a command such as pw.x or, for a parallel run, mpirun -np 10 pw.x. However, HPC systems usually employ a job scheduling system such as SLURM or PBS. The job scheduler decides which job runs first, which nodes are assigned to which job, how memory is allocated, etc. I am only familiar with SLURM. If you want, you can even install SLURM on your local machine. Here are some common commands for SLURM-compatible HPCs:
sbatch
- Usage: sbatch job.sh
- Description: Submits the job file job.sh to the queue. More on the job.sh file is below. When we submit a job, it prints a message saying Submitted batch job <job_ID>.
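A submission therefore looks like this; the job ID is whatever number the scheduler assigns:

$ sbatch job.sh
Submitted batch job 1926489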
squeue
- Usage: squeue
- Description: Shows the current job queue. The job status PD means pending, R means running, and CG means completing (after an error or a normal finish). Once a job has finished, it no longer appears in the queue.
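The output looks roughly like the sketch below; the exact columns can be configured per cluster, and the node names and values shown here are purely illustrative:

$ squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1926489     orfoz my_dft_j amuhaymi  R      18:49      5 orfoz[101-105]
 1926490     orfoz my_dft_j amuhaymi PD       0:00      5 (Priority)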
scancel
- Usage: scancel <job_ID>
- Description: As the name suggests, we can cancel a job at any stage.
sinfo
- Usage: sinfo
- Description: Shows partition information and node availability. It's very helpful for checking which partitions and nodes are available right before submitting a job. However, some HPC systems, such as MN5, do not give access to the sinfo command. It can be combined with grep to get the names of available partitions and nodes, e.g., sinfo | grep idle.
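An illustrative snippet of what this might return; the partition name follows the job-file example below, while the node names and counts are made up:

$ sinfo | grep idle
orfoz        up 3-00:00:00     42   idle orfoz[201-242]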
sacct
- Usage: sacct -j <job_ID>
- Description: Shows accounting information about jobs. Sometimes it helps to inspect this to learn about the timing and other details of a job. We can use the -o option to request more fields.
- Example:
$ sacct -j 1926489 -o JobID,State,Elapsed,Start,End,AllocCPUs,ExitCode
JobID             State    Elapsed               Start                 End  AllocCPUS ExitCode
------------ ---------- ---------- ------------------- ------------------- ---------- --------
1926489       COMPLETED   00:18:49 2025-04-03T21:51:58 2025-04-03T22:10:47        220      0:0
1926489.bat+  COMPLETED   00:18:49 2025-04-03T21:51:58 2025-04-03T22:10:47        110      0:0
1926489.ext+  COMPLETED   00:18:49 2025-04-03T21:51:58 2025-04-03T22:10:47        220      0:0
1926489.0     COMPLETED   00:18:47 2025-04-03T21:52:00 2025-04-03T22:10:47        220      0:0
seff
- Usage: seff <job_ID>
- Description: Not every HPC has this, but if yours does, it shows job efficiency information. Run this command after the job finishes; otherwise it gives inaccurate information for running/pending jobs.
- Example:
$ seff 1926489
ID: 1926489
Cluster: arf
User/Group: amuhaymin/amuhaymin
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 110
CPU Utilized: 2-19:36:54
CPU Efficiency: 98.00% of 2-20:59:40 core-walltime
Job Wall-clock time: 00:18:49
Memory Utilized: 203.68 GB (estimated maximum)
Memory Efficiency: 47.40% of 429.69 GB (1.95 GB/core)
SLURM job files
A SLURM job file can be submitted with the sbatch command. It's possible to submit any bash shell script with sbatch, e.g., sbatch -A username -J jobname job.sh, but to be systematic, we put all the SLURM directives inside the job file. Below is an example of such a job file:
#!/bin/bash
#SBATCH --job-name=my_dft_job                # job name shown in squeue
#SBATCH --account=username                   # project/account to charge
#SBATCH --partition=orfoz                    # partition (queue) to submit to
#SBATCH --ntasks=550                         # total number of MPI tasks
#SBATCH --nodes=5                            # number of nodes
#SBATCH --time=2-10:20:30                    # walltime limit (days-hours:minutes:seconds)
#SBATCH --output=output_%j.txt               # standard output file; %j expands to the job ID
#SBATCH --error=error_%j.txt                 # standard error file
#SBATCH --mail-type=ALL                      # email notifications (begin, end, fail, ...)
#SBATCH --mail-user=your_email@example.com   # address for the notifications

module load quantum_espresso
srun pw.x -in input_file.in > output_file.out
The same job file can also be written using the short forms of the SBATCH options:

#!/bin/bash
#SBATCH -J my_dft_job
#SBATCH -A username
#SBATCH -p orfoz
#SBATCH -n 550
#SBATCH -N 5
#SBATCH -t 2-10:20:30
#SBATCH -o output_%j.txt
#SBATCH -e error_%j.txt
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@example.com
module load quantum_espresso
srun pw.x -in input_file.in > output_file.out
You can learn more about each of these keywords from this page; especially check the sbatch page.
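Two sbatch features that are easy to miss: the --test-only flag validates the directives and estimates a start time without actually submitting anything, and options given on the command line override the corresponding #SBATCH lines in the script. A minimal sketch, using the job.sh from above:

$ sbatch --test-only job.sh       # parse directives and estimate a start time; no job is submitted
$ sbatch --time=01:00:00 job.sh   # submit, but override the walltime set inside the script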
Useful Tips and Shortcuts
- Tab Completion: Use the Tab key to auto-complete commands and filenames.
- Command History: Type history to view a list of previously executed commands.
- Searching in Files: Use grep to search for text patterns. For example, the QE output file reports the total energy on lines marked with an exclamation mark, so grep ! output_file.out will show the lines with the energy. Similarly, grep accuracy output_file.out will show the trend in the SCF accuracy, which helps to determine during a run whether the calculation is converging or not.
- Clear Screen: Use clear to clean your terminal.
- Monitoring Resources: Use top or htop to view running processes and free -h to check memory usage. These are good for checking your local machine, not so good on an HPC.
- Feedback loop: Use the output of one command as the input of another using the pipe operator |. For example, module avail will list all the available modules, but if we are trying to find wannier, we can do module avail | grep wannier.
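These tips can be combined. For example, a minimal sketch for keeping an eye on a running calculation (the file name follows the job script above; watch is a standard utility but may not be installed on every system):

$ watch -n 30 'squeue -u $USER'             # refresh your queue view every 30 seconds
$ tail -f output_file.out | grep accuracy   # follow the SCF accuracy while the job is running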