A minimalist guide to getting started with Slurm

2020.08.25

This is the second post in the series "A minimalist guide for Data Scientist"; the first one is A minimalist guide to getting started with SSH.

As a data scientist, you may need to deal with high-performance computing clusters instead of a single remote server. Slurm is a highly scalable cluster management and job scheduling system for Linux clusters, and it is widely used in both academia and industry.

In this post, there will be three sections.

  • Common commands of the Slurm system
  • How to set up an environment and run Python programs for a deep learning task, with an example
  • Useful links about Slurm

As a data scientist, you will usually be a user of Slurm rather than an administrator. A Slurm cluster has one or more login nodes; you connect to a login node with SSH, and it acts as the interface to the cluster. You can upload files to the login node, while the actual work runs on the compute nodes.
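For example, connecting and uploading a file might look like this (the hostname and file name are placeholders for your own cluster and project):

ssh <username>@login.cluster.example.edu
scp my_project.zip <username>@login.cluster.example.edu:~/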

First, you can use sinfo to get information about all compute nodes. Here's an example.

[Image: example sinfo output]

A partition is a group of compute nodes with the same properties. The * marks the default partition. GPU partitions often have names like 'small-gpu'. STATE shows the current state of the nodes in the partition; 'idle' means they are available for new jobs. TIMELIMIT is the maximum wall time a job may run in that partition.
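For reference, the default sinfo columns look roughly like this (the values below are only illustrative, not from a real cluster):

PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug*     up     1:00:00     2      idle   node[01-02]
small-gpu  up     1-00:00:00  4      idle   gpu[01-04]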

Knowing this, you can write a batch script for your task. Here's an example script, test.sh.

#!/bin/bash
# 10 hour timelimit:
#SBATCH --time 10:00:00
#SBATCH --partition=small-gpu
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
module load anaconda

echo "Hello"

The lines beginning with #SBATCH configure the job: the time limit, the partition, and the email notifications. The --mail-user option makes Slurm send you email about the job, e.g. when it starts and when it ends (--mail-type=ALL covers all event types). The job itself simply prints 'Hello'.
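Depending on your cluster, you may need a few more #SBATCH options; the ones below are common examples (names and values are illustrative, check your site's documentation):

#SBATCH --job-name=my_test        # a readable name shown in squeue
#SBATCH --output=%x-%j.out        # write output to <job name>-<job id>.out
#SBATCH --gres=gpu:1              # request one GPU on GPU partitions
#SBATCH --mem=16G                 # request 16 GB of memory
#SBATCH --cpus-per-task=4         # request 4 CPU cores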

Then you can submit it with sbatch test.sh. The sbatch command submits a job script for later execution; Slurm will allocate nodes in the requested partition and schedule your job automatically.
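The same options can also be given on the sbatch command line, where they override the #SBATCH lines in the script; for example, to lower the time limit for a single submission:

sbatch --time=02:00:00 --partition=small-gpu test.sh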

After you run sbatch test.sh, you will see something like this:

Submitted batch job 12345

Here 12345 is your unique job id. To check the status of your job (whether it has started, is still pending, and so on), you can use squeue. Here's an example.

[Image: example squeue output]

Here 'ST' is the job status: 'R' means the job is running, and 'PD' means it is pending, usually because the requested resources are not available yet.
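To see only your own jobs, or one specific job, you can filter squeue; the output below is only illustrative:

squeue -u $USER
squeue -j 12345

JOBID  PARTITION  NAME     USER   ST  TIME  NODES  NODELIST(REASON)
12345  small-gpu  test.sh  alice  R   5:02  1      gpu01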


Now we know the basics of Slurm. In this section, we will show how to set up the environment and dependencies for a deep learning task with sbatch. Our task is to train a neural network on OpenAI Gym with TensorFlow.

Generally, anaconda or python should already be installed as a module on the system. If not, please contact your administrator.

In the sbatch script, we use the module load command to load it.

module load anaconda
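If you are not sure what the module is called on your cluster (it might be anaconda3, miniconda, etc.), you can list the available modules first:

module avail
module avail anaconda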

In this script, we create a conda environment with the specific dependencies we need. We pass -y so that conda does not stop to ask for confirmation inside the non-interactive batch job.

conda create -y -n my_env python=3.6 tensorflow-gpu==1.15

Then we activate it.

eval "$(conda shell.bash hook)"
conda activate bluesky36_new

Remember to include the eval "$(conda shell.bash hook)" command; it initializes conda for the current shell (which is why the script uses #!/bin/bash). You also need to use conda activate here; source activate will not activate the environment reliably inside the batch script.
You may also wonder whether you can activate the environment on the login node instead. The answer is no: the batch script runs in a fresh shell on a compute node, so the environment has to be activated inside the script.

For this task, we need the OpenAI Gym package. We install it with pip inside the activated environment.

pip install "gym[atari]"

The final step is to run the Python file.

python path/test.py

We submit this batch script with sbatch test.sh.

The complete test.sh is shown below:

#!/bin/bash
# 10 hour timelimit:
#SBATCH --time 10:00:00
#SBATCH --partition=small-gpu
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL

# load the site's anaconda module
module load anaconda

# create the conda environment with the required packages
conda create -y -n my_env python=3.6 tensorflow-gpu==1.15

# initialize conda for this shell and activate the environment
eval "$(conda shell.bash hook)"
conda activate my_env

# install OpenAI Gym inside the environment
pip install "gym[atari]"

# run the training script
python path/test.py

If the job id is 12345, Slurm will write the job's output to a file named slurm-12345.out in the directory where you submitted the job. For this job, it will include information like this:

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: ~/my_env

  added / updated specs:
    - python=3.6
    - tensorflow-gpu==1.15

And if everything works well, you will also see the output of your Python program in slurm-12345.out.
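To follow the output while the job is still running, you can, for example, tail the file:

tail -f slurm-12345.out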


If you want to know more details, here are some useful links.

  1. Slurm Official Quick Start User Guide
  2. Yale HPC documentation
  3. GWU Colonial One Cluster Tutorial-1
  4. GWU Colonial One Cluster Tutorial-2
  5. GWU Colonial One Cluster Tutorial-3