This is the second post in the series "A minimalist guide for Data Scientist". The first one is A minimalist guide to getting started with SSH.
As a data scientist, you may need to work with high-performance computing clusters rather than a single remote server. Slurm is a highly scalable cluster management and job scheduling system for Linux clusters, now widely used in both academia and industry.
This post has three sections:
- Common commands of Slurm system
- How to set up an environment and run Python programs for a deep learning task, with an example
- Useful links about Slurm
As a data scientist, you will be a user of Slurm rather than an administrator. A Slurm cluster has a login node, which you connect to with SSH. The login node is a server that acts as the interface to the cluster: you upload your files to it, while the compute nodes are what actually run your tasks.
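For example, connecting and uploading a file might look like this (the hostname `login.cluster.example` and the username `alice` are placeholders; substitute your cluster's actual login address and your own account):

```shell
# Connect to the login node (placeholder hostname)
ssh alice@login.cluster.example

# Upload a local file to the login node from your own machine
scp train.py alice@login.cluster.example:~/project/
```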
First, you can use `sinfo` to get information about all compute nodes. Here's an example.
A partition is a group of compute nodes with the same properties. The `*` marks the default partition. Partitions with GPU nodes may have names like `gpu-small`. STATE shows the current state of the partition: `idle` means the partition is available for new tasks. TIMELIMIT is the maximum time your task can run per submission.
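For illustration, `sinfo` output might look like this (the partition names, node counts, and time limits here are hypothetical):

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 1-00:00:00     12   idle node[001-012]
small-gpu    up   10:00:00      4   idle gpu[01-04]
```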
Knowing this, you can write a bash script for the task. Here's an example:
```shell
#!/bin/sh
# 10 hour timelimit:
#SBATCH --time 10:00:00
#SBATCH --partition=small-gpu
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL

module load anaconda
echo "Hello"
```
The first six lines configure the sbatch job. The `--mail-user` setting makes the system send you emails when your task starts or ends. The task itself simply prints 'Hello'.
Then you can run this file with `sbatch test.sh`. The `sbatch` command submits a job script for later execution. It is useful because with `sbatch`, Slurm will automatically allocate a partition and schedule your task.
After you run `sbatch test.sh`, you will see something like this:
Submitted batch job 12345
`12345` is your unique job ID. To check the status of your task (whether it has started yet, and so on), you can use `squeue`. Here's an example.
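For illustration, `squeue` output might look like this (the job IDs, names, and users here are hypothetical):

```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345 small-gpu  test.sh    alice  R       0:42      1 gpu01
12346 small-gpu  test.sh      bob PD       0:00      1 (Resources)
```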
Here the ST column shows the status: `R` means running, and `PD` means pending, for example because the partition is not yet available.
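As a side note, if you submit jobs from a wrapper script, you can capture the job ID from sbatch's standard "Submitted batch job" message. This is a sketch: the hard-coded string stands in for a real `output=$(sbatch test.sh)` call.

```shell
# Simulated sbatch output; in a real session: output=$(sbatch test.sh)
output="Submitted batch job 12345"

# The job ID is the fourth field of the standard sbatch message
job_id=$(echo "$output" | awk '{print $4}')
echo "$job_id"
```

You can then pass `$job_id` to commands like `squeue -j` or `scancel`.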
Now we know the basics of Slurm. In this section, we will show how to set up an environment and dependencies for a deep learning task with `sbatch`. Our task is to train a neural network on OpenAI Gym with TensorFlow.
Generally, Anaconda or Python should already be installed as a module on the system. If not, please contact your administrator.
In the sbatch script, we use the `module load` command to load it:
```shell
module load anaconda
```
We create a conda environment with the specific dependencies in this script:
```shell
conda create -n my_env python=3.6 tensorflow-gpu==1.15
```
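One caveat: this `conda create` runs on every submission of the script. If you resubmit the job often, a guard like the following sketch (assuming the environment name `my_env` from above) skips creation when the environment already exists:

```shell
# Create the environment only if conda does not already list it
if ! conda env list | grep -q "my_env"; then
    conda create -y -n my_env python=3.6 tensorflow-gpu==1.15
fi
```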
Then we activate it.
```shell
eval "$(conda shell.bash hook)"
conda activate my_env
```
Remember to add the `eval "$(conda shell.bash hook)"` command; this step initializes conda in the job's non-interactive shell. And you need to use `conda activate` here: `source activate` will not activate the environment successfully.
You may also ask whether we can activate the environment from the login node instead. The answer is no: the job runs in a fresh shell on a compute node, so the environment must be activated inside the script.
In this task, we need the OpenAI Gym package. We install it with:

```shell
pip install gym[atari]
```
The final step is to run the Python file with `python path/test.py`. We run this batch file with `sbatch test.sh`, as before. The complete test.sh is shown below:
```shell
#!/bin/sh
# 10 hour timelimit:
#SBATCH --time 10:00:00
#SBATCH --partition=small-gpu
#SBATCH --mail-user=<your email>
#SBATCH --mail-type=ALL

module load anaconda
conda create -n my_env python=3.6 tensorflow-gpu==1.15
eval "$(conda shell.bash hook)"
conda activate my_env
pip install gym[atari]
python path/test.py
```
If the job ID is `12345`, the system will generate a file `slurm-12345.out` that records all the output. For this job, it will include information like this:
```
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: ~/my_env

  added / updated specs:
    - python=3.6
    - tensorflow-gpu==1.15
```
And if everything works well, you will also see the output of your Python program in the same file.
If you want to know more details, here are some useful links.