Hydra submitit launcher


  • Hydra’s Submitit Launcher plugin makes it easy to submit SLURM jobs
  • Easily sweep over multiple random seeds
  • Easily sweep over different combinations of hyperparameters

Hydra’s Submitit Launcher plugin provides a SLURM launcher based on Submitit. Its main benefit is that you can submit SLURM jobs without writing sbatch scripts.

The main idea is to put the options we’d normally pass to sbatch or srun (e.g. --gpus-per-task=1) inside a Hydra config. Then, instead of writing sbatch scripts with array jobs (e.g. to submit 5 jobs with different seeds), we can submit many jobs from the command line. Let me show this with an example.


Consider a simple example with the following directory structure

├── train.py  # training script
└── cfgs/
    ├── train.yaml  # main config
    └── hydra/
        └── launcher/
            └── MY_CLUSTER_2HRS.yaml

The main training script train.py looks something like

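import hydra
from omegaconf import DictConfig


@hydra.main(version_base="1.3", config_path="./cfgs", config_name="train")
def train(cfg: DictConfig) -> None:
    # Placeholder body: the actual training loop would go here.
    print(f"lr={cfg.learning_rate}, batch_size={cfg.batch_size}, seed={cfg.seed}")


if __name__ == "__main__":
    train()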

where the learning_rate, batch_size and seed are configured in the config file in cfgs/train.yaml

defaults:
  - override hydra/launcher: MY_CLUSTER_2HRS
  - _self_

batch_size: 128
learning_rate: 1e-4
seed: 42

We now add a configuration file cfgs/hydra/launcher/MY_CLUSTER_2HRS.yaml to configure the parameters we’d usually set in the sbatch script. Note that the file must be inside cfgs/hydra/launcher/ as we’re overriding hydra’s default launcher. Here’s a simple example requesting 1 GPU for 2 hours.

defaults:
  - submitit_slurm

_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
timeout_min: 120 # 2 hours
tasks_per_node: 1
nodes: 1
name: ${hydra.job.name}
comment: null
exclude: null
signal_delay_s: 600
max_num_timeout: 20
additional_parameters: {}
array_parallelism: 256
setup: []
constraint: "volta"
mem_gb: 50
gres: gpu:1
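
To connect this config to a hand-written sbatch script, here is a rough Python sketch that prints the #SBATCH lines these settings roughly stand for. The SBATCH_FLAGS mapping and the to_sbatch_lines helper are assumptions made for illustration. They are not part of Hydra or Submitit, and the script Submitit actually generates may differ in flags and formatting.

```python
# Launcher settings copied from MY_CLUSTER_2HRS.yaml above.
launcher_cfg = {
    "timeout_min": 120,
    "nodes": 1,
    "tasks_per_node": 1,
    "mem_gb": 50,
    "gres": "gpu:1",
    "constraint": "volta",
}

# Assumed, approximate correspondence to sbatch options (illustration only).
SBATCH_FLAGS = {
    "timeout_min": "--time={}",  # sbatch reads a bare number as minutes
    "nodes": "--nodes={}",
    "tasks_per_node": "--ntasks-per-node={}",
    "mem_gb": "--mem={}G",
    "gres": "--gres={}",
    "constraint": "--constraint={}",
}


def to_sbatch_lines(cfg):
    """Render each known setting as the #SBATCH line it roughly replaces."""
    return [
        "#SBATCH " + SBATCH_FLAGS[key].format(value)
        for key, value in cfg.items()
        if key in SBATCH_FLAGS
    ]


sbatch_lines = to_sbatch_lines(launcher_cfg)
print("\n".join(sbatch_lines))
```

Keeping these values in a named YAML file means a different time limit or GPU type can be handled by adding another file under cfgs/hydra/launcher/ and selecting it in the defaults list.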

Submit a single SLURM job

We can now submit SLURM jobs straight from the command line with

python train.py -m ++learning_rate=1e-3 ++batch_size=64

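The -m/--multirun flag tells Hydra to run in multirun mode, which hands each run to the configured launcher. Here that submits one SLURM job with the settings we specified in cfgs/hydra/launcher/MY_CLUSTER_2HRS.yaml. Without -m, the script runs locally and the launcher config is not used.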

Sweep over hyperparameters

We can easily perform sweeps over hyperparameters. The following command will submit four jobs running train.py with each combination of learning_rate and batch_size

python train.py -m ++learning_rate=1e-3,1e-4 ++batch_size=64,128
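
To see why this command produces four jobs, here is a small sketch that expands the comma-separated values into their Cartesian product. It only illustrates the grid. It is not Hydra's sweeper code, and expand_grid is a hypothetical helper.

```python
from itertools import product

# Values taken from the sweep command above.
sweep = {
    "learning_rate": ["1e-3", "1e-4"],
    "batch_size": ["64", "128"],
}


def expand_grid(sweep):
    """Return one list of override strings per combination of swept values."""
    keys = list(sweep)
    return [
        [f"++{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*(sweep[key] for key in keys))
    ]


jobs = expand_grid(sweep)
for job_id, overrides in enumerate(jobs):
    print(job_id, " ".join(overrides))
```

The number of jobs is the product of the list lengths, so adding a third parameter with three values would turn the four jobs into twelve.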

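Since all the jobs are submitted with a single command and can run in parallel (up to the array_parallelism limit and whatever the cluster has free), this speeds up hyperparameter search and makes configuring experiments a lot easier.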

Sweep over multiple random seeds

We can also easily submit jobs with multiple random seeds. For example,

python train.py -m ++seed=42,100,696,23492,15
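
Each job in a seed sweep receives a different cfg.seed, but the seed only has an effect if the training script applies it. Here is a minimal sketch of how that might look using only Python's built-in random module. The names seed_everything and sample_noise are hypothetical, and a real script would also need to seed any other libraries it uses.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's built-in RNG."""
    random.seed(seed)
    # Assumption: a script using NumPy or PyTorch would seed those RNGs here too.


def sample_noise(seed: int, n: int = 3) -> list:
    """Stand-in for any randomness in training (initialisation, shuffling, ...)."""
    seed_everything(seed)
    return [random.random() for _ in range(n)]


# The same seed reproduces the same draws; different seeds give different ones.
print(sample_noise(42) == sample_noise(42))
print(sample_noise(42) == sample_noise(100))
```

In train.py, the equivalent would be calling seed_everything(cfg.seed) at the start of train.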