Debugging on a cluster
TL;DR
- Use the Python debugger inside interactive jobs (on GPU)
- Check that you have a GPU build of PyTorch/JAX/TensorFlow
Debugging on a cluster
- Start a short (5 minute) interactive job (on GPU)
srun --mem=32gb --gres=gpu:1 -p gpu --time=0:05:00 --pty zsh
- Run your code on the GPU under the Python debugger (a sketch of dropping a breakpoint directly into your code follows this list)
python -m pdb train.py
- Press c to continue
- Press u to go up the stack. This is handy when the code stops at an error and you want to move from low-level package code up to your own code.
- Press d to go down the stack.
- Read error messages
- Make fixes locally
- Push to GitHub
git push
- Pull fixes onto cluster
git pull
- Run code again
python -m pdb train.py
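If you already know roughly where things go wrong, you can also drop a breakpoint straight into your code with Python's built-in breakpoint(). The script below is a hypothetical stand-in for train.py, not the real training script:

# hypothetical train.py: execution stops at breakpoint() even without "python -m pdb"
def train_step(step):
    loss = 1.0 / (5 - step)  # raises ZeroDivisionError at step 5
    return loss

for step in range(10):
    if step == 4:
        breakpoint()  # inspect local variables here, then press c to continue
    print(step, train_step(step))

Run under python -m pdb train.py, the uncaught error at step 5 also drops you into a post-mortem prompt, where u and d move up and down the stack as above.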
Check you have GPU
Check that you have installed PyTorch with GPU support
import torch
torch.cuda.is_available()  # True if PyTorch can see a CUDA GPU
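The equivalent checks for JAX and TensorFlow (assuming you installed the CUDA-enabled builds) look roughly like this:

import jax
print(jax.devices())  # expect a CUDA/GPU device in the list, not just CpuDevice

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # expect a non-empty list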
Monitor GPU use
On NVIDIA GPUs you can use the NVIDIA System Management Interface to monitor your GPU usage. Run it with
nvidia-smi
This is especially useful when combined with tmux, so you can watch nvidia-smi in one pane while your code runs in another
tmux
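If you would rather track memory from inside your training code, PyTorch exposes counters for the current device. A minimal sketch (log_gpu_memory is just an illustrative helper, not part of any library):

import torch

def log_gpu_memory(tag=""):
    # tensors currently allocated vs. memory reserved by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")

log_gpu_memory(tag="after one training step")  # e.g. call once per epoch or step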
Helper commands
Check how many jobs are running with
watch 'squeue -u scannell -h -t running -r | wc -l'
Check how many jobs are running or queued with
watch 'squeue -u scannell -h -t running,pending -r | wc -l'