Debugging on a cluster
TL;DR
- Use the Python debugger inside interactive jobs (on GPU)
- Check that you have a GPU build of PyTorch/JAX/TensorFlow
Debugging on a cluster
- Start a short (5 minute) interactive job (on GPU)
srun --mem=32gb --gres=gpu:1 -p gpu --time=0:05:00 --pty zsh
- Run your code on the GPU under the Python debugger (a sketch of dropping a breakpoint directly into your code follows this list)
python -m pdb train.py
- Press c to continue
- Press u to go up the stack. This is handy when the code stops at an error and you want to move from low-level package code up to your own code.
- Press d to go down the stack.
- Read error messages
- Make fixes locally
- Push to GitHub
git push
- Pull fixes onto cluster
git pull
- Run code again
python -m pdb train.py
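If you already know roughly where things go wrong, you can also drop a breakpoint straight into your code with Python's built-in breakpoint(). The script below is a hypothetical stand-in for train.py, not the real training script:

# hypothetical train.py: execution stops at breakpoint() even without "python -m pdb"
def train_step(step):
    loss = 1.0 / (5 - step)  # raises ZeroDivisionError at step 5
    return loss

for step in range(10):
    if step == 4:
        breakpoint()  # inspect local variables here, then press c to continue
    print(step, train_step(step))

Run under python -m pdb train.py, the uncaught error at step 5 also drops you into a post-mortem prompt, where u and d move up and down the stack as above.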
Check you have GPU
Check that you have installed PyTorch with GPU support
import torch
torch.cuda.is_available()  # True if PyTorch can see a CUDA GPU
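The equivalent checks for JAX and TensorFlow (assuming you installed the CUDA-enabled builds) look roughly like this:

import jax
print(jax.devices())  # expect a CUDA/GPU device in the list, not just CpuDevice

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # expect a non-empty list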
Monitor GPU use
On NVIDIA GPUs you can use the NVIDIA System Management Interface to monitor your GPU usage. Run it with
nvidia-smi
This is especially useful when combined with tmux, so you can watch nvidia-smi in one pane while your code runs in another
tmux
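If you would rather track memory from inside your training code, PyTorch exposes counters for the current device. A minimal sketch (log_gpu_memory is just an illustrative helper, not part of any library):

import torch

def log_gpu_memory(tag=""):
    # tensors currently allocated vs. memory reserved by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")

log_gpu_memory(tag="after one training step")  # e.g. call once per epoch or step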
Helper commands
Check how many jobs are running with
watch 'squeue -u scannell -h -t running -r | wc -l'
Check how many jobs are running or queued with
watch 'squeue -u scannell -h -t running,pending -r | wc -l'