Debugging on a cluster


  • Use the Python debugger inside interactive jobs (on GPU)
  • Check that your PyTorch/JAX/TensorFlow install has GPU support

  1. Start a short (5 minute) interactive job (on GPU)
    srun --mem=32gb --gres=gpu:1 -p gpu --time=0:05:00 --pty zsh
  2. Run code on GPU with the Python debugger (add the path to your script after pdb)
    python -m pdb
    • Press c to continue.
    • Press u to go up. This is handy when the code stops at an error and you want to move from low-level package code up to your own code.
    • Press d to go down.
  3. Read error messages
  4. Make fixes locally
  5. Push to GitHub
    git push
  6. Pull fixes onto cluster
    git pull
  7. Run code again
    python -m pdb
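
To see which frames u and d in step 2 move between, here is a minimal sketch that is not part of the original workflow. The functions low_level_op and train_step are hypothetical stand-ins for package code and your own code. The script lists the call stack at the point where the error is raised, which is where pdb would stop.

```python
import traceback


def low_level_op(x):
    # Hypothetical stand-in for library code deep in the stack; fails when x == 0.
    return 1 / x


def train_step(x):
    # Hypothetical stand-in for your own code that calls into the library.
    return low_level_op(x)


def stack_at_failure():
    """Return function names from the outermost to the innermost frame at the error."""
    try:
        train_step(0)
    except ZeroDivisionError as err:
        return [frame.name for frame in traceback.extract_tb(err.__traceback__)]
    return []


if __name__ == "__main__":
    print(" -> ".join(stack_at_failure()))
```

Under pdb the debugger stops in the last frame listed (low_level_op). Each u moves one entry towards the start of the list, and each d moves back towards the end.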

Check you have GPU

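Check that PyTorch is installed with GPU support

import torch
print(torch.cuda.is_available())  # True only if this PyTorch build has CUDA support and can see a GPU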
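
A slightly fuller check can be sketched as follows. It assumes nothing beyond PyTorch's public torch.cuda.is_available() and torch.cuda.device_count() calls. The helper name gpu_report is hypothetical, and the function is written so that it still returns a result on a machine where PyTorch is not installed at all.

```python
import importlib.util


def gpu_report():
    """Summarise whether PyTorch is installed and whether it can see a CUDA GPU."""
    # Look for the package first so a missing install gives a report, not a crash.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "device_count": 0}

    import torch

    available = bool(torch.cuda.is_available())
    return {
        "torch_installed": True,
        "cuda_available": available,
        # Only ask for the device count when CUDA is usable.
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    print(gpu_report())
```

Running this inside the interactive job is the meaningful test. On a login node without a GPU, cuda_available will normally be False even with a correct install.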

Monitor GPU use

On NVIDIA GPUs you can monitor GPU usage with the NVIDIA System Management Interface. Run it with

nvidia-smi

This is especially useful when combined with tmux: keep the monitor open in one pane while your code runs in another.
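
If you would rather log these numbers from Python than read them by eye, the sketch below is one way to do it. It calls nvidia-smi (the command-line name of the NVIDIA System Management Interface) with its --query-gpu and --format=csv,noheader,nounits options. The helper names parse_gpu_csv and gpu_usage are hypothetical, and fields the driver reports as unavailable are returned as None.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(field):
    # nvidia-smi prints placeholders such as "[N/A]" for unsupported fields.
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = line.split(",")
        rows.append({
            "util_percent": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def gpu_usage():
    """Return one dict per GPU, or an empty list if nvidia-smi cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    print(gpu_usage())
```

Calling gpu_usage() at intervals from a training loop gives a rough usage trace without needing a second terminal.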


Helper commands

Check how many jobs are running (replace scannell with your own username) with

watch 'squeue -u scannell -h -t running -r | wc -l'

Check how many jobs are running or queued with

watch 'squeue -u scannell -h -t running,pending -r | wc -l'
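
The same counts can be obtained from Python, for example to decide inside a script whether to submit more jobs. This is a rough sketch that reuses the squeue flags from the commands above. The helper names squeue_command and count_jobs are hypothetical, and the username defaults to whoever runs the script.

```python
import getpass
import shutil
import subprocess


def squeue_command(user, states):
    """Build the squeue call: -h drops the header, -r prints one line per array task."""
    return ["squeue", "-u", user, "-h", "-t", ",".join(states), "-r"]


def count_jobs(states, user=None):
    """Count the user's jobs in the given states, or return None if squeue is missing."""
    if shutil.which("squeue") is None:
        return None
    user = user or getpass.getuser()
    result = subprocess.run(
        squeue_command(user, states), capture_output=True, text=True, check=True
    )
    # One output line per job, so the line count matches `wc -l`.
    return len(result.stdout.splitlines())


if __name__ == "__main__":
    print("running:", count_jobs(["running"]))
    print("running or queued:", count_jobs(["running", "pending"]))
```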