Debugging on a cluster

TL;DR

  • Use the Python debugger (pdb) inside interactive jobs (on GPU)
  • Check that your PyTorch/JAX/TensorFlow install can see the GPU

Debugging on a cluster

  1. Start a short (5 minute) interactive job (on GPU)
    srun --mem=32gb --gres=gpu:1 -p gpu --time=0:05:00 --pty zsh
    
  2. Run code on GPU with the Python debugger
    python -m pdb train.py
    
    • Press c to continue running until the next breakpoint or error.
    • Press u to go up one stack frame. This is handy when the code stops at an error and you want to move from low-level package code up to your own code.
    • Press d to go down one stack frame.
  3. Read error messages
  4. Make fixes locally
  5. Push to GitHub
    git push
    
  6. Pull fixes onto cluster
    git pull
    
  7. Run code again
    python -m pdb train.py
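
To make the u and d commands concrete, here is a minimal sketch of the call stack they move through. The functions library_call and train_step are hypothetical stand-ins for package code and your own code; they are not taken from train.py.

```python
import traceback


def library_call(x):
    # Hypothetical stand-in for low-level package code that raises.
    return 1 / x


def train_step(x):
    # Hypothetical stand-in for your own code calling into the package.
    return library_call(x)


def stack_at_error():
    """Return the function names on the stack when the error occurs, outermost first."""
    try:
        train_step(0)
    except ZeroDivisionError as exc:
        return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]


if __name__ == "__main__":
    print(stack_at_error())
```

When pdb stops at an uncaught error it starts in the innermost frame (here library_call). Pressing u moves to train_step, and pressing d moves back down.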
    

Check you have GPU

Check that PyTorch was installed with GPU support. On a GPU node the following should return True:

import torch
torch.cuda.is_available()
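
The same check can be extended to JAX and TensorFlow. The sketch below is one possible way to do it, assuming that jax.default_backend() reports "gpu" when a GPU backend is active; the helper name gpu_report is made up for illustration.

```python
import importlib


def gpu_report():
    """Return {framework: True/False/None}; None means the framework is not installed."""
    checks = {
        "torch": lambda m: m.cuda.is_available(),
        # Assumption: JAX reports the string "gpu" for a CUDA backend.
        "jax": lambda m: m.default_backend() == "gpu",
        "tensorflow": lambda m: len(m.config.list_physical_devices("GPU")) > 0,
    }
    report = {}
    for name, check in checks.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # framework not installed
            continue
        try:
            report[name] = bool(check(module))
        except Exception:
            # Treat any runtime failure of the check as "no usable GPU".
            report[name] = False
    return report


if __name__ == "__main__":
    for framework, has_gpu in gpu_report().items():
        print(framework, has_gpu)
```

Run it inside the interactive job rather than on the login node, since login nodes often have no GPU attached.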

Monitor GPU use

On NVIDIA GPUs, the NVIDIA System Management Interface reports GPU utilization and memory usage. Run it with

nvidia-smi

This is especially useful when combined with tmux, which lets you keep nvidia-smi open in one pane while your code runs in another

tmux
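
If you would rather read the same numbers from Python (for example to log them during training), the following sketch wraps nvidia-smi's CSV query mode. The helper names are hypothetical, and it assumes every queried field is a plain integer; some GPUs report values such as "[N/A]", which this sketch does not handle.

```python
import shutil
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line 'util, used, total' into a dict of ints."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def gpu_usage():
    """Return one dict per GPU, or an empty list if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    print(gpu_usage())
```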

Helper commands

Check how many jobs are running with (replace scannell with your own username)

watch 'squeue -u scannell -h -t running -r | wc -l'

Check how many jobs are running or queued with

watch 'squeue -u scannell -h -t running,pending -r | wc -l'
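
The same counts can be obtained from a script. This is a rough sketch that calls squeue with the flags used above and counts the non-empty output lines; count_jobs and count_lines are hypothetical helper names, and the username defaults to whoever runs the script.

```python
import getpass
import subprocess


def count_lines(text):
    """Count non-empty lines; squeue -h -r prints one line per job or array task."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_jobs(states="running", user=None):
    """Count a user's jobs in the given comma-separated states via squeue."""
    user = user or getpass.getuser()
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-t", states, "-r"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_lines(out)


if __name__ == "__main__":
    print("running:", count_jobs("running"))
    print("running or queued:", count_jobs("running,pending"))
```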