01 Feb '22
Running an AI neural style transfer model under Singularity
I’ve recently been given access to a beefy AI server (6x RTX3090s!) which is managed via SingularityCE, whose homepage boldly asks and then forgets to answer the question: “What is SingularityCE?”
If you dig further into the documentation it’s a little less coy:
SingularityCE is a container platform. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using SingularityCE on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall. Your container is a single file, and you don’t have to worry about how to install all the software you need on each different operating system.
I want to fire up my new GPUs and run one of Katherine Crowson’s awesome pytorch scripts to do some neural style transfer. I’m very familiar with Docker, but new to this Singularity thing, so here are some of the hurdles I encountered (and cleared) along the way.
Finding a base image
Looking in the style transfer repo’s setup.py, it looks like torch v1.7.1 or later is required. Having done this sort of thing before, I know that these deep learning frameworks change a fair bit even between minor versions, so the safest option is to pick the exact version that it was designed for, in this case v1.7.1 (or at least v1.7.x).
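To make that rule of thumb concrete, here is a minimal sketch of the check in plain Python. The helper names (parse_version, is_safe_choice) are made up for this post and are not part of torch or pip; it also assumes simple numeric versions, so a pre-release string like 1.7.1a0 would need extra handling.

```python
def parse_version(text):
    """Turn '1.7.1' or '1.7.1+cu110' into a tuple of ints such as (1, 7, 1)."""
    core = text.split("+", 1)[0]  # drop a local build tag such as '+cu110'
    return tuple(int(part) for part in core.split("."))


def is_safe_choice(candidate, minimum="1.7.1"):
    """True if candidate is at least `minimum` and in the same major.minor series."""
    cand, floor = parse_version(candidate), parse_version(minimum)
    return cand >= floor and cand[:2] == floor[:2]


if __name__ == "__main__":
    # Versions that appear in this post or in the search results below.
    for version in ("1.4.0", "1.7.1", "1.10.0"):
        print(version, is_safe_choice(version))
```

Under this rule 1.7.1 passes, while 1.4.0 (too old) and 1.10.0 (a later minor series) are both rejected.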
So, the challenge is to find a Singularity image with that version of torch installed. The singularity docs suggest using the search command like so:
$ singularity search torch
Found 34 container images for amd64 matching "torch":
library://adalisan81/default/pytorch:latest
library://aday651/default/pytorch-geometric-gpu:latest
library://aphoh/default/pytorch-20.11-py3:v-1
library://aradeva24/default/ar_pytorch_21.06-py3.sif:latest
PyTorch NGC container with CUDA11.0, where PyTorch and apex are installed
library://calebh/hpccm-test/faircluster-pytorch-1.10-cuda11.3:sha256.7c63a6c1f6f125b8d3e14fa10203965536ec7173d50e85b8c9ecf6ee0bff2ba7
library://claytonm/default/ubuntu18_torch_torchvision_opencv_cuda10:latest
library://dxtr/default/hpc-pytorch:0.1
library://guoweihe/default/pytorch:hz1
library://guoweihe/default/pytorch:v1.2
library://guoweihe/default/torch:deep-ed
library://guoweihe/default/torch:sha256.ff32c85ade2c8f6a1d34bd500de1b7bd11cdac16461aeef4d7cbd16ab129d8a7
library://guoweihe/default/torchgpipe:master
library://guoweihe/default/torchgpipe:sha256.a6ea5d732cba07c043e2f06cccbe541d28da6a8d9e5a3d18872d58af288dbc62
library://ipa/medimgproc/pytorch:latest
library://jamiechang917/default/pytorch:sha256.9c60c9825f20626cc0d6e69ac61d862bfec927e82d86becee73f853d657f2425
library://lamastex/default/pytorch_21.03.sif:berzelius-20211027
library://lamastex/default/pytorch_21.07.sif:berzelius-20211027
library://lev-hpc/ml/pytorch_gpu:jupyter
library://lev-hpc/ml/torch_tf_jupyter:latest
library://mbalazs/default/pytorch:latest
library://mbalazs/default/pytorch_cuda110:latest
library://mike_holcomb/pytorchvision/v0.1.0:latest,v0.1.0
library://oscartang/default/pytorch_translation:latest
library://ottino8/default/pytorch:first
library://pauldhein/hpc-deep-learning/torch-base:latest
library://sina-ehsani/default/transformer-googlecrawl-torch-opencv:latest
library://sina-ehsani/default/transformer-googlecrawl-torch1.10:latest
library://skykiny/default/pytorch_skykiny:latest
library://tmyoda/default/cuda-torch-pyenv:latest
library://tru/default/c7-conda-pytorch-10.0:2019-07-12-2053,latest
library://ufscar/hpc/cuda_pytorch:latest
library://uvarc/default/pytorch:1.4.0-py37
library://yboget/default/pytorch_rdkit_visdom:sha256.d97f221ef1294a8ef57d40cf7994d05d4955abc7cf39e3ce42faafd59fe3151a
library://zhengtang/default/torch_translation:latest
Hmm. It’s hard to know which of these is official, which ones are going to work (a few of them mention torch versions, but none of them are v1.7.x), and which ones might even be malicious. That’s a worry.
Looking a bit deeper into the Singularity docs I find that one can also use Docker/OCI images, and Singularity can pull them straight from Docker Hub. That’s good news, because the PyTorch project maintains official Docker images with CUDA support for NVIDIA graphics cards (like the 3090), so I find the specific container image for torch v1.7.1 and pull it down with:
$ singularity pull docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
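The pull writes a .sif file into the current directory, named after the image and its tag, and that file name is what singularity shell needs later. Here is a rough sketch of the naming pattern as it shows up in this post. The function default_sif_name is hypothetical, and falling back to a "latest" tag when none is given is my assumption rather than something taken from the Singularity docs.

```python
def default_sif_name(uri):
    """Guess the file name `singularity pull` writes for a docker:// URI."""
    reference = uri.split("://", 1)[-1]            # drop the 'docker://' scheme
    last_component = reference.rsplit("/", 1)[-1]  # e.g. 'pytorch:22.01-py3'
    name, _, tag = last_component.partition(":")
    # Assumption: an untagged image is treated as ':latest'.
    return f"{name}_{tag or 'latest'}.sif"


if __name__ == "__main__":
    print(default_sif_name("docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime"))
```

If in doubt, a quick ls after the pull shows the real file name.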
Running the style_transfer python script
I’d already cloned the neural style transfer repo, so I can follow the instructions in that README.md:
$ singularity shell pytorch_1.7.1-cuda11.0-cudnn8-runtime.sif
Singularity> cd style-transfer-pytorch/
Singularity> pip install --user .
I got a bunch of warnings about certain things not being on the $PATH, but it seems to finish installing everything ok.
Continuing on with the instructions in the README, let’s try running this thing (I’d downloaded a couple of image files to use as my content and style images).
Singularity> style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
bash: style_transfer: command not found
Hmm, looks like those $PATH warnings were prescient. Looking back, the exact warning was:
WARNING: The script normalizer is installed in '/home/users/ben/.local/bin' which is not on PATH
The quickest & dirtiest fix for this is to add that /bin directory to my path and try re-running the script.
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
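For a slightly less ad hoc diagnosis, a few lines of Python run inside the container can report where pip install --user puts its scripts and whether that directory is on the PATH. This is only a sketch: the helper names are mine, and it assumes a Linux layout where user scripts land in the bin directory under site.getuserbase().

```python
import os
import site


def user_bin_dir():
    """Directory where `pip install --user` places scripts (Linux layout assumed)."""
    return os.path.join(site.getuserbase(), "bin")


def is_on_path(directory, path_value):
    """True if `directory` appears as an entry in a PATH-style string."""
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry  # skip empty entries produced by '::' or a trailing ':'
    ]
    return wanted in entries


if __name__ == "__main__":
    bin_dir = user_bin_dir()
    if is_on_path(bin_dir, os.environ.get("PATH", "")):
        print(f"{bin_dir} is already on PATH")
    else:
        print(f'{bin_dir} is not on PATH; try: export PATH="$PATH:{bin_dir}"')
```

Exporting the PATH once per shell session would also save repeating the prefix on every command.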
And away it went! Several minutes later, it was done. Here are the original two images:
and here’s the output:
Success…ish. Clearly I need to keep tweaking parameters & input images to come up with an output that’s actually good, but at least that journey can now begin.
But is it fast?
Actually, that declaration of success is a bit premature. At the top of the output I noticed that the script was running on the CPU, not the GPU.
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
~/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from https://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
Using devices: cpu
CPU threads: 128
Loading model...
That’s really not ok: the whole point of running on this machine is to take advantage of the GPUs. There could be lots of reasons for this, but I have a hunch it has something to do with Singularity not allowing the script access to the hardware. Sure enough, looking through the Singularity GPU support documentation it turns out there’s a magic --nv flag which must be passed when starting up the Singularity session, so let’s do that.
$ singularity shell --nv pytorch_1.7.1-cuda11.0-cudnn8-runtime.sif
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
Using devices: cuda:0
~/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
Loading model...
*error traceback intensifies*
Well, that’s progress. Looking through the output I can see
Using devices: cuda:0
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
so torch can now see the GPUs. However, the warning in the middle of that output is now the problem:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
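The same comparison can be made by hand before launching a long job. The sketch below checks whether a GPU's architecture name (sm_86 for the 3090) appears in the list of architectures the installed torch build was compiled for. The function names are my own, and I'm assuming a torch version that provides torch.cuda.get_arch_list(). The exact-match test is deliberately strict and simplified; PyTorch's own check may be more lenient.

```python
def arch_name(capability):
    """Map a compute capability such as (8, 6) to an arch name such as 'sm_86'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(capability, arch_list):
    """True if the GPU's exact architecture is listed in the build's arch list."""
    return arch_name(capability) in arch_list


def report():
    """Return one human-readable line per visible GPU (or a single reason why not)."""
    try:
        import torch
    except ImportError:
        return ["torch is not installed in this environment"]
    if not torch.cuda.is_available():
        return ["torch cannot see a GPU (was the shell started with --nv?)"]
    archs = torch.cuda.get_arch_list()  # architectures this build was compiled for
    lines = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        verdict = "ok" if build_supports(capability, archs) else "NOT in this build"
        name = torch.cuda.get_device_name(index)
        lines.append(f"cuda:{index} {name} {arch_name(capability)}: {verdict}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

With the values from the warning above, build_supports((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"]) is False, which matches what torch is complaining about.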
Like I said earlier, torch/tensorflow/CUDA and deep learning frameworks in general are really finicky about versions. It’s tricky to get things up and running so that (i) all the versions work together and (ii) the changes you make don’t break the delicate version relationships between other deep learning projects you want to run on the same system.[1]
After a web search, it seems like others have had similar issues, although I tried all the approaches listed there and none of them worked.
Using a pytorch image from NVIDIA’s container registry
Changing tack a bit (after a suggestion from a colleague) I decided to try using a (Docker) container image from the NVIDIA registry, rather than the official pytorch channel on Docker Hub.
$ singularity pull docker://nvcr.io/nvidia/pytorch:22.01-py3
$ singularity shell --nv pytorch_22.01-py3.sif
Singularity> pip install --user .
Now, let’s try running the style_transfer script one more time:
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
Using devices: cuda:0
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
Loading model...
Processing content image (128x85)...
Hooray! It works, and it runs like 10000x faster on the GPU.
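Now that the pieces are known, the interactive session could probably be collapsed into a single non-interactive command using singularity exec instead of singularity shell. The sketch below only builds and prints that command. build_command is a hypothetical helper, and it assumes bash exists inside the image (the earlier "bash: style_transfer: command not found" message suggests it does) and that pip install --user . has already been run.

```python
import shlex


def build_command(image, content, style, output, use_gpu=True):
    """Return the argv list for one non-interactive style_transfer run."""
    # Same PATH workaround as in the interactive session, using $HOME so it
    # expands inside the double quotes.
    inner = 'PATH="$PATH:$HOME/.local/bin" style_transfer ' + " ".join(
        shlex.quote(arg) for arg in (content, style, "-o", output)
    )
    command = ["singularity", "exec"]
    if use_gpu:
        command.append("--nv")  # expose the host's NVIDIA driver and GPUs
    command += [image, "bash", "-c", inner]
    return command


if __name__ == "__main__":
    argv = build_command("pytorch_22.01-py3.sif", "ben.jpg", "tiger.jpg", "ben-tiger.jpg")
    # Print rather than execute; pass argv to subprocess.run(argv, check=True)
    # on a machine where singularity is installed.
    print(shlex.join(argv))
```

I haven't run it this way myself, so treat it as a starting point rather than a tested recipe.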
Open questions
I really was just “hacking it until it worked” during this process, so I have a few open questions.
- What’s the “persistence” story with the singularity images (*.sif files)? Is it like docker, where I singularity shell in, do some things, but then any changes I make in the shell (container?) don’t persist? It doesn’t seem like that… but I need to have a better mental model of how singularity images work.
- I didn’t use venvs or conda or poetry or any of the things I’d usually use when python-ing on my own machine, partially because of my above questions about how the whole singularity shell thing actually works. I just did pip install --user . and hoped it didn’t break anything else. Is that ok? Or should I still use venvs in the singularity image?
I will return and try to better understand these things later, but right now this isn’t on the critical path for me so I’ll have to park it. This blog post is really just me opening a ticket for myself to return to later. I share it so that you, dear reader, can also benefit from my mistakes (and if you know of better ways to do any of this then do drop me a line).
Footnotes
1. I had hoped that Singularity might help with the “isolation” part of this, but I’m not sure I understand it well enough yet to know how to do it.