01 Feb '22
Running an AI neural style transfer model under Singularity
I’ve recently been given access to a beefy AI server (6x RTX3090s!) which is managed via SingularityCE, whose homepage boldly asks and then forgets to answer the question: “What is SingularityCE?”
If you dig further into the documentation it’s a little less coy:
SingularityCE is a container platform. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using SingularityCE on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall. Your container is a single file, and you don’t have to worry about how to install all the software you need on each different operating system.
I want to fire up my new GPUs and run one of Katherine Crowson’s awesome pytorch scripts to do some neural style transfer. I’m very familiar with Docker, but new to this Singularity thing, so here are some of the hurdles I encountered (and cleared) along the way.
Finding a base image
Going by the style transfer repo, it looks like torch v1.7.1 or later is required. Having done this sort of thing before, I know that these deep learning frameworks change a fair bit even between minor versions, so the safest option is to pick the exact version that it was designed for, in this case v1.7.1 (or at least v1.7.x).
So, the challenge is to find a Singularity image with that version of torch installed. The singularity docs suggest using the search command like so:
$ singularity search torch
Found 34 container images for amd64 matching "torch":
    library://adalisan81/default/pytorch:latest
    library://aday651/default/pytorch-geometric-gpu:latest
    library://aphoh/default/pytorch-20.11-py3:v-1
    library://aradeva24/default/ar_pytorch_21.06-py3.sif:latest
        PyTorch NGC container with CUDA11.0, where PyTorch and apex are installed
    library://calebh/hpccm-test/faircluster-pytorch-1.10-cuda11.3:sha256.7c63a6c1f6f125b8d3e14fa10203965536ec7173d50e85b8c9ecf6ee0bff2ba7
    library://claytonm/default/ubuntu18_torch_torchvision_opencv_cuda10:latest
    library://dxtr/default/hpc-pytorch:0.1
    library://guoweihe/default/pytorch:hz1
    library://guoweihe/default/pytorch:v1.2
    library://guoweihe/default/torch:deep-ed
    library://guoweihe/default/torch:sha256.ff32c85ade2c8f6a1d34bd500de1b7bd11cdac16461aeef4d7cbd16ab129d8a7
    library://guoweihe/default/torchgpipe:master
    library://guoweihe/default/torchgpipe:sha256.a6ea5d732cba07c043e2f06cccbe541d28da6a8d9e5a3d18872d58af288dbc62
    library://ipa/medimgproc/pytorch:latest
    library://jamiechang917/default/pytorch:sha256.9c60c9825f20626cc0d6e69ac61d862bfec927e82d86becee73f853d657f2425
    library://lamastex/default/pytorch_21.03.sif:berzelius-20211027
    library://lamastex/default/pytorch_21.07.sif:berzelius-20211027
    library://lev-hpc/ml/pytorch_gpu:jupyter
    library://lev-hpc/ml/torch_tf_jupyter:latest
    library://mbalazs/default/pytorch:latest
    library://mbalazs/default/pytorch_cuda110:latest
    library://mike_holcomb/pytorchvision/v0.1.0:latest,v0.1.0
    library://oscartang/default/pytorch_translation:latest
    library://ottino8/default/pytorch:first
    library://pauldhein/hpc-deep-learning/torch-base:latest
    library://sina-ehsani/default/transformer-googlecrawl-torch-opencv:latest
    library://sina-ehsani/default/transformer-googlecrawl-torch1.10:latest
    library://skykiny/default/pytorch_skykiny:latest
    library://tmyoda/default/cuda-torch-pyenv:latest
    library://tru/default/c7-conda-pytorch-10.0:2019-07-12-2053,latest
    library://ufscar/hpc/cuda_pytorch:latest
    library://uvarc/default/pytorch:1.4.0-py37
    library://yboget/default/pytorch_rdkit_visdom:sha256.d97f221ef1294a8ef57d40cf7994d05d4955abc7cf39e3ce42faafd59fe3151a
    library://zhengtang/default/torch_translation:latest
Hmm. It’s hard to know which is official, which ones are going to work (a few of them mention torch versions, but none of them are v1.7.x) and which ones might even be malicious? That’s a worry.
Looking a bit deeper into the Singularity docs I find that one can also use Docker/OCI images, and Singularity can pull them straight from Docker Hub. That’s good news, because the PyTorch project maintains official Docker images with CUDA support for NVIDIA graphics cards (like the 3090), so I find the specific container image for torch v1.7.1 and pull it down with:
singularity pull docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
style_transfer python script
I’d already cloned the neural style transfer repo, so I can follow the instructions in that README.md
$ singularity shell pytorch_1.7.1-cuda11.0-cudnn8-runtime.sif
Singularity> cd style-transfer-pytorch/
Singularity> pip install --user .
I got a bunch of warnings about certain things not being on the $PATH, but it seems to finish installing everything ok.
Continuing on with the instructions in the README, let’s try running this thing (I’d downloaded a couple of image files to use as my content and style images).
Singularity> style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
bash: style_transfer: command not found
Hmm, looks like those $PATH warnings were prescient. Looking back, the exact warning was:
WARNING: The script normalizer is installed in '/home/users/ben/.local/bin' which is not on PATH
The quickest & dirtiest fix for this is to add that ~/.local/bin directory to my path and try and re-run the script.
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
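The warning and the fix both come down to one question: is pip's per-user scripts directory on PATH? Here is a minimal Python sketch of that check. It assumes a POSIX-style layout where `pip install --user` puts console scripts in `<user base>/bin`, and the helper names `user_scripts_dir` and `on_path` are made up for illustration. Note that a `~` inside double quotes is stored in PATH literally, so the sketch expands it before comparing.

```python
import os
import site


def user_scripts_dir() -> str:
    """Directory where `pip install --user` puts console scripts (POSIX layout assumed)."""
    return os.path.join(site.getuserbase(), "bin")


def on_path(directory: str, path: str) -> bool:
    """True if `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path.split(os.pathsep)
        if entry  # ignore empty entries such as a trailing separator
    ]
    return target in entries


if __name__ == "__main__":
    scripts = user_scripts_dir()
    if on_path(scripts, os.environ.get("PATH", "")):
        print(f"{scripts} is already on PATH")
    else:
        print(f'Not on PATH; try: export PATH="$PATH:{scripts}"')
```

Run inside the container shell, this prints the absolute directory to add, which avoids relying on the shell to expand a quoted `~`.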
And away it went! Several minutes later, it was done. Here are the original two images:
and here’s the output:
Success…ish. Clearly I need to keep tweaking parameters & input images to come up with an output that’s actually good, but at least that journey can now begin.
But is it fast?
Actually, that declaration of success is a bit premature. At the top of the output I noticed that the script was running on the CPU, not the GPU.
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
~/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Using devices: cpu
CPU threads: 128
Loading model...
That’s really not ok: the whole point of running on this machine is to take advantage of the GPUs. There could be lots of reasons for this, but I have a hunch it has something to do with Singularity not allowing the script access to the hardware. Sure enough, looking through the Singularity GPU support documentation it turns out there’s a magic --nv flag which must be passed when starting up the Singularity session, so let’s do that.
$ singularity shell --nv pytorch_1.7.1-cuda11.0-cudnn8-runtime.sif
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
Using devices: cuda:0
~/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
Loading model...
*error traceback intensifies*
Well, that’s progress. Looking through the output I can see
Using devices: cuda:0
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
so torch can now see the GPUs. However, the error message in the middle of that output is now the problem:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Like I said earlier, torch/tensorflow/CUDA and deep learning frameworks in general are really finicky about versions. It’s tricky to get things up and running so that (i) all the versions work together and (ii) the changes you make don’t break the delicate version relationships between other deep learning projects you want to run on the same system.[1]
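The mismatch in that warning is easy to reason about by hand. The sketch below is a simplified version of the rule, not PyTorch's actual check: it assumes a build compiled for sm_XY can run on a GPU with the same major version X and a minor version of at least Y, and it ignores any PTX/JIT fallback. The function name `cubin_supported` is hypothetical.

```python
import re
from typing import Iterable, Tuple


def cubin_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Simplified compatibility rule (an assumption, see the text above):
    same major version, and the build's minor version <= the GPU's."""
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)", arch)
        if match is None:
            continue  # skip entries that are not plain sm_NN strings
        sm = int(match.group(1))
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    rtx_3090 = (8, 6)  # "compute 8.6" in the output above
    # The arch list reported by the warning: nothing for major version 8.
    print(cubin_supported(rtx_3090, ["sm_37", "sm_50", "sm_60", "sm_70"]))  # False
    # A hypothetical build that also targets sm_80 and sm_86.
    print(cubin_supported(rtx_3090, ["sm_70", "sm_80", "sm_86"]))  # True
```

In a live session the two inputs would come from torch itself, e.g. `torch.cuda.get_device_capability()` for the GPU and (in recent PyTorch versions) `torch.cuda.get_arch_list()` for the build.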
Using a pytorch image from NVIDIA’s container registry
Changing tack a bit (after a suggestion from a colleague) I decided to try using a (Docker) container image from the NVIDIA registry, rather than the official pytorch channel on Docker Hub.
$ singularity pull docker://nvcr.io/nvidia/pytorch:22.01-py3
$ singularity shell --nv pytorch_22.01-py3.sif
Singularity> pip install --user .
Now, let’s try running the style_transfer script one more time:
Singularity> PATH="$PATH:~/.local/bin" style_transfer ben.jpg tiger.jpg -o ben-tiger.jpg
Using devices: cuda:0
GPU 0 type: NVIDIA GeForce RTX 3090 (compute 8.6)
GPU 0 RAM: 24268 MB
Loading model...
Processing content image (128x85)...
Hooray! It works, and it runs like 10000x faster on the GPU.
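Forgetting --nv fails quietly (the script just falls back to the CPU), so a quick check before kicking off a long run seems worthwhile. This is only a sketch: `visible_devices` is a made-up helper, and it relies on nothing beyond `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
from typing import List


def visible_devices() -> List[str]:
    """Return the devices torch could use, or ["cpu"] if no GPU is usable."""
    try:
        import torch
    except ImportError:
        return ["cpu"]  # no torch at all, so certainly nothing on a GPU
    if not torch.cuda.is_available():
        return ["cpu"]  # e.g. the shell was started without --nv
    return [f"cuda:{index}" for index in range(torch.cuda.device_count())]


if __name__ == "__main__":
    print("Using devices:", ", ".join(visible_devices()))
```

If this prints only `cpu` inside the container, the likely suspects are a missing --nv flag or a torch build without CUDA support.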
I really was just “hacking it until it worked” during this process, so I have a few open questions.
What’s the “persistence” story with the singularity images (*.sif files)? Is it like docker, where I singularity shell in, do some things, but then any changes I make in the shell (container?) don’t persist? It doesn’t seem like that… but I need to have a better mental model of how singularity images work.
I didn’t use venvs or conda or poetry or any of the things I’d usually use when python-ing on my own machine, partially because of my above questions about how the whole singularity shell thing actually works. I just did pip install --user . and hoped it didn’t break anything else. Is that ok? Or should I still use venvs in the singularity image?
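One clue for both questions is in the tracebacks above, where torch is loaded from ~/.local/lib/python3.8/site-packages rather than from the image. By default Singularity bind-mounts the home directory into the container, and a SIF image is read-only, so anything installed with pip install --user lands in ~/.local on the host and is visible to every image. The sketch below (hypothetical helper `install_origin`, standard library only) reports whether a module would be imported from the per-user site directory or from somewhere else.

```python
import importlib.util
import os
import site


def install_origin(module_name: str) -> str:
    """Classify where a module would be imported from: 'user', 'system' or 'missing'."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None or not spec.origin:
        return "missing"
    user_site = os.path.realpath(site.getusersitepackages())
    origin = os.path.realpath(spec.origin)
    # Anything under the per-user site directory came from `pip install --user`.
    return "user" if origin.startswith(user_site + os.sep) else "system"


if __name__ == "__main__":
    for name in ("torch", "json"):
        print(f"{name}: {install_origin(name)}")
```

If `torch` comes back as `user`, the container is running the copy from the home directory, not the one the image was built with, which is one argument for a venv (or a separate user base per image).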
I will return and try to better understand these things later, but right now this isn’t on the critical path for me so I’ll have to park it. This blog post is really just me opening a ticket for myself to return to later. I share it so that you, dear reader, can also benefit from my mistakes (and if you know of better ways to do any of this then do drop me a line).
[1] I had hoped that Singularity might help with the “isolation” part of this, but I’m not sure I understand it well enough yet to know how to do it.