Using ollama on the Minnesota HPC cluster

September 22, 2024

Installing ollama #

You probably do not have sufficient rights to write to /usr/, so the alternative is to download the binary and install it somewhere in your home directory.

mkdir -p $HOME/local/ollama # where you want your install
cd $HOME/local/ollama
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz -o ./ollama-linux-amd64.tgz
tar -xzf ./ollama-linux-amd64.tgz -C .

You should now be able to start ollama; as a check, you can look at the version number:

$HOME/local/ollama/bin/ollama --version
# Warning: could not connect to a running Ollama instance
# Warning: client version is 0.3.8

Details come from the official Linux installation webpage.
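
The rest of this post calls ollama without the full path; for that to work, you can add the binary directory to your PATH (a small convenience, assuming the install location used above):

# make the ollama binary reachable without typing the full path
export PATH=$PATH:$HOME/local/ollama/bin
ollama --version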

Setting up and downloading a model #

Right now this is just an executable, and we do not have any model available yet, so we are going to go through a few more setup steps.

Setting up #

First, models can be fairly large, and you might not want them to reside in your home partition, which has limited space. We are going to set a few ollama environment variables to control where the models get downloaded. The official doc suggests using systemctl edit ollama.service, but I could not find a way to make this work (it modifies a file under /etc/systemd/, which is not writable for us). Instead, I set the environment variables by hand in bash; I found a list of the variables here.

The one environment variable of interest here is OLLAMA_MODELS (for now; we will use some of the others later).
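
If you prefer to set the variable once for your whole session rather than prefixing every command as I do below, a plain bash export works just as well (the path shown is the one I use in the next section; adjust it to your own scratch space):

# set once per shell session; every ollama command started from this shell will use it
export OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama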

Downloading and “running” a first model #

We are going to store the models in the scratch space on the server, which has very few limitations (no backup though, so be careful). In my example I choose /scratch.global/$USERNAME/llm/ollama; to make sure it exists, simply create it as shown below.
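
For completeness, here is the command spelled out (adjust the path to your own username and scratch space):

# create the scratch directory that will hold the downloaded models
mkdir -p /scratch.global/$USERNAME/llm/ollama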

We are ready to start the ollama server instance; as we start the server, we also pass our environment variable:

$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve &

The & makes the command run in the background.

First, you can check whether you have any models installed on your machine by running ollama list; if this is your first time doing this, the list should be empty.
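
For reference, on a fresh setup the output should be just the header row, something like:

$ ollama list
NAME    ID    SIZE    MODIFIED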

To run your first model, you need to download it first. Luckily, this happens automatically the first time you ask ollama to run a model. Pick a model of your choice from here and you are almost good to go; in this example I choose mistral-nemo.

$ ollama run mistral-nemo

You should see it download the model before it starts running. Chances are (depending on your node configuration) that the model is either unable to run or excruciatingly slow. This is because we have not set up any GPU … that is the next section.

Before we move on, we still need to clean up, as the server is still running in the background. To do this, we can simply find the id of the ollama server process and kill it:

$ pgrep ollama
951128 # this is the id of the process
$ kill 951128

You can check if the server is still running with ollama list.
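
As an alternative to the pgrep/kill pair, pkill can find and signal the process in one step (not ollama-specific, just a shell convenience):

# find any process named ollama and send it SIGTERM
pkill ollama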

Running a model properly #

Requesting an appropriate node #

For the model to run somewhat smoothly, we need to provision some GPU resources on the cluster. This will be different on each HPC resource you have access to, but for us at MSI the command will be something like this:

salloc --time=2:00:00 --nodes=1 --ntasks=16 --partition=interactive-gpu --gres=gpu:a40:2 --tmp=12gb --mem=64g

This is a request for 2 hours of walltime, 1 node, 16 CPU cores, a node in one of the GPU-equipped partitions (here named interactive-gpu), 2 A40 GPUs, 12gb of tmp space, and 64gb of RAM. Note that what matters here is the VRAM (not the CPU RAM); when we request a GPU allocation, we want it to be commensurate with the model we are interested in: broadly, the model needs to fit in the GPUs' VRAM.

To figure out the VRAM available on your GPU, there are multiple lists available (this one for NVIDIA seems fine). So in my case, I have requested 2 A40 GPUs, which adds up to 2 x 48 = 96gb of VRAM!
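
Once your allocation starts, you can also confirm the VRAM directly on the node with nvidia-smi (the output will of course depend on the GPUs you were granted):

# list the allocated GPUs and their total memory
nvidia-smi --query-gpu=name,memory.total --format=csv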

I assume that you are running the ollama server in the background (we will get back to this), so that you can run

$ ollama list
NAME                    ID              SIZE    MODIFIED
mistral-nemo:latest     994f3b8b7801    7.1 GB  XX minutes ago

mistral-nemo is a small model, so it will fit within 2 A40s without any problem. Of course, the goal here is to use larger models, which will happen later.

Running ollama with a GPU #

When we start the ollama server, it should automatically recognize your GPUs. It has happened to me in the past that the GPUs were not recognized, but there is an easy fix! To know whether your GPUs have been recognized by the server, you can run the server in the foreground and look at the log. Here is what I have on my end:

$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve
XXXX # some log of the server as it starts
time=2024-09-21T11:24:05.564-05:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-21T11:24:06.618-05:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-955e21e4-b472-f85c-cb8b-ccad7509910f library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA A40" total="44.3 GiB" available="44.1 GiB"
time=2024-09-21T11:24:06.618-05:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ef5223e4-8d47-cd57-a1ec-d60640467243 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA A40" total="44.3 GiB" available="44.1 GiB"

So it looks like it has found my GPUs.

If it has not, you will notice right away, as querying the LLM will be excruciatingly slow (low tokens/s). In this case, you have to hardwire the IDs of the GPUs into the environment variables. How do you find these IDs? There is a command for this in the case of NVIDIA GPUs:

$ nvidia-smi -L
GPU 0: NVIDIA A40 (UUID: GPU-955e21e4-b472-f85c-cb8b-ccad7509910f)
GPU 1: NVIDIA A40 (UUID: GPU-ef5223e4-8d47-cd57-a1ec-d60640467243)

Thus you can copy the two unique IDs (UUIDs) and pass them as an environment variable when you start the server.

Kill the existing server and restart it with the proper environment variables (adding CUDA_VISIBLE_DEVICES):

$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama CUDA_VISIBLE_DEVICES=GPU-955e21e4-b472-f85c-cb8b-ccad7509910f,GPU-ef5223e4-8d47-cd57-a1ec-d60640467243 ollama serve

Then, in another terminal, use ollama run mistral-nemo to run your model and you are good to go.
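
If you would rather script your queries than use the interactive prompt, the server also exposes an HTTP API (on localhost:11434 by default); a minimal sketch with curl, using the model we just pulled:

# send a single prompt to the running server; the answer is streamed back as JSON lines
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-nemo",
  "prompt": "Summarize what a job scheduler does in one sentence."
}'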

Pulling weights and “compiling” them into an ollama model #

Now let’s imagine you are interested in a specialized model that you have seen on Hugging Face, and you want to run it locally. I will take the example of a medium-sized model from Hugging Face. Pull the model (this requires the Git Large File Storage utility, git-lfs; I install it from source and then add it to the path1):

$ cd /scratch.global/$USERNAME/llm/
# you will have to create an access token on hugging face to log-in
$ git clone https://huggingface.co/mistralai/Mistral-Small-Instruct-2409

Then we need to build the model into a form that is compatible with ollama (we make sure to point the temporary directory somewhere we control, or else it will build the model under /tmp, where we only asked for 12gb of tmp space). There are two steps (official doc): creating a model file, and then creating the model from that model file. We start by creating the Modelfile:

$ cd /scratch.global/$USERNAME/llm/
$ cat << EOF > Modelfile # EOF left unquoted so that $USERNAME expands in the FROM path below
FROM /scratch.global/$USERNAME/llm/Mistral-Small-Instruct-2409
PARAMETER temperature 1
SYSTEM """
You are an AI assistant designed to be helpful, harmless, and honest. Your primary goal is to engage in friendly conversation and assist users with a wide range of tasks and questions.
Guidelines:
1. Be polite and respectful at all times.
2. Provide clear and concise answers.
3. If you're unsure about something, admit it and offer to find more information.
4. Use appropriate language and maintain a friendly tone.
5. Respect user privacy and don't ask for personal information.
6. Offer follow-up questions or suggestions to keep the conversation flowing naturally.
Remember, you are an AI language model and should not pretend to have human experiences or emotions. If asked about your capabilities or limitations, be honest and straightforward.
"""
EOF

Make sure to adjust the path in the FROM line to wherever the model was downloaded on your own machine. Obviously, you can also adjust the system prompt.

To create the model, we first start the server with a few options to make sure the temporary files from setting up the model do not get written to the default /tmp but to another directory we control (I think TMPDIR is the relevant variable). On my end it looks like the following (environment variables first, then the ollama command):

Starting the server and creating the model

$ mkdir -p /scratch.global/$USERNAME/llm/tmp # create a tmp dir to buffer model creation
$ TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_RUNNERS_DIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve & # backgrounded so the create command below can run from the same shell
$ cd /scratch.global/$USERNAME/llm # this is where we have created our Modelfile
$ TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_RUNNERS_DIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama create Mistral-Small-Instruct-2409

This might take a while (especially given our large-scale example)… so be patient. This is the output I get upon success:

transferring model data 100%
converting model
creating new layer sha256:abe736818cb4883a504b56fe77001810d62c3e59074782316f4fe7f3b9f6b7e8
creating new layer sha256:74a694aa0d803257176eb4326cf255021afef1d8bff656e8a93998636eba2269
creating new layer sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182
creating new layer sha256:de7dae010413cc23ca780e05dafe5eb56e5919d87fe595304b45da3211bfe9b3
writing manifest
success

I can check that my model has been properly added by listing all the installed models:

$ ollama list
NAME                              	ID          	SIZE  	MODIFIED
Mistral-Small-Instruct-2409:latest	42b2ae701882	44 GB 	11 hours ago
mistral-nemo:latest               	994f3b8b7801	7.1 GB	21 hours ago

Here it turns out that the model I downloaded did not quite fit in memory, so I had to go back and request an allocation with 4 A40 GPUs. Then I can simply load the model (do not forget to check that the server is running with all the environment variables set):

ollama run Mistral-Small-Instruct-2409

Do not hesitate to reach out for suggestions or tweaks!


  1. I installed git-lfs from source directly from their releases.

    GITLFS_VERSION=3.5.1
    cd /path/to/folder
    wget https://github.com/git-lfs/git-lfs/archive/refs/tags/v${GITLFS_VERSION}.tar.gz
    tar xvfz v${GITLFS_VERSION}.tar.gz
    cd git-lfs-${GITLFS_VERSION}
    ./configure --prefix=${HOME}/local/git-lfs/${GITLFS_VERSION} # this is where git-lfs will be installed
    make && make install
    

    Make sure to add the binary to your PATH once you are done: export PATH=$PATH:${HOME}/local/git-lfs/${GITLFS_VERSION}/bin ↩︎


