Using ollama on the Minnesota HPC cluster
September 22, 2024
Installing ollama #
You probably do not have sufficient rights to write to /usr/, so the alternative is to download the binary here and install it somewhere you can find it.1
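For concreteness, here is one way to do it. This is a sketch: the URL is the one the official Linux instructions pointed to at the time of writing (it may since have moved to a .tgz archive), and the destination simply matches the path used in the next command.
$ mkdir -p $HOME/local/ollama/bin
$ curl -L https://ollama.com/download/ollama-linux-amd64 -o $HOME/local/ollama/bin/ollama # adjust the URL if the official page now ships an archive instead of a bare binary
$ chmod +x $HOME/local/ollama/bin/ollama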
You should now be able to start ollama; as a check you can look at your version number
$HOME/local/ollama/bin/ollama --version
# Warning: could not connect to a running Ollama instance
# Warning: client version is 0.3.8
Details come from the official Linux installation webpage.
Setting up and downloading a model #
Right now this is just an executable, but we do not have any model available to us, so we are going to go through a few more setup steps.
Setting up #
First, models can be fairly large, and you might not want them to reside in your home partition, which has limited space.
We are going to change a few of the ollama environment variables to set up where the models will be downloaded to.
The official doc suggests using systemctl edit ollama.service, but I could not find a way to make this work (it modifies a file under /etc/systemd/, which is not writable for us).
I resort to setting environment variables by hand in bash; I found a list of the variables here.
The one environment variable of interest here is OLLAMA_MODELS (for now; we will use some of the others later).
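For reference, you can also export the variable once per shell session (or in your ~/.bashrc) instead of prefixing every command; this is plain bash, and the path is simply the scratch location we will settle on in the next section.
$ export OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama # every ollama command in this shell now picks it up
In the rest of the post I keep passing the variables inline so that each command is self-contained.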
Downloading and “running” a first model #
We are going to store the models on the scratch space on the server, which has very few limitations (no backup though, so be careful).
In my example I choose /scratch.global/$USERNAME/llm/ollama (to make sure it exists, simply create it with mkdir -p /scratch.global/$USERNAME/llm/ollama).
We are ready to start the ollama server instance; as we start the server we also pass our environment variable
$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve &
The &
makes the command run in the background.
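If you want to keep the server log around (it will come in handy for the GPU checks later), you can redirect it to a file of your choosing before backgrounding; the log path below is just an example.
$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve > /scratch.global/$USERNAME/llm/ollama-serve.log 2>&1 &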
First, you can check whether you have any models installed on your machine by running ollama list, but if this is your first time doing this, the list should be empty.
To run your first model, you need to download it first.
Luckily, this is automatic once you ask ollama to run a model as follows:
Pick a model of your choice from here and you are almost good to go; in this example I choose mistral-nemo
$ ollama run mistral-nemo
You should see it download the model before it starts running. Chances are (depending on your node configuration) that the model is either not able to run or excruciatingly slow. This is because we have not set up any GPU … that is the next section.
Before we move on, we still need to clean up as the server is still running in the background.
To do this we can simply find the id
of the ollama server and then kill it:
$ pgrep ollama
951128 # this is the id of the process
$ kill 951128
You can check whether the server is still running with ollama list (it will fail to connect once the server is gone).
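Equivalently, you can do the find-and-kill in one step:
$ pkill ollama # or: kill $(pgrep ollama)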
Running a model properly #
Requesting an appropriate node #
For the model to run somewhat smoothly, we need to provision some GPU resources on the cluster. This will be different on each HPC resource you have access to, but for us at MSI the command will be something like this:
salloc --time=2:00:00 --nodes=1 --ntasks=16 --partition=interactive-gpu --gres=gpu:a40:2 --tmp=12gb --mem=64g
This is a request for 2 hours of walltime, 1 node, 16 CPU cores, a node in one of the GPU-equipped partitions (here named interactive-gpu), 2 A40 GPUs, 12gb of tmp space, and 64gb of RAM.
Note that what matters here is the VRAM (not the CPU RAM); when we request a GPU allocation, we want it to be commensurate with the model we are interested in: broadly, the model needs to fit in the GPUs' VRAM.
To figure out the VRAM available to your GPU, there are multiple lists available (this
one for NVIDIA seems fine).
So in my case, I have requested 2 A40 GPUs, which adds up to 2 × 48gb = 96gb of VRAM!
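Once you are actually on the GPU node, you can also query the available VRAM directly with nvidia-smi (which ships with the NVIDIA driver):
$ nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv # one line per GPU with its total and free memory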
I assume that you are running the ollama server in the background (we will get back to this), such that you can run
$ ollama list
NAME ID SIZE MODIFIED
mistral-nemo:latest 994f3b8b7801 7.1 GB XX minutes ago
mistral-nemo is a small model, so it will fit within two A40s without any problem.
Of course the goal here is to use larger models, which will happen later.
Running ollama with a GPU #
When we start the ollama server, it should automatically recognize your GPU. It has happened to me in the past that the GPUs were not recognized, but there is an easy fix! To know whether your GPU has been recognized by the server, you can run the server in the foreground and look at the log. Here is what I have on my end:
$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve
XXXX # some log of the server as it starts
time=2024-09-21T11:24:05.564-05:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-21T11:24:06.618-05:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-955e21e4-b472-f85c-cb8b-ccad7509910f library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA A40" total="44.3 GiB" available="44.1 GiB"
time=2024-09-21T11:24:06.618-05:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ef5223e4-8d47-cd57-a1ec-d60640467243 library=cuda variant=v12 compute=8.6 driver=12.4 name="NVIDIA A40" total="44.3 GiB" available="44.1 GiB"
So it looks like it has found my GPUs.
If it has not, you will notice right away, as querying the LLM will be excruciatingly slow (low tokens/s). In this case, you have to hardwire the IDs of the GPUs into the environment variables; how do you find these IDs? There is a command for this in the case of NVIDIA GPUs:
$ nvidia-smi -L
GPU 0: NVIDIA A40 (UUID: GPU-955e21e4-b472-f85c-cb8b-ccad7509910f)
GPU 1: NVIDIA A40 (UUID: GPU-ef5223e4-8d47-cd57-a1ec-d60640467243)
Thus you can copy the two unique IDs (UUIDs) and pass them as environment variables when you start the server.
Kill the existing server and restart it with the proper environment variables (adding CUDA_VISIBLE_DEVICES):
$ OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama CUDA_VISIBLE_DEVICES=GPU-955e21e4-b472-f85c-cb8b-ccad7509910f,GPU-ef5223e4-8d47-cd57-a1ec-d60640467243 ollama serve
Then, from another shell, use ollama run mistral-nemo to run your model and you are good to go.
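To double-check that the model actually landed on the GPUs rather than the CPU, ollama also has a ps subcommand; if everything is set up correctly, its processor column should report something like 100% GPU (the exact wording may vary between versions).
$ ollama ps # lists the loaded models and whether they run on GPU or CPU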
Pulling weights and “compiling” them into an ollama model #
Now let's imagine you are interested in a specialized model that you have seen on Hugging Face, and you want to run it locally. I will take the example of a medium-sized model from Hugging Face. Pull the model (this requires having access to the Git Large File Storage utility, git-lfs; I install it from source and then add it to the path2):
$ cd /scratch.global/$USERNAME/llm/
# you will have to create an access token on hugging face to log-in
$ git clone https://huggingface.co/mistralai/Mistral-Small-Instruct-2409
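If the clone finishes suspiciously fast and the *.safetensors files are only a few hundred bytes each, git-lfs was probably not active during the clone; you can fetch the actual weights afterwards:
$ cd Mistral-Small-Instruct-2409
$ git lfs install # one-time, per-user setup
$ git lfs pull # downloads the large weight files referenced by the pointer files
$ du -sh . # sanity check: this model should be tens of GB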
Then we need to build the model into a form that is compatible with ollama (we will make sure the temporary files go to a directory we control rather than the default /tmp, where we only asked for 12gb of space). There are two steps (official doc): creating a Modelfile, and then creating the model from that Modelfile. We start by creating the Modelfile:
$ cd /scratch.global/$USERNAME/llm/
$ cat << EOF > Modelfile # unquoted EOF so that $USERNAME gets expanded on the FROM line below
FROM /scratch.global/$USERNAME/llm/Mistral-Small-Instruct-2409
PARAMETER temperature 1
SYSTEM """
You are an AI assistant designed to be helpful, harmless, and honest. Your primary goal is to engage in friendly conversation and assist users with a wide range of tasks and questions.
Guidelines:
1. Be polite and respectful at all times.
2. Provide clear and concise answers.
3. If you're unsure about something, admit it and offer to find more information.
4. Use appropriate language and maintain a friendly tone.
5. Respect user privacy and don't ask for personal information.
6. Offer follow-up questions or suggestions to keep the conversation flowing naturally.
Remember, you are an AI language model and should not pretend to have human experiences or emotions. If asked about your capabilities or limitations, be honest and straightforward.
"""
EOF
Make sure to adjust the path to match your own machine, and likewise the folder where the model was downloaded. Obviously you can also adjust the system prompt.
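As an aside: recent versions of ollama can also quantize full-precision weights at creation time, which shrinks the VRAM footprint considerably; check ollama create --help for a --quantize flag in your version. Treat the line below as a sketch rather than something I have tested here.
$ ollama create Mistral-Small-Instruct-2409-q4 -f Modelfile --quantize q4_K_M # assumes your ollama build exposes --quantize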
To create the model, we are going to first start the server with a few options to make sure the temporary files from setting up the model do not get written to the default /tmp but to another directory we control (I think the relevant variable is TMPDIR, but I set a few related ones to be safe).
On my end it looks like the following (environment variables first, then the ollama command):
Starting the server
$ mkdir -p /scratch.global/$USERNAME/llm/tmp # create a tmp dir to buffer model creation
$ TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_RUNNERS_DIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama serve
Creating the model (in a second shell, since the server above runs in the foreground)
$ cd /scratch.global/$USERNAME/llm # this is where we have created our Modelfile
$ TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_TMPDIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_RUNNERS_DIR=/scratch.global/$USERNAME/llm/tmp OLLAMA_MODELS=/scratch.global/$USERNAME/llm/ollama ollama create Mistral-Small-Instruct-2409
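If you would rather not rely on the current working directory, ollama create also accepts an explicit path to the Modelfile via -f (with the same environment variables as above):
$ ollama create Mistral-Small-Instruct-2409 -f /scratch.global/$USERNAME/llm/Modelfile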
This might take a while (especially given our large-scale example)… so be patient. This is the output I get upon success:
transferring model data 100%
converting model
creating new layer sha256:abe736818cb4883a504b56fe77001810d62c3e59074782316f4fe7f3b9f6b7e8
creating new layer sha256:74a694aa0d803257176eb4326cf255021afef1d8bff656e8a93998636eba2269
creating new layer sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182
creating new layer sha256:de7dae010413cc23ca780e05dafe5eb56e5919d87fe595304b45da3211bfe9b3
writing manifest
success
I can check that my model has been properly added by listing all the installed models:
$ ollama list
NAME ID SIZE MODIFIED
Mistral-Small-Instruct-2409:latest 42b2ae701882 44 GB 11 hours ago
mistral-nemo:latest 994f3b8b7801 7.1 GB 21 hours ago
Here it turns out that the model I downloaded did not quite fit in memory, so I had to go back and request an allocation with 4 A40 GPUs. Then I can simply load the model (do not forget to check that the server is running with all the environment variables set):
ollama run Mistral-Small-Instruct-2409
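Beyond the interactive prompt, the running server also exposes a local REST API (on port 11434 by default), which is handy for scripting jobs on the cluster; here is a minimal sketch of a non-streaming request:
$ curl -s http://localhost:11434/api/generate -d '{"model": "Mistral-Small-Instruct-2409", "prompt": "In one sentence, what is a scratch partition on an HPC cluster?", "stream": false}'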
Do not hesitate to reach out for suggestions or tweaks!