
Ollama is a front-end written in Go that wraps the llama.cpp back-end.

Here are the steps to set up Ollama in a k8s cluster.

(1) Write a k8s yaml file that:

uses the ollama docker image ( ollama/ollama ),

mounts persistent storage for the models ( mounted as /root/.ollama ),

and exposes ollama port 11434 as a service (load balancer).

A sketch of such a manifest is shown below.
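
This is a minimal sketch only, assuming your cluster has a default StorageClass; the names (comrite-web-ollama, ollama-models) and the 50Gi size are placeholders to adjust for your setup:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models            # placeholder name for the model storage claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi              # room for a few models; adjust as needed
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: comrite-web-ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama          # official image; runs "ollama serve" by default
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: models
          mountPath: /root/.ollama    # models stored here survive pod restarts
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  type: LoadBalancer                  # exposes port 11434 outside the cluster
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434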

(2) Log in to the ollama pod and pull the models:

kubectl exec -it comrite-web-ollama-8676669bb-jvx2z -- /bin/bash

nohup ollama pull llama2 &

nohup ollama pull codellama &

We do not need to run "ollama run llama2" inside the container, because the image already starts ollama serve, which loads a model on demand when a request for it comes in.
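
To confirm the pulls completed, you can list the locally available models from inside the same pod (ollama list is part of the standard ollama CLI):

ollama list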

(3) Test:

curl http://192.168.86.88:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

curl -X POST http://192.168.86.88:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write me a function that outputs the fibonacci sequence"
}'

You can pass a parameter to disable streaming:

"stream": false
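
For example, the same generate request with streaming disabled (reusing the host IP and model from the tests above), so the full response comes back as a single JSON object:

curl http://192.168.86.88:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'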
